Performance management
IBM
Note
Before using this information and the product it supports, read the information in “Notices” on page
443.
This edition applies to AIX Version 7.3 and to all subsequent releases and modifications until otherwise indicated in new
editions.
© Copyright International Business Machines Corporation 2021, 2024.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
Contents
Performance management.....................................................................................1
What's new................................................................................................................................................... 1
The basics of performance.......................................................................................................................... 1
System workload.................................................................................................................................... 1
Performance objectives..........................................................................................................................2
Program execution model...................................................................................................................... 3
Hardware hierarchy ............................................................................................................................... 3
Software hierarchy .................................................................................................................................5
System tuning......................................................................................................................................... 6
Performance tuning......................................................................................................................................7
The performance-tuning process ..........................................................................................................7
Performance benchmarking ................................................................................................................11
System performance monitoring...............................................................................................................12
Continuous system-performance monitoring advantages..................................................................12
Continuous system-performance monitoring with commands.......................................................... 13
Continuous system-performance monitoring with the topas command............................................15
Continuous system-performance monitoring using Performance Management (PM) service.......... 26
Initial performance diagnosis....................................................................................................................27
Types of reported performance problems ..........................................................................................27
Performance-Limiting Resource identification....................................................................................30
Workload management diagnosis....................................................................................................... 35
Resource management..............................................................................................................................35
Processor scheduler performance.......................................................................................................36
Virtual Memory Manager performance................................................................................................ 42
Fixed-disk storage management performance................................................................................... 49
Support for pinned memory ................................................................................................................ 50
Multiprocessing..........................................................................................................................................51
Symmetrical Multiprocessor concepts and architecture ................................................................... 51
SMP performance issues .....................................................................................................................57
SMP workloads .................................................................................................................................... 58
SMP thread scheduling ........................................................................................................................61
Thread tuning .......................................................................................................................................63
SMP tools ............................................................................................................................................. 69
Performance planning and implementation .............................................................................................71
Workload component identification.................................................................................................... 72
Performance requirements documentation........................................................................................ 72
Workload resource requirements estimation......................................................................................73
Efficient Program Design and Implementation................................................................................... 79
Performance-related installation guidelines ...................................................................................... 86
POWER4-based systems........................................................................................................................... 90
POWER4 performance enhancements................................................................................................ 90
POWER4-based systems scalability enhancements...........................................................................90
64-bit kernel......................................................................................................................................... 91
Enhanced Journaled File System.........................................................................................................92
Microprocessor performance.................................................................................................................... 92
Microprocessor performance monitoring............................................................................................ 92
Using the time command to measure microprocessor use ............................................................. 100
Microprocessor-intensive program identification............................................................................. 101
Using the pprof command to measure microprocessor usage of kernel threads ........................... 104
Detecting instruction emulation with the emstat tool...................................................................... 106
Detecting alignment exceptions with the alstat tool ........................................................................107
Restructuring executable programs with the fdpr program............................................................. 107
Controlling contention for the microprocessor................................................................................. 109
Microprocessor-efficient user ID administration with the mkpasswd command............................114
Memory performance.............................................................................................................................. 114
Memory usage.................................................................................................................................... 115
Memory-leaking programs ................................................................................................................127
Memory requirements assessment with the rmss command ......................................................... 128
VMM memory load control tuning with the schedo command ........................................................ 133
VMM page replacement tuning.......................................................................................................... 137
Page space allocation........................................................................................................................ 140
Paging-space thresholds tuning........................................................................................................ 141
Paging space garbage collection....................................................................................................... 143
Shared memory ................................................................................................................................. 145
AIX memory affinity support............................................................................................................. 147
Large pages........................................................................................................................................ 149
Multiple page size support.................................................................................................................152
VMM thread interrupt offload............................................................................................................ 161
Logical volume and disk I/O performance.............................................................................................. 161
Monitoring disk I/O............................................................................................................................ 162
LVM performance monitoring with the lvmstat command................................................................182
Logical volume attributes that affect performance...........................................................................184
LVM performance tuning with the lvmo command........................................................................... 187
Physical volume considerations ....................................................................................................... 188
Volume group recommendations ..................................................................................................... 188
Reorganizing logical volumes ............................................................................................................189
Tuning logical volume striping .......................................................................................................... 190
Using raw disk I/O ............................................................................................................................. 193
Using sync and fsync calls .................................................................................................................193
Setting SCSI-adapter and disk-device queue limits......................................................................... 193
Expanding the configuration ............................................................................................................. 194
Using RAID .........................................................................................................................................195
Fast write cache use...........................................................................................................................195
Fast I/O Failure for Fibre Channel devices........................................................................................ 196
Dynamic Tracking of Fibre Channel devices......................................................................................196
Fast I/O Failure and dynamic tracking interaction............................................................................ 199
Modular I/O..............................................................................................................................................200
Cautions and benefits........................................................................................................................ 200
MIO architecture................................................................................................................................ 200
I/O optimization and the pf module.................................................................................................. 201
MIO implementation.......................................................................................................................... 201
MIO environmental variables.............................................................................................................202
Module options definitions................................................................................................................ 204
Examples using MIO.......................................................................................................................... 208
File system performance......................................................................................................................... 214
File system types................................................................................................................................214
Potential performance inhibitors for JFS and Enhanced JFS........................................................... 218
File system performance enhancements.......................................................................................... 218
File system attributes that affect performance.................................................................................220
File system reorganization................................................................................................................. 222
File system performance tuning........................................................................................................ 224
File system logs and log logical volumes reorganization.................................................................. 232
Disk I/O pacing................................................................................................................................... 233
Network performance............................................................................................................................. 235
TCP and UDP performance tuning..................................................................................................... 235
Tuning mbuf pool performance .........................................................................................................266
ARP cache tuning............................................................................................................................... 268
Name resolution tuning......................................................................................................................270
Network performance analysis..........................................................................................................270
NFS performance.....................................................................................................................................299
Network File Systems........................................................................................................................ 299
NFS performance monitoring and tuning.......................................................................................... 304
NFS performance monitoring on the server...................................................................................... 310
NFS performance tuning on the server............................................................................................. 311
NFS performance monitoring on the client....................................................................................... 313
NFS tuning on the client.....................................................................................................................315
Cache file system............................................................................................................................... 319
NFS references .................................................................................................................................. 322
LPAR performance................................................................................................................................... 325
Performance considerations with logical partitioning...................................................................... 325
Workload management in a partition................................................................................................ 326
LPAR performance impacts............................................................................................................... 327
Microprocessors in a partition........................................................................................................... 328
Virtual processor management within a partition.............................................................................328
Application considerations................................................................................................................ 329
Dynamic logical partitioning....................................................................................................................331
DLPAR performance implications...................................................................................................... 331
DLPAR tuning tools.............................................................................................................................332
DLPAR guidelines for adding microprocessors or memory...............................................................332
Micro-Partitioning.................................................................................................................................... 333
Micro-Partitioning facts..................................................................................................................... 333
Implementation of Micro-Partitioning...............................................................................................333
Micro-Partitioning performance implications................................................................................... 334
Active Memory Expansion (AME)............................................................................................................ 334
Application Tuning................................................................................................................................... 344
Compiler optimization techniques ....................................................................................................344
Optimizing preprocessors for FORTRAN and C ................................................................................ 352
Code-optimization techniques ..........................................................................................................352
Java performance monitoring................................................................................................................. 354
Advantages of Java............................................................................................................................ 354
Java performance guidelines.............................................................................................................354
Java monitoring tools.........................................................................................................................355
Java tuning for AIX.............................................................................................................................355
Garbage collection impacts to Java performance.............................................................................356
Performance analysis with the trace facility...........................................................................................356
The trace facility in detail...................................................................................................................357
Trace facility use example................................................................................................................. 359
Starting and controlling trace from the command line .................................................................... 361
Starting and controlling trace from a program ................................................................................. 362
Using the trcrpt command to format a report .................................................................................. 362
Adding new trace events ...................................................................................................................364
Reporting performance problems........................................................................................................... 368
Measuring the baseline ..................................................................................................................... 368
What is a performance problem........................................................................................................ 369
Performance problem description .................................................................................................... 369
Reporting a performance problem ................................................................................................... 369
Monitoring and tuning commands and subroutines............................................................................... 371
Performance reporting and analysis commands ..............................................................................371
Performance tuning commands ........................................................................................................374
Performance-related subroutines .................................................................................................... 375
Efficient use of the ld command..............................................................................................................375
Rebindable executable programs ..................................................................................................... 376
Prebound subroutine libraries .......................................................................................................... 376
Accessing the processor timer................................................................................................................ 377
POWER-based-architecture-unique timer access ........................................................................... 378
Access to timer registers in PowerPC systems ................................................................................ 379
Second subroutine example.............................................................................................................. 379
Determining microprocessor speed........................................................................................................380
National language support: locale versus speed....................................................................................382
Programming considerations.............................................................................................................383
Some simplifying rules.......................................................................................................................383
Setting the locale............................................................................................................................... 384
Tunable parameters................................................................................................................................ 384
Environment variables ...................................................................................................................... 384
Kernel tunable parameters................................................................................................................ 407
Network tunable parameters.............................................................................................................421
Test case scenarios..................................................................................................................................427
Improving NFS client large file writing performance........................................................................ 427
Streamline security subroutines with password indexing................................................................ 428
BSR Shared Memory................................................................................................................................429
VMM fork policy....................................................................................................................................... 430
Data compression by using the zlibNX library........................................................................................ 431
Nest accelerators.....................................................................................................................................437
nx_config_query() subroutine............................................................................................................438
nx_get_exclusive_access() subroutine............................................................................................. 441
nx_rel_excl_access() subroutine.......................................................................................................442
Notices..............................................................................................................443
Privacy policy considerations.................................................................................................................. 444
Trademarks.............................................................................................................................................. 445
Index................................................................................................................ 447
About this document
This topic collection provides application programmers, customer engineers, system engineers, system
administrators, experienced end users, and system programmers with complete information about how
to perform such tasks as assessing and tuning the performance of processors, file systems, memory,
disk I/O, Network File System (NFS), Java, and communications I/O. This topic collection also discusses
efficient system and application design, including their implementation. This topic collection is also
available on the documentation CD that is included with the operating system.
Highlighting
The following highlighting conventions are used in this document:
Bold
    Identifies commands, subroutines, keywords, files, structures, directories, and other
    items whose names are predefined by the system. Also identifies graphical objects
    such as buttons, labels, and icons that the user selects.
Italics
    Identifies parameters whose actual names or values are to be supplied by the user.
Monospace
    Identifies examples of specific data values, examples of text similar to what you
    might see displayed, examples of portions of program code similar to what you
    might write as a programmer, messages from the system, or information you should
    actually type.
Case-sensitivity in AIX
Everything in the AIX® operating system is case-sensitive, which means that it distinguishes between
uppercase and lowercase letters. For example, you can use the ls command to list files. If you type LS,
the system responds that the command is not found. Likewise, FILEA, FiLea, and filea are three
distinct file names, even if they reside in the same directory. To avoid causing undesirable actions to be
performed, always ensure that you use the correct case.
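For example, in a shell session (the error message shown is illustrative and varies by shell):

# ls filea
filea
# LS filea
ksh: LS: not found.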
ISO 9000
ISO 9000 registered quality systems were used in the development and manufacturing of this product.
November 2022
• Added new information about the fast_lnk_recon tunable, which enables or disables Fast I/O Failure for FC
link-down functionality, in the “Disk and disk adapter tunable parameters” on page 417 topic.
December 2021
• Updated information about support for 64 KB pages in “Active Memory Expansion (AME)” on page 334.
AME supports enablement of 64 KB pages by default starting with Power10 processor-based servers.
• Updated information about the Fibre Channel Adapter Outstanding-Requests Limit in the “Disk and disk
adapter tunable parameters” on page 417 topic.
• Starting from AIX 7.3, the M:N thread model is no longer supported. Updated the “Thread tuning ” on
page 63 topic.
System workload
An accurate and complete definition of a system's workload is critical to predicting or understanding its
performance.
A difference in workload can cause far more variation in the measured performance of a system than
differences in CPU clock speed or random access memory (RAM) size. The workload definition must
include not only the type and rate of requests sent to the system, but also the exact software packages
and in-house application programs to be executed.
It is important to include the work that a system is doing in the background. For example, if a system
contains file systems that are NFS-mounted and frequently accessed by other systems, handling those
accesses is probably a significant fraction of the overall workload, even though the system is not officially
a server.
Performance objectives
After defining the workload that your system will have to process, you can choose performance criteria
and set performance objectives based on those criteria.
The overall performance criteria of computer systems are response time and throughput.
Response time is the elapsed time between when a request is submitted and when the response from that
request is returned. Examples include:
• The amount of time a database query takes
• The amount of time it takes to echo characters to the terminal
• The amount of time it takes to access a Web page
Throughput is a measure of the amount of work that can be accomplished over some unit of time.
Examples include:
• Database transactions per minute
• Kilobytes of a file transferred per second
• Kilobytes of a file read or written per second
• Web server hits per minute
The relationship between these metrics is complex. Sometimes you can have higher throughput at the
cost of response time or better response time at the cost of throughput. In other situations, a single
change can improve both. Acceptable performance is based on reasonable throughput combined with
reasonable response time.
In planning for or tuning any system, make sure that you have clear objectives for both response time
and throughput when processing the specified workload. Otherwise, you risk spending analysis time and
resource dollars improving an aspect of system performance that is of secondary importance.
Program execution model
To run, a program must make its way up both the hardware and operating-system hierarchies in parallel.
Each element in the hardware hierarchy is more scarce and more expensive than the element below it.
Not only does the program have to contend with other programs for each resource, the transition from
one level to the next takes time. To understand the dynamics of program execution, you need a basic
understanding of each of the levels in the hierarchy.
Hardware hierarchy
Usually, the time required to move from one hardware level to another consists primarily of the latency of
the lower level (the time from the issuing of a request to the receipt of the first data).
Fixed disks
The slowest operation for a running program on a standalone system is obtaining code or data from a disk,
for the following reasons:
• The disk controller must be directed to access the specified blocks (queuing delay).
• The disk arm must seek to the correct cylinder (seek latency).
• The read/write heads must wait until the correct block rotates under them (rotational latency).
• The data must be transmitted to the controller (transmission time) and then conveyed to the application
program (interrupt-handling time).
Slow disk operations can have many causes besides explicit read or write requests in the program.
System-tuning activities frequently prove to be hunts for unnecessary disk I/O.
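As a rough illustration, consider a hypothetical disk with a 10,000 RPM spindle, a 5 ms average seek,
a 2 ms queuing delay, and a 0.1 ms transfer time. The components of a single random access might add
up as follows:

queuing delay          ~ 2.0 ms
seek latency           ~ 5.0 ms
rotational latency     ~ 3.0 ms   (half a revolution: 60 s / 10000 / 2 = 3 ms)
transmission time      ~ 0.1 ms
                       ---------
total                  ~ 10 ms per random access

At roughly 10 ms per access, a single disk access costs millions of processor cycles, which is why
eliminating unnecessary disk I/O pays off so quickly.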
Real memory
Real memory, often referred to as Random Access Memory, or RAM, is faster than disk, but much more
expensive per byte. Operating systems try to keep in RAM only the code and data that are currently in use,
storing any excess onto disk, or never bringing them into RAM in the first place.
Relative to the processor, however, RAM is slow. Typically, a latency of dozens of processor cycles
occurs between the time the hardware recognizes the need for a RAM access and the time the data or
instruction is available to the processor.
If the access is to a page of virtual memory that has been paged out to disk, or has not been brought in
yet, a page fault occurs, and the execution of the program is suspended until the page has been read from
disk.
Caches
To minimize the number of times the program has to experience the RAM latency, systems incorporate
caches for instructions and data. If the required instruction or data is already in the cache, a cache hit
results and the instruction or data is available to the processor on the next cycle with no delay. Otherwise,
a cache miss occurs with RAM latency.
In some systems, there are two or three levels of cache, usually called L1, L2, and L3. If a particular
storage reference results in an L1 miss, then L2 is checked. If L2 generates a miss, then the reference
goes to the next level, either L3, if it is present, or RAM.
Cache sizes and structures vary by model, but the principles of using them efficiently are identical.
Executable programs
When you request a program to run, the operating system performs a number of operations to transform
the executable program on disk to a running program.
First, the directories in your current PATH environment variable must be scanned to find the correct
copy of the program. Then, the system loader (not to be confused with the ld command, which is the
binder) must resolve any external references from the program to shared libraries.
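For example, you can display the search order and confirm which copy of a program would be found. The
PATH value shown is a typical default and varies by system:

# echo $PATH
/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin
# whence vmstat
/usr/bin/vmstat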
To represent your request, the operating system creates a process, or a set of resources, such as a private
virtual address segment, which is required by any running program.
The operating system also automatically creates a single thread within that process. A thread is the
current execution state of a single instance of a program. In AIX, access to the processor and other
resources is allocated on a thread basis, rather than a process basis. Multiple threads can be created
within a process by the application program. Those threads share the resources owned by the process
within which they are running.
Finally, the system branches to the entry point of the program. If the program page that contains the entry
point is not already in memory (as it might be if the program had been recently compiled, executed, or
copied), the resulting page-fault interrupt causes the page to be read from its backing storage.
Interrupt handlers
The mechanism for notifying the operating system that an external event has taken place is to interrupt
the currently running thread and transfer control to an interrupt handler.
Before the interrupt handler can run, enough of the hardware state must be saved to ensure that the
system can restore the context of the thread after interrupt handling is complete. Newly invoked interrupt
handlers experience all of the delays of moving up the hardware hierarchy (except page faults). Unless the
interrupt handler was run very recently (or the intervening programs were very economical), it is unlikely
that any of its code or data remains in the TLBs or the caches.
When the interrupted thread is dispatched again, its execution context (such as register contents) is
logically restored, so that it functions correctly. However, the contents of the TLBs and caches must
be reconstructed on the basis of the program's subsequent demands. Thus, both the interrupt handler
and the interrupted thread can experience significant cache-miss and TLB-miss delays as a result of the
interrupt.
Waiting threads
Whenever an executing program makes a request that cannot be satisfied immediately, such as a
synchronous I/O operation (either explicit or as the result of a page fault), that thread is put in a waiting
state until the request is complete.
Normally, this results in another set of TLB and cache latencies, in addition to the time required for the
request itself.
Dispatchable threads
When a thread is dispatchable but not running, it is accomplishing nothing useful. Worse, other threads
that are running may cause the thread's cache lines to be reused and real memory pages to be reclaimed,
resulting in even more delays when the thread is finally dispatched.
System tuning
After efficiently implementing application programs, further improvements in the overall performance of
your system becomes a matter of system tuning.
The main components that are subject to system-level tuning are:
Communications I/O
Depending on the type of workload and the type of communications link, it might be necessary to tune
one or more of the following communications device drivers: TCP/IP, or NFS.
Fixed Disk
The Logical Volume Manager (LVM) controls the placement of file systems and paging spaces on
the disk, which can significantly affect the amount of seek latency the system experiences. The disk
device drivers control the order in which I/O requests are acted upon.
Real Memory
The Virtual Memory Manager (VMM) controls the pool of free real-memory frames and determines
when and from where to steal frames to replenish the pool.
Running Thread
The scheduler determines which dispatchable entity should next receive control. In AIX, the
dispatchable entity is a thread. See “Thread support ” on page 36.
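Each of these components has a corresponding tunable-management command. For example, the following
commands list the tunables in each area on current AIX levels:

# no -a           Network options, including TCP/IP tunables
# nfso -a         NFS options
# ioo -L          I/O tunables, including LVM and file system tunables
# vmo -L          Virtual Memory Manager tunables
# schedo -L       Processor scheduler tunables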
You must weigh the greater authenticity of results from a production environment against the flexibility of the nonproduction environment, where the
analyst can perform experiments that risk performance degradation or worse.
The most valuable aspect of quantifying the objectives is not selecting numbers to be achieved, but
making a public decision about the relative importance of (usually) multiple objectives. Unless these
priorities are set in advance, and understood by everyone concerned, the analyst cannot make trade-off
decisions without incessant consultation. The analyst is also apt to be surprised by the reaction of
users or management to aspects of performance that have been ignored. If the support and use of the
system crosses organizational boundaries, you might need a written service-level agreement between
the providers and the users to ensure that there is a clear common understanding of the performance
objectives and priorities.
CPU
• Processor time slice
• CPU entitlement or Micro-Partitioning
• Virtual Ethernet
Memory
• Page frames
• Stacks
• Buffers
• Queues
• Tables
Disk space
• Logical volumes
• File systems
• Files
• Logical partitions
• Virtual SCSI
Network access
• Sessions
• Packets
• Channels
• Shared Ethernet
It is important to be aware of logical and virtual resources as well as real resources. Threads can be
blocked by a lack of logical resources just as for a lack of real resources, and expanding the underlying
real resource does not necessarily ensure that additional logical resources will be created. For example,
the NFS server daemon, or nfsd daemon on the server is required to handle each pending NFS remote
I/O request. The number of nfsd daemons therefore limits the number of NFS I/O operations that
can be in progress simultaneously. When a shortage of nfsd daemons exists, system instrumentation
might indicate that various real resources, like the CPU, are used only slightly. You might have the false
impression that your system is under-used and slow, when in fact you have a shortage of nfsd daemons
which constrains the rest of the resources. An nfsd daemon uses processor cycles and memory, but you
cannot fix this problem simply by adding real memory or upgrading to a faster CPU. The solution is to
create more of the logical resource, the nfsd daemons.
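For example, on an NFS server you might confirm that the nfsd subsystem is active and, if the daemons
are the constraint, create more of them with the chnfs command. The value 16 is illustrative:

# lssrc -s nfsd
# chnfs -n 16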
Logical resources and bottlenecks can be created inadvertently during application development. A
method of passing data or controlling a device may, in effect, create a logical resource. When such
resources are created by accident, there are generally no tools to monitor their use and no interface to
control their allocation. Their existence may not be appreciated until a specific performance problem
highlights their importance.
Structuring for parallel use of resources
Because workloads require multiple system resources to run, take advantage of the fact that the
resources are separate and can be consumed in parallel.
For example, the operating system read-ahead algorithm detects the fact that a program is accessing a
file sequentially and schedules additional sequential reads to be done in parallel with the application's
processing of the previous data. Parallelism applies to system management as well. For example, if an
application accesses two or more files at the same time, adding an additional disk drive might improve the
disk-I/O rate if the files that are accessed at the same time are placed on different drives.
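For example, the lslv command shows which physical volumes hold a logical volume, so you can verify
that two concurrently accessed files do not share a drive. The logical volume name and output are
hypothetical:

# lslv -l datalv
datalv:/data
PV                COPIES        IN BAND       DISTRIBUTION
hdisk0            020:000:000   100%          000:020:000:000:000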
When we measure the elapsed (wall-clock) time required to process a system call, we get a number that
consists of the following:
• The actual time during which the instructions to perform the service were executing
• Varying amounts of time during which the processor was stalled while waiting for instructions or data
from memory (that is, the cost of cache and TLB misses)
• The time required to access the clock at the beginning and end of the call
• Time consumed by periodic events, such as system timer interrupts
• Time consumed by more or less random events, such as I/O interrupts
To avoid reporting an inaccurate number, we normally measure the workload a number of times. Because
all of the extraneous factors add to the actual processing time, the typical set of measurements has a
curve of the form shown in the following illustration.
The extreme low end may represent a low-probability optimum caching situation or may be a rounding
effect.
A regularly recurring extraneous event might give the curve a bimodal form (two maxima), as shown in the
following illustration.
Figure 4. Bimodal Curve
One or two time-consuming interrupts might skew the curve even further, as shown in the following
illustration:
The distribution of the measurements about the actual value is not random, and the classic tests of
inferential statistics can be applied only with great caution. Also, depending on the purpose of the
measurement, it may be that neither the mean nor the actual value is an appropriate characterization of
performance.
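A simple way to collect such a set of measurements is to run the workload repeatedly and record the
elapsed time of each run, for example with the timex command. Here ./myprog is a placeholder for your
workload:

# for n in 1 2 3 4 5
> do
> timex ./myprog > /dev/null
> done

The real, user, and sys values reported for each run can then be examined for the kind of distribution
described above.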
# vmstat 5 2
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
1 1 197167 477552 0 0 0 7 21 0 106 1114 451 0 0 99 0
0 0 197178 477541 0 0 0 0 0 0 443 1123 442 0 0 99 0
Remember that the first report from the vmstat command displays cumulative activity since the last
system boot. The second report shows activity for the first 5-second interval.
For detailed discussions of the vmstat command, see “vmstat command” on page 92, “Memory usage
determination with the vmstat command” on page 115, and “Assessing disk performance with the vmstat
command ” on page 166.
To enable disk I/O history, from the command line enter smit chgsys and then select true from the
Continuously maintain DISK I/O history field.
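Equivalently, assuming the iostat attribute of sys0 is available at your AIX level, you can enable disk
I/O history from the command line:

# chdev -l sys0 -a iostat=true
sys0 changed
# lsattr -E -l sys0 -a iostat
iostat true Continuously maintain DISK I/O history True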
The following sample report is displayed when you run the iostat command:
# iostat 5 2
The interval disk I/O statistics are unaffected by the disk I/O history setting.
The first report from the iostat command shows cumulative activity since the last reset of the disk
activity counters. The second report shows activity for the first 5-second interval.
Related concepts
The iostat command
The iostat command is the fastest way to get a first impression of whether or not the system has a disk
I/O-bound performance problem.
Related tasks
Assessing disk performance with the iostat command
Begin the assessment by running the iostat command with an interval parameter during your system's
peak workload period or while running a critical application for which you need to minimize I/O delays.
# netstat -I en0 5
input (en0) output input (Total) output
packets errs packets errs colls packets errs packets errs colls
8305067 0 7784711 0 0 20731867 0 20211853 0 0
3 0 1 0 0 7 0 5 0 0
24 0 127 0 0 28 0 131 0 0
CTRL C
Remember that the first report from the netstat command shows cumulative activity since the last
system boot. The second report shows activity for the first 5-second interval.
Other useful netstat command options are -s and -v. For details, see “netstat command ” on page 273.
# sar -P ALL 5 2

          cpu    %usr    %sys    %wio   %idle
Average     0       2       0       0      98
            1       0       0       0     100
            2       0       0       0      99
            3       1       0       0      99
            -       1       0       0      99
The sar command does not report the cumulative activity since the last system boot.
hdisk0 0.0 0.0 0.0 0.0 0.0 PgspIn 0 % Noncomp 15.8
hdisk1 0.0 0.0 0.0 0.0 0.0 PgspOut 0 % Client 14.7
PageIn 0
WLM-Class (Active) CPU% Mem% Disk-I/O% PageOut 0 PAGING SPACE
System 0 0 0 Sios 0 Size,MB 512
Shared 0 0 0 % Used 1.2
Default 0 0 0 NFS (calls/sec) % Free 98.7
Name PID CPU% PgSp Class 0 ServerV2 0
topas 10442 3.0 0.8 System ClientV2 0 Press:
ksh 13438 0.0 0.4 System ServerV3 0 "h" for help
gil 1548 0.0 0.0 System ClientV3 0 "q" to quit
Except for the variable Processes subsection, you can sort all of the subsections by any column by
moving the cursor to the top of the desired column. All of the variable subsections, except the Processes
subsection, have the following views:
• List of top resource users
• One-line report presenting the sum of the activity
For example, the one-line-report view might show just the total disk or network throughput.
For the CPU subsection, you can select either the list of busy processors or the global CPU utilization, as
shown in the above example.
Topas Monitor for host: aixhost Interval: 2 Wed Feb 4 11:24:05 2004
==============================================================================
DATA TEXT PAGE PGFAULTS
USER PID PPID PRI NI RES RES SPACE TIME CPU% I/O OTH COMMAND
root 1 0 60 20 202 9 202 0:04 0.0 0 0 init
root 774 0 17 41 4 0 4 0:00 0.0 0 0 reaper
root 1032 0 60 41 4 0 4 0:00 0.0 0 0 xmgc
root 1290 0 36 41 4 0 4 0:01 0.0 0 0 netm
root 1548 0 37 41 17 0 17 1:24 0.0 0 0 gil
root 1806 0 16 41 4 0 4 0:00 0.0 0 0 wlmsched
root 2494 0 60 20 4 0 4 0:00 0.0 0 0 rtcmd
root 2676 1 60 20 91 10 91 0:00 0.0 0 0 cron
root 2940 1 60 20 171 22 171 0:00 0.0 0 0 errdemon
root 3186 0 60 20 4 0 4 0:00 0.0 0 0 kbiod
Topas Monitor for host: aixcomm Interval: 2 Fri Jan 13 18:00:16 XXXX
===============================================================================
Disk Busy% KBPS TPS KB-R ART MRT KB-W AWT MWT AQW AQD
hdisk0 3.0 56.0 3.5 0.0 0.0 5.4 56.0 5.8 33.2 0.0 0.0
cd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Partitions can be sorted by any column except Host, OS, and M, by moving the cursor to the top of the
appropriate column.
ptoolsl1.austin.ibm.com
9.3.41.206
...
Select the Add Host to topas external subnet search file (Rsi.hosts) option to add hosts to the Rsi.hosts
file. Select the List hosts in topas external subnet search file (Rsi.hosts) option to see the list of hosts
in the Rsi.hosts file.
Persistent recording
Persistent recordings are those recordings that are started from SMIT with the option to specify the cut
and retention. You can specify the number of days of recording to be stored per recording file (cut) and
the number of days of recording to be retained (retention) before a file can be deleted. No more than one
persistent recording of the same type (CEC or local) can run on a system at a time. When
instance of Persistent recording of the same type (CEC or local) recording can be run in a system. When
a Persistent recording is started, the recording command will be invoked with user-specified options. The
same set of command line options used by this persistent recording will be added to inittab entries.
This will ensure that the recording is started automatically on reboot or restart of the system.
Consider a system that is already running a Persistent local recording (binary or nmon recording format).
If you want to start a new Persistent recording of local binary recording, the existing persistent recording
must be stopped first using the Stop Persistent Recording option available under the Stop Recording
option. Then a new persistent local recording must be started from Start Persistent local recording
option. Starting Persistent recording will fail if a persistent recording of the same recording format is
already running in the system. Because Persistent recording adds inittab entries, only privileged users
are allowed to start Persistent recording.
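You can verify that a persistent recording has been registered by inspecting the inittab entries. The entry
shown is illustrative; the identifier and options depend on what was specified in SMIT:

# lsitab -a | grep topasrec
topasrec:2:once:/usr/bin/topasrec ...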
For example, if the number of days to store per file is n, then a single file will contain a maximum of n days
of recording. If the recording exceeds n days, then a new file will be created and all of the subsequent
recordings will be stored in the new file. If the number of days to store per file is 0, the recording will be
written to only one file. If the number of days to retain is m, then the system will retain the recording file
that has data recorded within the last m days. Recording files generated by the same recording instance
of the topasrec command that have recorded data earlier than m days will be deleted.
The default value for number of days to store per file is 1.
The default value for number of days to retain is 7.
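With these defaults (1 day per file, 7 days retained), a host named ptoolsl1 that has been recording for
several weeks might show files similar to the following. The directory and dates are illustrative:

# ls /etc/perf/daily
ptoolsl1_240107.topas  ptoolsl1_240110.topas  ptoolsl1_240113.topas
ptoolsl1_240108.topas  ptoolsl1_240111.topas
ptoolsl1_240109.topas  ptoolsl1_240112.topas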
The SMIT Start Recording menu displays:
Start Recording
Move cursor to desired item and press Enter.
binary
nmon
If you select the binary recording type, the following considerations apply.
The recording interval (in seconds) should be a multiple of 60. If the recording type is local binary
recording, then the user has an option to enable the IBM Workload Estimator (WLE) report generation in
the SMIT screen. The WLE report is generated only on Sundays at 00:45 and requires local binary
recording to always be enabled for consistent data in the report. The data in the weekly report is correct
only if the local recordings are always enabled.
The generated WLE report is stored in the /etc/perf/<hostname>_aixwle_weekly.xml file. For example,
if the hostname is ptoolsl1, the weekly report is written to the /etc/perf/ptoolsl1_aixwle_weekly.xml
file.
If the filesystem threshold is enabled, the topasrec command pauses the recording when the file
system used space crosses the maximum percentage of the specified filesystem threshold. It will then
write an error log entry. The topasrec command resumes the recording when the file system used space
goes back to less than the minimum percentage of the file system threshold.
For additional information, refer to:
• “Persistent recording” on page 19
• available nmon filters
Type of Recording
binary
nmon
Length of Recording
day
hour
custom
For day or hour recording, recording interval and number of samples are not editable. For custom
recording, recording interval and number of samples are editable. Recording interval should be a multiple
of 60. The use of custom recording is to collect only the specified number of samples at the specified
interval and exit recording. If the number of samples is specified as zero, then the recording will be
continuously running until stopped.
The preloaded values shown in the screen are the default values.
For more information, see “NMON Recording” later in this section.
Length of Recording
day
hour
custom
For day or hour recording, recording intervals and number of samples are not editable.
For custom recording, recording intervals and number of samples are editable and recording intervals
should be a multiple of 60. The use of custom recording is to collect only the specified number of
samples at the specified interval and exit recording. If the number of samples is specified as zero then the
recording will be continuously running until stopped.
NMON Recording
NMON comes with recording filters that help you customize the NMON recording. You can select and
deselect the following sections of the NMON recording:
• JFS
• RAW kernel and LPAR
• volume group
• paging space
• MEMPAGES
• NFS
• WLM
• Large Page
• Shared Ethernet (for VIOS)
• Process
• Asynchronous I/O
Note: Disks per line, disk group file, and desired disks are applicable options only if the disk configuration
section is included in the recording. The process filter and process threshold options are applicable only if
the processes list is included in the recording.
Process and Disk filters will be automatically loaded with the filter options used for the last recording by
the same user. You can specify that the external command be invoked at the start or end of the NMON
recording in an External data collector start or end program. If you want the external command to be
invoked periodically to record metrics, it can be specified at the External data collector snap program. The
nmon command provides more details on using external commands for a NMON recording.
Naming Convention
Recorded files will be stored in specified files as shown in the following:
• Given a file name that contains the directory and a file name prefix, the output file for a single file
recording is:
Style                    Files
Local Nmon Style:        <filename>_YYMMDD_HHMM.nmon
Local Topas Style:       <filename>_YYMMDD_HHMM.topas
Topas Style CEC:         <filename>_CEC_YYMMDD_HHMM.topas
• Given a file name that contains the directory and a file name prefix, the output file for multiple file
recordings (cut and retention) is:
Style                    Files
Local Nmon Style:        <filename>_YYMMDD.nmon
Local Topas Style:       <filename>_YYMMDD.topas
Topas Style CEC:         <filename>_CEC_YYMMDD.topas
• Given a file name that contains the directory and no file name prefix, the output file for a single file
recording is:
Style                    Files
Local Nmon Style:        <filename/hostname>_YYMMDD_HHMM.nmon
Local Topas Style:       <filename/hostname>_YYMMDD_HHMM.topas
Topas Style CEC:         <filename/hostname>_CEC_YYMMDD_HHMM.topas
• Given a file name that contains the directory and no file name prefix, the output files for multiple file
recordings (cut and retention) is:
Style                    Files
Local Nmon Style:        <filename/hostname>_YYMMDD.nmon
Local Topas Style:       <filename/hostname>_YYMMDD.topas
Topas Style CEC:         <filename/hostname>_CEC_YYMMDD.topas
Two recordings of the same recording format and with the same filename parameter values (default
or user-specified filename) cannot be started simultaneously, because the two recording processes would
write to the same recording file.
Examples:
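For example, with a local nmon recording, an output path of /etc/perf, no file name prefix, and a host
named aixhost (all values illustrative), a single-file recording started at 6:00 p.m. on 13 January 2024 is
written to:

/etc/perf/aixhost_240113_1800.nmon

A local binary recording started at the same time is written to /etc/perf/aixhost_240113_1800.topas.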
Stop Recording
Use the Stop Recording option to stop a currently running recording. You can select one particular running
recording from the list and stop it.
From the menu, you must select the type of recording to stop. After selecting the type of recording, the
currently running recording will be listed on the menu. You can then select a recording to be stopped.
Following is the screen for selecting the type of recording to stop:
Stop Recording
Note: The recording can only be stopped if you have the requisite permission to stop the recording
process.
Type of
Recording
persistent
binary
nmon
cec
all
This lists the Format, Start time, and Output path of the active recordings.
The output path of all persistent recordings will be prefixed by the asterisk (*). For persistent local
binary recording with WLE enabled, the output path will be prefixed by the number sign (#).
1. Enter the path of the recording. This is the path used to locate the recording file.
2. Select the type of recording to be used.
persistent
binary
nmon
cec
all
This lists the Recording Type, Start time, and Stop time of the completed recordings in the
specified path.
The report generation panels prompt for the report destination (1 Filename or 2 Printer) and for the
following entry fields: Type of Recording, Reporting Format, Type (default mean), Recording File name,
and Output File.
Note: The Output file field is mandatory for the comma separated, spreadsheet, and nmon reporting
formats, and optional for all other reporting formats. The topas recordings support only the mean type
for the comma separated and spreadsheet reporting formats.
The following is an example of a summary/disk summary/detailed/network summary report type:
* Type of Recording []
* Reporting Format []
For all the above examples, the first two fields are non-modifiable and filled with values from the previous
selections.
If printer is selected as the report output, the Output File field is replaced with the required Printer
Name field from a list of printers configured in the system:
* Printer Name [ ] +
[Entry Fields]
* Enter the Date [YYYYMMDD] [] #
– If a user wants to re-transmit the PM Data dated 12th Feb, 2009, enter 20090212 in the text box.
– Enter 0 to transmit all the available recorded PM data files.
• After the date has been entered, the following manual steps for sending the data to IBM by using ESA or
HMC are displayed:
1. Log in to the HMC.
2. Select Service Management.
3. Select Transmit Service Information.
4. Click the second Send button on the page, labeled "To transmit the performance management information
immediately, click Send."
5. Check the Console Events log for results.
4. Select the Performance Management checkbox.
5. Click OK.
6. Check the Activity Log for results.
WLE
The WLE panel contains two fields: WLE Collection and WLE input type.
• WLE Collection
Use WLE Collection to enable or disable the creation of reports to be used as input to WLE. By default,
the field shows the current state of WLE Collection. Collection must be disabled before you can stop an
associated recording.
• WLE input type
Use WLE input type to decide if WLE reports should be based on the currently running local binary
recording or the local nmon recording. Note that this is applicable only for persistent recordings.
Everything runs slowly at a particular time of day
There are several reasons why the system may slow down at certain times of the day.
Most people have experienced the rush-hour slowdown that occurs because a large number of people in
the organization habitually use the system at one or more particular times each day. This phenomenon
is not always simply due to a concentration of load. Sometimes it is an indication of an imbalance that
is only a problem when the load is high. Other sources of recurring situations in the system should be
considered.
• If you run the iostat and netstat commands for a period that spans the time of the slowdown, or if
you have previously captured data from your monitoring mechanism, are some disks much more heavily
used than others? Is the CPU idle percentage consistently near zero? Is the number of packets sent or
received unusually high?
– If the disks are unbalanced, see “Logical volume and disk I/O performance” on page 161.
– If the CPU is saturated, use the ps or topas commands to identify the programs being run during this
period. The sample script given in “Continuous system-performance monitoring with commands” on
page 13 simplifies the search for the heaviest CPU users.
– If the slowdown is counter-intuitive, such as paralysis during lunch time, look for a pathological
program such as a graphic xlock or game program. Some versions of the xlock program are known
to use huge amounts of CPU time to display graphic patterns on an idle display. It is also possible that
someone is running a program that is a known CPU burner and is trying to run it at the least intrusive
time.
• Unless your /var/adm/cron/cron.allow file is null, you may want to check the contents of
the /var/adm/cron/crontab directory for expensive operations.
If you find that the problem stems from conflict between foreground activity and long-running, CPU-
intensive programs that are, or should be, run in the background, consider changing the way priorities are
calculated using the schedo command to give the foreground higher priority. See “Thread-Priority-Value
calculation” on page 111.
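As a hedged sketch (the tunable names sched_R and sched_D are assumed to be available at your AIX level;
verify with schedo -a), the priority-calculation parameters can be displayed and changed as follows:
# schedo -o sched_R -o sched_D
# schedo -o sched_R=20
A larger sched_R value increases the penalty applied to threads with high recent processor usage, which
tends to make long-running CPU-intensive background threads less favored relative to foreground work.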
Then run the same commands under a user ID that is not experiencing performance problems. Is there
a difference in the reported real time?
• A program should not show much CPU time (user+sys) difference from run to run, but may show a real
time difference because of more or slower I/O. Are the user's files on an NFS-mounted directory? Or on
a disk that has high activity for other reasons?
• “File system performance” on page 214
• “Network performance analysis” on page 270
• “NFS performance monitoring and tuning” on page 304
To display the current value of the AUTHSTATE environment variable, type the following at the command
line:
# echo $AUTHSTATE
If you want to ensure that you are using a local authentication mechanism first and then the network-
based authentication mechanism, like DCE for example, type the following at the command line:
# export AUTHSTATE="compat,DCE"
# vmstat 5
In the example above, because there is no count specified following the interval, reporting continues until
you cancel the command.
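To collect a fixed number of reports instead, specify a count after the interval; for example, the
following collects twelve 5-second reports and then exits:
# vmstat 5 12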
The following vmstat report was created on a system running AIXwindows and several synthetic
applications (some low-activity intervals have been removed for example purposes):
# iostat 5 3
The following iostat report was created on a system running the same workload as the one in the
vmstat example above, but at a different time. The first report represents the cumulative activity
since the preceding boot, while subsequent reports represent the activity during the preceding 5-second
interval:
The first report shows that the I/O on this system is unbalanced. Most of the I/O (86.9 percent of
kilobytes read and 90.7 percent of kilobytes written) goes to hdisk2, which contains both the operating
system and the paging space. The cumulative CPU utilization since boot statistic is usually meaningless,
unless you use the system consistently, 24 hours a day.
The second report shows a small amount of disk activity reading from hdisk0, which contains a separate
file system for the system's primary user. The CPU activity arises from two application programs and the
iostat command itself.
In the third report, you can see that we artificially created a near-thrashing condition by running a
program that allocates and stores a large amount of memory, which is about 26 MB in the above example.
Also in the above example, hdisk2 is active 98.4 percent of the time, which results in 93.8 percent I/O
wait.
The limiting factor for a single program
If you are the sole user of a system, you can get a general idea of whether a program is I/O or CPU
dependent by using the time command as follows:
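For example (the file names are the ones used later in this discussion), time a copy of a small file:
# time cp foo.in foo.out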
real 0m0.13s
user 0m0.01s
sys 0m0.02s
Note: Examples of the time command use the version that is built into the Korn shell, ksh. The official
time command, /usr/bin/time, reports with a lower precision.
In the above example, the fact that the real elapsed time for the execution of the cp program (0.13
seconds) is significantly greater than the sum (.03 seconds) of the user and system CPU times indicates
that the program is I/O bound. This occurs primarily because the foo.in file has not been read recently.
On an SMP, the output takes on a new meaning. See “Considerations of the time and timex commands ”
on page 101 for more information.
Running the same command a few seconds later against the same file gives the following output:
real 0m0.06s
user 0m0.01s
sys 0m0.03s
Most or all of the pages of the foo.in file are still in memory because there has been no intervening
process to cause them to be reclaimed and because the file is small compared with the amount of RAM
on the system. A small foo.out file would also be buffered in memory, and a program using it as input
would show little disk dependency.
If you are trying to determine the disk dependency of a program, you must be sure that its input is in
an authentic state. That is, if the program will normally be run against a file that has not been accessed
recently, you must make sure that the file used in measuring the program is not in memory. If, on the
other hand, a program is usually run as part of a standard sequence in which it gets its input from the
output of the preceding program, you should prime memory to ensure that the measurement is authentic.
For example, the following command would have the effect of priming memory with the pages of the
foo.in file:
# cp foo.in /dev/null
The situation is more complex if the file is large compared to RAM. If the output of one program is the
input of the next and the entire file will not fit in RAM, the second program will read pages at the head of
the file, which displaces pages at the end. Although this situation is very hard to simulate authentically, it
is nearly equivalent to one in which no disk caching takes place.
The case of a file that is perhaps just slightly larger than RAM is a special case of the RAM versus disk
analysis discussed in the next section.
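The simpler version of the vmstatit shell script that produced the next set of output did not survive the
page layout. The following is a minimal sketch consistent with that output, assuming it brackets the
timed command with vmstat -s and reports only the paging-space I/O counts:
vmstat -s >temp.file
time $1
vmstat -s >>temp.file
grep "pagi.*ins" temp.file >>results
grep "pagi.*outs" temp.file >>results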
real 0m0.03s
user 0m0.01s
sys 0m0.02s
2323 paging space page ins
2323 paging space page ins
4850 paging space page outs
4850 paging space page outs
The before-and-after paging statistics are identical, which confirms our belief that the cp command is not
paging-bound. An extended variant of the vmstatit shell script can be used to show the true situation,
as follows:
vmstat -s >temp.file                            # VMM statistics before the command
time $1                                         # run and time the command under test
vmstat -s >>temp.file                           # VMM statistics after the command
echo "Ordinary Input:" >>results
grep "^[ 0-9]*page ins" temp.file >>results     # all page ins, including ordinary file I/O
echo "Ordinary Output:" >>results
grep "^[ 0-9]*page outs" temp.file >>results    # all page outs, including ordinary file I/O
echo "True Paging Output:" >>results
grep "pagi.*outs" temp.file >>results           # paging-space page outs only
echo "True Paging Input:" >>results
grep "pagi.*ins" temp.file >>results            # paging-space page ins only
Because file I/O in the operating system is processed through the VMM, the vmstat -s command reports
ordinary program I/O as page ins and page outs. When the previous version of the vmstatit shell script
was run against the cp command of a large file that had not been read recently, the result was as follows:
real 0m2.09s
user 0m0.03s
sys 0m0.74s
Ordinary Input:
46416 page ins
47132 page ins
Ordinary Output:
146483 page outs
147012 page outs
True Paging Output:
4854 paging space page outs
4854 paging space page outs
True Paging Input:
2527 paging space page ins
2527 paging space page ins
The time command output confirms the existence of an I/O dependency. The increase in page ins shows
the I/O necessary to satisfy the cp command. The increase in page outs indicates that the file is large
enough to force the writing of dirty pages (not necessarily its own) from memory. The fact that there is
no change in the cumulative paging-space-I/O counts confirms that the cp command does not build data
structures large enough to overload the memory of the test machine.
The order in which this version of the vmstatit script reports I/O is intentional. Typical programs read
file input and then write file output. Paging activity, on the other hand, typically begins with the writing
out of a working-segment page that does not fit. The page is read back in only if the program tries to
access it. The fact that the test system has experienced almost twice as many paging space page
outs as paging space page ins since it was booted indicates that at least some of the programs that
have been run on this system have stored data in memory that was not accessed again before the end
of the program. “Memory-limited programs ” on page 84 provides more information. See also “Memory
performance” on page 114.
To show the effects of memory limitation on these statistics, the following example observes a given
command in an environment of adequate memory (32 MB) and then artificially shrinks the system using
the rmss command (see “Memory requirements assessment with the rmss command ” on page 128). The
following command sequence
# cc -c ed.c
# vmstatit "cc -c ed.c" 2>results
first primes memory with the 7944-line source file and the executable file of the C compiler, then
measures the I/O activity of the second execution:
real 0m7.76s
user 0m7.44s
sys 0m0.15s
Ordinary Input:
57192 page ins
57192 page ins
Ordinary Output:
165516 page outs
165553 page outs
True Paging Output:
10846 paging space page outs
10846 paging space page outs
True Paging Input:
6409 paging space page ins
6409 paging space page ins
Clearly, this is not I/O limited. There is not even any I/O necessary to read the source code. If we then
issue the following command:
# rmss -c 8
to change the effective size of the machine to 8 MB, and perform the same sequence of commands, we
get the following output:
real 0m9.87s
user 0m7.70s
sys 0m0.18s
Ordinary Input:
57625 page ins
57809 page ins
Ordinary Output:
165811 page outs
165882 page outs
True Paging Output:
11010 paging space page outs
11061 paging space page outs
True Paging Input:
6623 paging space page ins
6701 paging space page ins
To restore the system to its full real-memory size after the experiment, reset the simulated memory size:
# rmss -r
Resource management
AIX provides tunable components to manage the resources that have the most effect on system
performance.
For specific tuning recommendations see the following:
• “Microprocessor performance” on page 92.
• “Memory performance” on page 114.
• “Logical volume and disk I/O performance” on page 161.
• “Network performance” on page 235.
• “NFS performance” on page 299.
Processor scheduler performance
There are several performance-related issues to consider regarding the processor scheduler.
Thread support
A thread can be thought of as a low-overhead process. It is a dispatchable entity that requires fewer
resources to create than a process. The fundamental dispatchable entity of the AIX Version 4 scheduler is
the thread.
Processes are composed of one or more threads. In fact, workloads migrated directly from earlier
releases of the operating system continue to create and manage processes. Each new process is created
with a single thread that has its parent process priority and contends for the processor with the threads
of other processes. The process owns the resources used in execution; the thread owns only its current
state.
When new or modified applications take advantage of the operating system's thread support to create
additional threads, those threads are created within the context of the process. They share the process's
private segment and other resources.
A user thread within a process has a specified contention scope. If the contention scope is global, the
thread contends for processor time with all other threads in the system. The thread that is created when a
process is created has global contention scope. If the contention scope is local, the thread contends with
the other threads within the process to be the recipient of the process's share of processor time.
The algorithm for determining which thread should be run next is called a scheduling policy.
The nice value of a thread is set when the thread is created and is constant over the life of the
thread, unless explicitly changed by the user through the renice command or the setpri(), setpriority(),
thread_setsched(), or nice() system calls.
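For example, to make a running CPU-intensive process less favored, you can raise its nice value with the
renice command (the process ID here is hypothetical):
# renice -n 5 -p 24018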
The processor penalty is an integer that is calculated from the recent processor usage of a thread. The
recent processor usage increases by approximately 1 each time the thread is in control of the processor at
the end of a 10 ms clock tick, up to a maximum value of 120. The actual priority penalty per tick increases
with the nice value. Once per second, the recent processor usage values for all threads are recalculated.
The result is the following:
• The priority of a nonfixed-priority thread becomes less favorable as its recent processor usage increases
and vice versa. This implies that, on average, the more time slices a thread has been allocated recently,
the less likely it is that the thread will be allocated the next time slice.
• The priority of a nonfixed-priority thread becomes less favorable as its nice value increases, and vice
versa.
Note: With the use of multiple processor run queues and their load balancing mechanism, nice or
renice values might not have the expected effect on thread priorities because less favored priorities
might have equal or greater run time than favored priorities. Threads requiring the expected effects of
nice or renice should be placed on the global run queue.
You can use the ps command to display the priority value, nice value, and short-term processor-usage
values for a process.
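For example, a long listing shows these values in the PRI (priority), NI (nice), and C (short-term
processor usage) columns:
# ps -el | head -4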
See “Controlling contention for the microprocessor” on page 109 for a more detailed discussion on using
the nice and renice commands.
See “Thread-Priority-Value calculation” on page 111, for the details of the calculation of the processor
penalty and the decay of the recent processor usage values.
The priority mechanism is also used by AIX Workload Manager to enforce processor resource
management. Because threads classified under the Workload Manager have their priorities managed by
the Workload Manager, they might have different priority behavior over threads not classified under the
Workload Manager.
Figure 7. Run Queue
All the dispatchable threads of a given priority occupy positions in the run queue.
The fundamental dispatchable entity of the scheduler is the thread. AIX maintains 256 run queues. The
run queues relate directly to the range of possible values (0 through 255) for the priority field for each
thread. This method makes it easier for the scheduler to determine which thread is most favored to run.
Without having to search a single large run queue, the scheduler consults a mask where a bit is on to
indicate the presence of a ready-to-run thread in the corresponding run queue.
The priority value of a thread changes rapidly and frequently. The constant movement is because of the
way the scheduler recalculates priorities. This is not true, however, for fixed-priority threads.
Starting with AIX Version 6.1, each processor has a run queue per node. The run queue values that are
reported by the performance tools are the sum of the threads in all of the run queues. Having a
per-processor run queue saves overhead on dispatching locks and improves overall processor affinity,
because threads tend to stay on the same processor more often. If a thread becomes runnable because of an
event on a processor other than the one on which it last ran, the thread is dispatched immediately only
if there is an idle processor; otherwise, no preemption occurs until the state of the thread's processor
is next examined, such as on an interrupt.
On multiprocessor systems with multiple run queues, transient priority inversions can occur. It is possible
that at any time one run queue has several threads with more favorable priority than another run queue.
AIX has mechanisms for priority balancing over time, but if strict priority is required (for example, for
real-time applications) an environment variable that is called RT_GRQ exists. The RT_GRQ environmental
variable when set to ON, causes the thread to be on a global run queue. In that case, the global run
queue is searched for the thread with the best priority. This can improve performance for threads that
are interrupt driven. Threads that are running at fixed priority are placed on the global run queue, if the
fixed_pri_global parameter of the schedo command is set to 1.
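For example (both names are taken from the description above), an interrupt-driven program can be placed
on the global run queue by exporting RT_GRQ before starting it, and fixed-priority threads can be placed
there system-wide with the schedo tunable; the application path is a placeholder:
# export RT_GRQ=ON
# /path/to/realtime_app &
# schedo -o fixed_pri_global=1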
The average number of threads in the run queue is seen in the first column of the vmstat command
output. If you divide this number by the number of processors, the result is the average number of
threads that can be run on each processor. If this value is greater than one, these threads must wait
their turn for the processor; the greater the number, the more likely it is that performance delays are
noticed.
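As a rough sketch (lsdev -Cc processor is assumed to report one line per processor on your system), the
average run-queue depth per processor can be computed from vmstat output:
# vmstat 5 6 | tail -6 | awk -v n=$(lsdev -Cc processor | wc -l) \
      '{sum += $1} END {printf "runnable threads per processor: %.2f\n", sum/(NR*n)}'
Note that the first vmstat data line reflects averages since boot, so the result is only approximate.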
When a thread is moved to the end of the run queue (for example, when the thread has control at the end
of a time slice), it is moved to a position after the last thread in the queue that has the same priority value.
Mode switching
A user process undergoes a mode switch when it needs access to system resources. This is implemented
through the system call interface or by interrupts such as page faults.
There are two modes:
• User mode
• Kernel mode
Processor time spent in user mode (application and shared libraries) is reflected as user time in the
output of commands such as the vmstat, iostat, and sar commands. Processor time spent in kernel
mode is reflected as system time in the output of these commands.
User mode
Programs that execute in the user protection domain are user processes.
Code that executes in this protection domain executes in user execution mode, and has the following
access:
• Read/write access to user data in the process private region
• Read access to the user text and shared text regions
• Access to shared data regions using the shared memory functions
Programs executing in the user protection domain do not have access to the kernel or kernel data
segments, except indirectly through the use of system calls. A program in this protection domain can only
affect its own execution environment and executes in the process or unprivileged state.
Kernel mode
Programs that execute in the kernel protection domain include interrupt handlers, kernel processes, the
base kernel, and kernel extensions (device driver, system calls and file systems).
This protection domain implies that code executes in kernel execution mode, and has the following
access:
• Read/write access to the global kernel address space
• Read/write access to the kernel data in the process region when executing within a process
Kernel services must be used to access user data within the process address space.
Programs executing in this protection domain can affect the execution environments of all programs,
because they have the following characteristics:
• They can access global system data
• They can use kernel services
• They are exempt from all security restraints
• They execute in the processor privileged state.
Mode switches
The use of a system call by a user-mode process allows a kernel function to be called from user mode.
Access to functions that directly or indirectly invoke system calls is typically provided by programming
libraries, which provide access to operating system functions.
Mode switches should be differentiated from the context switches seen in the output of the vmstat
(cs column) and sar (cswch/s) commands. A context switch occurs when the currently running thread is
different from the previously running thread on that processor.
The scheduler performs a context switch when any of the following occurs:
• A thread must wait for a resource (voluntarily), such as disk I/O, network I/O, sleep, or locks
• A higher priority thread wakes up (involuntarily)
• The thread has used up its time slice (usually 10 ms).
Context switch time, system calls, device interrupts, NFS I/O, and any other activity in the kernel is
considered as system time.
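For example, the context-switch rate can be watched directly; vmstat reports it in the cs column, and the
sar -w flag is assumed here to produce the cswch/s report mentioned above:
# vmstat 5 3
# sar -w 5 3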
Real-memory management
The VMM plays an important role in the management of real memory.
Virtual-memory segments are partitioned into fixed-size units called pages. AIX 7.1 running on POWER5+
processors supports four page sizes: 4 KB, 64 KB, 16 MB, and 16 GB. For more information, see Multiple
page size support. Each page in a segment can be in real memory (RAM), or stored on disk until it
is needed. Similarly, real memory is divided into page frames. The role of the VMM is to manage the
allocation of real-memory page frames and to resolve references by the program to virtual-memory pages
that are not currently in real memory or do not yet exist (for example, when a process makes the first
reference to a page of its data segment).
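For example, the page sizes supported by a given system can be listed with the pagesize command (the -a
flag is assumed to be available at your AIX level):
# pagesize -a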
Because the amount of virtual memory that is in use at any given instant can be larger than real memory,
the VMM must store the surplus on disk. From the performance standpoint, the VMM has two, somewhat
opposed, objectives:
• Minimize the overall processor-time and disk-bandwidth cost of the use of virtual memory.
• Minimize the response-time cost of page faults.
In pursuit of these objectives, the VMM maintains a free list of page frames that are available to satisfy
a page fault. The VMM uses a page-replacement algorithm to determine which virtual-memory pages
currently in memory will have their page frames reassigned to the free list. The page-replacement
algorithm uses several mechanisms:
• Virtual-memory segments are classified into either persistent segments or working segments.
• Virtual-memory segments are classified as containing either computational or file memory.
• Virtual-memory pages whose access causes a page fault are tracked.
• Page faults are classified as new-page faults or as repage faults.
• Statistics are maintained on the rate of repage faults in each virtual-memory segment.
• User-tunable thresholds influence the page-replacement algorithm's decisions.
Persistent-segment types are further classified. Client segments are used to map remote files (for
example, files that are being accessed through NFS), including remote executable programs. Pages
from client segments are saved and restored over the network to their permanent file location, not on
the local-disk paging space. Journaled and deferred segments are persistent segments that must be
atomically updated. If a page from a journaled or deferred segment is selected to be removed from real
memory (paged out), it must be written to disk paging space unless it is in a state that allows it to be
committed (written to its permanent file location).
Computational versus file memory
Computational memory, also known as computational pages, consists of the pages that belong to
working-storage segments or program text (executable files) segments.
File memory (or file pages) consists of the remaining pages. These are usually pages from permanent data
files in persistent storage.
Page replacement
When the number of available real memory frames on the free list becomes low, a page stealer is invoked.
A page stealer moves through the Page Frame Table (PFT), looking for pages to steal.
The PFT includes flags to signal which pages have been referenced and which have been modified. If the
page stealer encounters a page that has been referenced, it does not steal that page, but instead, resets
the reference flag for that page. The next time the clock hand (page stealer) passes that page and the
reference bit is still off, that page is stolen. A page that was not referenced in the first pass is immediately
stolen.
The modify flag indicates that the data on that page has been changed since it was brought into memory.
When a page is to be stolen, if the modify flag is set, a pageout call is made before stealing the page.
Pages that are part of working segments are written to paging space; persistent segments are written to
disk.
In addition to the page-replacement, the algorithm keeps track of both new page faults (referenced for
the first time) and repage faults (referencing pages that have been paged out), by using a history buffer
that contains the IDs of the most recent page faults. It then tries to balance file (persistent data) page
outs with computational (working storage or program text) page outs.
When a process exits, its working storage is released immediately and its associated memory frames are
put back on the free list. However, any files that the process may have opened can stay in memory.
Repaging
A page fault is considered to be either a new page fault or a repage fault. A new page fault occurs when
there is no record of the page having been referenced recently. A repage fault occurs when a page that
is known to have been referenced recently is referenced again, and is not found in memory because the
page has been replaced (and perhaps written to disk) since it was last accessed.
A perfect page-replacement policy would eliminate repage faults entirely (assuming adequate real
memory) by always stealing frames from pages that are not going to be referenced again. Thus, the
number of repage faults is an inverse measure of the effectiveness of the page-replacement algorithm
in keeping frequently reused pages in memory, thereby reducing overall I/O demand and potentially
improving system performance.
To classify a page fault as new or repage, the VMM maintains a repage history buffer that contains the
page IDs of the N most recent page faults, where N is the number of frames that the memory can hold. For
example, 512 MB memory requires a 128 KB repage history buffer. At page-in, if the page's ID is found in
the repage history buffer, it is counted as a repage. Also, the VMM estimates the computational-memory
repaging rate and the file-memory repaging rate separately by maintaining counts of repage faults for
each type of memory. The repaging rates are multiplied by 0.9 each time the page-replacement algorithm
runs, so that they reflect recent repaging activity more strongly than historical repaging activity.
VMM thresholds
Several numerical thresholds define the objectives of the VMM. When one of these thresholds is
breached, the VMM takes appropriate action to bring the state of memory back within bounds. This
section discusses the thresholds that the system administrator can alter through the vmo command.
The number of page frames on the free list is controlled by the following parameters:
minfree
Minimum acceptable number of real-memory page frames in the free list. When the size of the free list
falls below this number, the VMM begins stealing pages. It continues stealing pages until the size of
the free list reaches maxfree.
maxfree
Maximum size to which the free list will grow by VMM page-stealing. The size of the free list may
exceed this number as a result of processes terminating and freeing their working-segment pages or
the deletion of files that have pages in memory.
The VMM attempts to keep the size of the free list greater than or equal to minfree. When page faults or
system demands cause the free list size to fall below minfree, the page-replacement algorithm runs. The
size of the free list must be kept above a certain level (the default value of minfree) for several reasons.
For example, the operating system's sequential-prefetch algorithm requires several frames at a time for
each process that is doing sequential reads. Also, the VMM must avoid deadlocks within the operating
system itself, which could occur if there were not enough space to read in a page that was required to free
a page frame.
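For example, the current free-list thresholds can be displayed, and then raised, with the vmo command
(the values shown are illustrative only, not recommendations):
# vmo -o minfree -o maxfree
# vmo -o minfree=960 -o maxfree=1088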
The following thresholds are expressed as percentages. They represent the fraction of the total real
memory of the machine that is occupied by file pages (pages of noncomputational segments).
minperm
If the percentage of real memory occupied by file pages falls below this level, the page-replacement
algorithm steals both file and computational pages, regardless of repage rates.
maxperm
If the percentage of real memory occupied by file pages rises above this level, the page-replacement
algorithm steals only file pages.
maxclient
If the percentage of real memory occupied by file pages is above this level, the page-replacement
algorithm steals only client pages.
When the percentage of real memory occupied by file pages is between the minperm and maxperm
parameter values, the VMM normally steals only file pages, but if the repaging rate for file pages is higher
than the repaging rate for computational pages, computational pages are stolen as well.
The main intent of the page-replacement algorithm is to ensure that computational pages are given fair
treatment. For example, the sequential reading of a long data file into memory should not cause the loss
of program text pages that are likely to be used again soon. The page-replacement algorithm's use of the
thresholds and repaging rates ensures that both types of pages get treated fairly, with a slight bias in favor
of computational pages.
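The corresponding vmo tunables are expressed as percentages (the % suffix on the tunable names, and the
example value, are assumptions to verify at your AIX level); for example:
# vmo -a | grep perm
# vmo -o minperm%=5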
This technique does involve some degree of risk. If all of the programs running in a machine happened
to encounter maximum-size situations simultaneously, paging space might be exhausted. Some programs
might not be able to continue to completion.
# export PSALLOC=early
This example causes all future programs to be executed in the environment to use early allocation. The
currently executing shell is not affected.
Early allocation is of interest to the performance analyst mainly because of its paging-space size
implications. If early allocation is turned on for those programs, paging-space requirements can increase
many times. Whereas the normal recommendation for paging-space size is at least twice the size of
the system's real memory, the recommendation for systems that use PSALLOC=early is at least four
times the real memory size. Actually, this is just a starting point. Analyze the virtual storage requirements
of your workload and allocate paging spaces to accommodate them. As an example, at one time, the
AIXwindows server required 250 MB of paging space when run with early allocation.
When using PSALLOC=early, the user should set a handler for the SIGSEGV signal, pre-allocating and
setting the memory as a stack using the sigaltstack function. Even though PSALLOC=early is specified,
when there is not enough paging space and a program attempts to expand the stack, the program may receive
the SIGSEGV signal.
Deferred page space allocation (DPSA) is controlled by the defps tunable of the vmo command. To disable
DPSA, type:
# vmo -o defps=0
To enable DPSA, type:
# vmo -o defps=1
In general, system performance can be improved by DPSA, because the overhead of allocating page space
after page faults is avoided. Paging space devices also need less disk space if DPSA is used.
For further information, see “Page space allocation” on page 140 and “Paging spaces placement and
sizes” on page 88.
Within each volume group, one or more logical volumes (LVs) are defined. Each logical volume consists
of one or more logical partitions. Each logical partition corresponds to at least one physical partition.
If mirroring is specified for the logical volume, additional physical partitions are allocated to store the
additional copies of each logical partition. Although the logical partitions are numbered consecutively, the
underlying physical partitions are not necessarily consecutive or contiguous.
Logical volumes can serve a number of system purposes, such as paging, but each logical volume that
holds ordinary system data or user data or programs contains a single journaled file system (JFS or
Enhanced JFS). Each JFS consists of a pool of page-size (4096-byte) blocks. When data is to be written to
a file, one or more additional blocks are allocated to that file. These blocks may or may not be contiguous
with one another and with other blocks previously allocated to the file.
For purposes of illustration, the previous figure shows a bad (but not the worst possible) situation that
might arise in a file system that had been in use for a long period without reorganization. The /op/
filename file is physically recorded on a large number of blocks that are physically distant from one
another. Reading the file sequentially would result in many time-consuming seek operations.
While an operating system's file is conceptually a sequential and contiguous string of bytes, the physical
reality might be very different. Fragmentation may arise from multiple extensions to logical volumes as
well as allocation/release/reallocation activity within a file system. A file system is fragmented when its
available space consists of large numbers of small chunks of space, making it impossible to write out a
new file in contiguous blocks.
Access to files in a highly fragmented file system may result in a large number of seeks and longer
I/O response times (seek latency dominates I/O response time). For example, if the file is accessed
sequentially, a file placement that consists of many, widely separated chunks requires more seeks than
a placement that consists of one or a few large contiguous chunks. If the file is accessed randomly, a
placement that is widely dispersed requires longer seeks than a placement in which the file's blocks are
close together.
The effect of a file's placement on I/O performance diminishes when the file is buffered in memory. When
a file is opened in the operating system, it is mapped to a persistent data segment in virtual memory. The
segment represents a virtual buffer for the file; the file's blocks map directly to segment pages. The VMM
manages the segment pages, reading file blocks into segment pages upon demand (as they are accessed).
There are several circumstances that cause the VMM to write a page back to its corresponding block in
the file on disk; but, in general, the VMM keeps a page in memory if it has been accessed recently. Thus,
frequently accessed pages tend to stay in memory longer, and logical file accesses to the corresponding
blocks can be satisfied without physical disk accesses.
At some point, the user or system administrator can choose to reorganize the placement of files within
logical volumes and the placement of logical volumes within physical volumes to reduce fragmentation
and to more evenly distribute the total I/O load. “Logical volume and disk I/O performance” on page 161
contains further details about detecting and correcting disk placement and fragmentation problems.
Multiprocessing
At any given time, a technological limit exists on the speed with which a single processor chip can
operate. If a system's workload cannot be handled satisfactorily by a single processor, one response is to
apply multiple processors to the problem.
The success of this response depends not only on the skill of the system designers, but also on whether
the workload is amenable to multiprocessing. In terms of human tasks, adding people might be a good
idea if the task is answering calls to a toll-free number, but is dubious if the task is driving a car.
If improved performance is the objective of a proposed migration from a uniprocessor to a multiprocessor
system, the following conditions must be true:
• The workload is processor-limited and has saturated its uniprocessor system.
• The workload contains multiple processor-intensive elements, such as transactions or complex
calculations, that can be performed simultaneously and independently.
• The existing uniprocessor cannot be upgraded or replaced with another uniprocessor of adequate
power.
Although unchanged single-thread applications normally function correctly in a multiprocessor
environment, their performance often changes in unexpected ways. Migration to a multiprocessor can
improve the throughput of a system, and can improve the execution time of complex, multithreaded
applications, but seldom improves the response time of individual, single-thread commands.
Getting the best possible performance from a multiprocessor system requires an understanding of the
operating-system and hardware-execution dynamics that are unique to the multiprocessor environment.
Types of multiprocessing
Several categories of multiprocessing (MP) systems exist.
Shared nothing MP
The processors share nothing (each has its own memory, caches, and disks), but they are interconnected.
This type of multiprocessing is also called a pure cluster.
Each processor is a complete stand-alone machine and runs a copy of the operating system. When
LAN-connected, processors are loosely coupled. When connected by a switch, the processors are tightly
coupled. Communication between processors is done through message-passing.
The advantages of such a system are very good scalability and high availability. The disadvantages of such
a system are an unfamiliar programming model (message passing).
Shared disks MP
The advantages of shared disks are that part of a familiar programming model is retained (disk data is
addressable and coherent, memory is not), and high availability is much easier than with shared-memory
systems. The disadvantages are limited scalability due to bottlenecks in physical and logical access to
shared data.
Processors have their own memory and cache. The processors run in parallel and share disks. Each
processor runs a copy of the operating system and the processors are loosely coupled (connected through
LAN). Communication between processors is done through message-passing.
Shared memory MP
All of the processors are tightly coupled inside the same box with a high-speed bus or a switch. The
processors share the same global memory, disks, and I/O devices. Only one copy of the operating system
runs across all of the processors, and the operating system must be designed to exploit this architecture
(multithreaded operating system).
SMPs have several advantages:
• They are a cost-effective way to increase throughput.
• They offer a single system image since the Operating System is shared between all the processors
(administration is easy).
• They apply multiple processors to a single problem (parallel programming).
• Load balancing is done by the operating system.
• The uniprocessor (UP) programming model can be used in an SMP.
• They are scalable for shared data.
• All data is addressable by all the processors and kept coherent by the hardware snooping logic.
• There is no need to use message-passing libraries to communicate between processors because
communication is done through the global shared memory.
• Increased processing-power requirements can be met by adding more processors to the system. However,
you must set realistic expectations about the increase in performance when adding more processors to an
SMP system.
• More and more applications and tools are available today. Most UP applications can run on or are ported
to SMP architecture.
There are some limitations of SMP systems, as follows:
• There are limits on scalability due to cache coherency, locking mechanism, shared objects, and others.
• There is a need for new skills to exploit multiprocessors, such as threads programming and device
drivers programming.
Parallelizing an application
An application can be parallelized on an SMP in one of two ways.
• The traditional way is to break the application into multiple processes. These processes communicate
using inter-process communication (IPC) such as pipes, semaphores or shared memory. The processes
must be able to block waiting for events such as messages from other processes, and they must
coordinate access to shared objects with something like locks.
• Another way is to use the portable operating system interface for UNIX (POSIX) threads. Threads
have similar coordination problems as processes and similar mechanisms to deal with them. Thus a
single process can have any number of its threads running simultaneously on different processors.
Coordinating them and serializing access to shared data are the developer's responsibility.
Data serialization
Any storage element that can be read or written by more than one thread may change while the program
is running.
This is generally true of multiprogramming environments as well as multiprocessing environments, but
the advent of multiprocessors adds to the scope and importance of this consideration in two ways:
• Multiprocessors and thread support make it attractive and easier to write applications that share data
among threads.
• The kernel can no longer solve the serialization problem simply by disabling interrupts.
Note: To avoid serious problems, programs that share data must arrange to access that data serially,
rather than in parallel. Before a program updates a shared data item, it must ensure that no other
program (including another copy of itself running on another thread) will change the item. Reads can
usually be done in parallel.
The primary mechanism that is used to keep programs from interfering with one another is the lock.
A lock is an abstraction that represents permission to access one or more data items. Lock and
unlock requests are atomic; that is, they are implemented in such a way that neither interrupts nor
multiprocessor access affect the outcome. All programs that access a shared data item must obtain the
lock that corresponds to that data item before manipulating it. If the lock is already held by another
program (or another thread running the same program), the requesting program must defer its access
until the lock becomes available.
Besides the time spent waiting for the lock, serialization adds to the number of times a thread
becomes nondispatchable. While the thread is nondispatchable, other threads are probably causing the
nondispatchable thread's cache lines to be replaced, which results in increased memory-latency costs
when the thread finally gets the lock and is dispatched.
The operating system's kernel contains many shared data items, so it must perform serialization
internally. Serialization delays can therefore occur even in an application program that does not share
data with other programs, because the kernel services used by the program have to serialize shared
kernel data.
Locks
Use locks to allocate and free internal operating system memory.
For more information, see Understanding Locking.
Types of locks
The Open Software Foundation/1 (OSF/1) 1.1 locking methodology was used as a model for the AIX
multiprocessor lock functions.
However, because the system is preemptable and pageable, some characteristics have been added to
the OSF/1 1.1 Locking Model. Simple locks and complex locks are preemptable. Also, a thread may sleep
when trying to acquire a busy simple lock if the owner of the lock is not currently running. In addition,
a simple lock becomes a sleep lock when a processor has been spinning on a simple lock for a certain
amount of time (this amount of time is a system-wide variable).
Simple locks
A simple lock in operating system version 4 is a spin lock that sleeps under certain conditions,
preventing a thread from spinning indefinitely.
Simple locks are preemptable, meaning that a kernel thread can be preempted by another higher priority
kernel thread while it holds a simple lock. On a multiprocessor system, simple locks, which protect
thread-interrupt critical sections, must be used in conjunction with interrupt control in order to serialize
execution both within the executing processor and between different processors.
On a uniprocessor system, interrupt control is sufficient; there is no need to use locks. Simple locks
are intended to protect thread-thread and thread-interrupt critical sections. When requested from an
interrupt handler, a simple lock spins until the lock becomes available. Simple locks have two states:
locked or unlocked.
Complex locks
The complex locks in AIX are read-write locks that protect thread-thread critical sections. These locks are
preemptable.
Complex locks are spin locks that will sleep under certain conditions. By default, they are not recursive,
but can become recursive through the lock_set_recursive() kernel service. They have three states:
exclusive-write, shared-read, or unlocked.
Lock granularity
A programmer working in a multiprocessor environment must decide how many separate locks must
be created for shared data. If there is a single lock to serialize the entire set of shared data items,
lock contention is comparatively likely. The existence of widely used locks places an upper limit on the
throughput of the system.
If each distinct data item has its own lock, the probability of two threads contending for that lock is
comparatively low. Each additional lock and unlock call costs processor time, however, and the existence
of multiple locks makes a deadlock possible. At its simplest, deadlock is the situation shown in the
following illustration, in which Thread 1 owns Lock A and is waiting for Lock B. Meanwhile, Thread 2 owns
Lock B and is waiting for Lock A. Neither program will ever reach the unlock() call that would break the
deadlock. The usual preventive for deadlock is to establish a protocol by which all of the programs that
use a given set of locks must always acquire them in exactly the same sequence.
Locking overhead
Requesting locks, waiting for locks, and releasing locks add processing overhead.
• A program that supports multiprocessing always does the same lock and unlock processing, even
though it is running in a uniprocessor or is the only user in a multiprocessor system of the locks in
question.
• When one thread requests a lock held by another thread, the requesting thread may spin for a while or
be put to sleep and, if possible, another thread dispatched. This consumes processor time.
• The existence of widely used locks places an upper bound on the throughput of the system. For
example, if a given program spends 20 percent of its execution time holding a mutual-exclusion lock,
at most five instances of that program can run simultaneously, regardless of the number of processors
in the system. In fact, even five instances would probably never be so nicely synchronized as to avoid
waiting for one another (see “Multiprocessor throughput scalability ” on page 59).
Cache coherency
In designing a multiprocessor, engineers give considerable attention to ensuring cache coherency. They
succeed; but cache coherency has a performance cost.
We need to understand the problem being attacked:
If each processor has a cache that reflects the state of various parts of memory, it is possible that two
or more caches may have copies of the same line. It is also possible that a given line may contain more
than one lockable data item. If two threads make appropriately serialized changes to those data items,
the result could be that both caches end up with different, incorrect versions of the line of memory. In
other words, the system's state is no longer coherent because the system contains two different versions
of what is supposed to be the content of a specific area of memory.
The solutions to the cache coherency problem usually include invalidating all but one of the duplicate
lines when the line is modified. Although the hardware uses snooping logic to invalidate, without any
software intervention, any processor whose cache line has been invalidated will have a cache miss, with
its attendant delay, the next time that line is addressed.
Snooping is the logic used to resolve the problem of cache consistency. Snooping logic in the processor
broadcasts a message over the bus each time a word in its cache has been modified. The snooping logic
also snoops on the bus looking for such messages from other processors.
When a processor detects that another processor has changed a value at an address existing in its
own cache, the snooping logic invalidates that entry in its cache. This is called cross invalidate. Cross
invalidate reminds the processor that the value in the cache is not valid, and it must look for the
correct value somewhere else (memory or other cache). Since cross invalidates increase cache misses
and the snooping protocol adds to the bus traffic, solving the cache consistency problem reduces the
performance and scalability of all SMPs.
Workload concurrency
The primary performance issue that is unique to SMP systems is workload concurrency, which can be
expressed as, "Now that we have n processors, how do we keep them all usefully employed"?
If only one processor in a four-way multiprocessor system is doing useful work at any given time, it is no
better than a uniprocessor. It could possibly be worse, because of the extra code to avoid interprocessor
interference.
Workload concurrency is the complement of serialization. To the extent that the system software or the
application workload (or the interaction of the two) require serialization, workload concurrency suffers.
Workload concurrency may also be decreased, more desirably, by increased processor affinity. The
improved cache efficiency gained from processor affinity may result in quicker completion of the program.
Workload concurrency is reduced (unless there are more dispatchable threads available), but response
time is improved.
A component of workload concurrency, process concurrency, is the degree to which a multithreaded
process has multiple dispatchable threads at all times.
Throughput
The throughput of an SMP system is mainly dependent on several factors.
• A consistently high level of workload concurrency. More dispatchable threads than processors at certain
times cannot compensate for idle processors at other times.
• The amount of lock contention.
• The degree of processor affinity.
Response time
The response time of a particular program in an SMP system is dependent on several factors.
• The process-concurrency level of the program. If the program consistently has two or more
dispatchable threads, its response time will probably improve in an SMP environment. If the program
consists of a single thread, its response time will be, at best, comparable to that in a uniprocessor of the
same speed.
• The amount of lock contention of other instances of the program or with other programs that use the
same locks.
• The degree of processor affinity of the program. If each dispatch of the program is to a different
processor that has none of the program's cache lines, the program may run more slowly than in a
comparable uniprocessor.
SMP workloads
The effect of additional processors on performance is dominated by certain characteristics of the specific
workload being handled. This section discusses those critical characteristics and their effects.
The following terms are used to describe the extent to which an existing program has been modified, or a
new program designed, to operate in an SMP environment:
SMP safe
Avoidance in a program of any action, such as unserialized access to shared data, that would
cause functional problems in an SMP environment. This term, when used alone, usually refers to
a program that has undergone only the minimum changes necessary for correct functioning in an SMP
environment.
SMP efficient
Avoidance in a program of any action that would cause functional or performance problems in an
SMP environment. A program that is described as SMP-efficient is SMP-safe as well. An SMP-efficient
program has usually undergone additional changes to minimize incipient bottlenecks.
SMP exploiting
Adding features to a program that are specifically intended to make effective use of an SMP
environment, such as multithreading. A program that is described as SMP-exploiting is generally
assumed to be SMP-safe and SMP-efficient as well.
Workload multiprocessing
Multiprogramming operating systems running heavy workloads on fast computers give our human senses
the impression that several things are happening simultaneously.
In fact, many demanding workloads do not have large numbers of dispatchable threads at any given
instant, even when running on a single-processor system where serialization is less of a problem. Unless
there are always at least as many dispatchable threads as there are processors, one or more processors
will be idle part of the time.
The number of dispatchable threads is the total number of threads in the system
• Minus the number of threads that are waiting for I/O,
• Minus the number of threads that are waiting for a shared resource,
• Minus the number of threads that are waiting for the results of another thread,
• Minus the number of threads that are sleeping at their own request.
A workload can be said to be multiprocessable to the extent that it presents at all times as many
dispatchable threads as there are processors in the system. Note that this does not mean simply an
average number of dispatchable threads equal to the processor count. If the number of dispatchable
threads is zero half the time and twice the processor count the rest of the time, the average number
of dispatchable threads will equal the processor count, but any given processor in the system will be
working only half the time.
Figure 13. Multiprocessor Scaling
On the multiprocessor, two processors handle program execution, but there is still only one lock.
For simplicity, all of the lock contention is shown affecting processor B. In the period shown, the
multiprocessor handles 14 commands. The scaling factor is thus 1.83. We stop at two processors
because more would not change the situation. The lock is now in use 100 percent of the time. In a
four-way multiprocessor, the scaling factor would be 1.83 or less.
Real programs are seldom as symmetrical as the commands in the illustration. In addition we have
only taken into account one dimension of contention: locking. If we had included cache-coherency and
processor-affinity effects, the scaling factor would almost certainly be lower.
This example illustrates that workloads often cannot be made to run faster simply by adding processors.
It is also necessary to identify and minimize the sources of contention among the threads.
Scaling is workload-dependent. Some published benchmark results imply that high levels of scalability
are easy to achieve. Most such benchmarks are constructed by running combinations of small, CPU-
intensive programs that use almost no kernel services. These benchmark results represent an upper
bound on scalability, not a realistic expectation.
Another interesting point to note for benchmarks is that in general, a one-way SMP will run slower (about
5-15 percent) than the equivalent uniprocessor running the UP version of the operating system.
As an example, if 50 percent of a program's processing must be done sequentially, and 50 percent can be
done in parallel, the maximum response-time improvement is less than a factor of 2 (in an otherwise-idle
4-way multiprocessor, it is at most 1.6).
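This limit is an instance of Amdahl's law: with a sequential fraction s of the work and N processors, the best possible speedup is 1 / (s + (1 - s) / N). For s = 0.5 and N = 4, this gives 1 / (0.5 + 0.125) = 1.6; even with an unlimited number of processors, the speedup approaches, but never reaches, 1 / s = 2.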
Default scheduler processing of migrated workloads
The division between processes and threads is invisible to existing programs.
In fact, workloads migrated directly from earlier releases of the operating system create processes as
they have always done. Each new process is created with a single thread (the initial thread) that contends
for the CPU with the threads of other processes.
The default attributes of the initial thread, in conjunction with the new scheduler algorithms, minimize
changes in system dynamics for unchanged workloads.
Priorities can be manipulated with the nice and renice commands and the setpri() and setpriority()
system calls, as before. The scheduler allows a given thread to run for at most one time slice (normally
10 ms) before forcing it to yield to the next dispatchable thread of the same or higher priority. See
“Controlling contention for the microprocessor” on page 109 for more detail.
Thread environment variables
Within the libpthreads.a framework, a series of tuning knobs is provided that might affect the performance of the application.
If possible, use a front-end shell script to invoke the binary executable programs. The shell script should set the values that you want to use in place of the system defaults for the environment variables described in the sections that follow.
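A minimal wrapper sketch follows (the variable settings and the program name, prog, are placeholders, not recommendations):
#!/usr/bin/ksh
# Front-end wrapper: override the system defaults for selected thread
# tuning variables for this application only, then start the program.
export AIXTHREAD_SCOPE=S
export SPINLOOPTIME=500
export MALLOCMULTIHEAP=considersize,heaps:4
exec /usr/local/bin/prog "$@"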
AIXTHREAD_COND_DEBUG
The AIXTHREAD_COND_DEBUG variable maintains a list of condition variables for use by the debugger. If
the program contains a large number of active condition variables and frequently creates and destroys
condition variables, this can create higher overhead for maintaining the list of condition variables. Setting
the variable to OFF will disable the list. Leaving this variable turned on makes debugging threaded
applications easier, but can impose some overhead.
AIXTHREAD_ENRUSG
The AIXTHREAD_ENRUSG variable enables or disables the pthread resource collection. Turning it on
allows for resource collection of all pthreads in a process, but will impose some overhead.
AIXTHREAD_GUARDPAGES=n
The AIXTHREAD_GUARDPAGES variable controls the number of guard pages that are added to the end of the pthread stack. The following diagrams show where the guard pages (the RED ZONE) fall within the memory that is allocated for a pthread:
* +-----------------------+
* | pthread attr |
* +-----------------------+ <--- pthread->pt_attr
* | pthread struct |
* +-----------------------+ <--- pthread->pt_stk.st_limit
* | pthread stack |
* | | |
* | V |
* +-----------------------+ <--- pthread->pt_stk.st_base
* | RED ZONE |
* +-----------------------+ <--- pthread->pt_guardaddr
* | pthread private data |
* +-----------------------+ <--- pthread->pt_data
* +-----------------------+
* | page alignment 2 |
* | [8K-4K+PTH_FIXED-a1] |
* +-----------------------+
* | pthread ctx [368] |
* +-----------------------+<--- pthread->pt_attr
* | pthread attr [112] |
* +-----------------------+ <--- pthread->pt_attr
* | pthread struct [960] |
* +-----------------------+ <--- pthread
* | pthread stack | pthread->pt_stk.st_limit
* | |[96K+4K-PTH_FIXED] |
* | V |
* +-----------------------+ <--- pthread->pt_stk.st_base
* | RED ZONE [4K] |
* +-----------------------+ <--- pthread->pt_guardaddr
* | pthread key data [4K] |
* +-----------------------+ <--- pthread->pt_data
* | page alignment 1 (a1) |
* | [<4K] |
* +-----------------------+
AIXTHREAD_DISCLAIM_GUARDPAGES
The AIXTHREAD_DISCLAIM_GUARDPAGES variable controls whether the stack guardpages are
disclaimed when a pthread stack is created. If AIXTHREAD_DISCLAIM_GUARDPAGES=ON, the
guardpages are disclaimed. If a pthread stack does not have any guardpages, setting the
AIXTHREAD_DISCLAIM_GUARDPAGES variable has no effect.
AIXTHREAD_MNRATIO
The AIXTHREAD_MNRATIO variable controls the scaling factor of the library. This ratio is used when
creating and terminating pthreads. It may be useful for applications with a very large number of threads.
However, always test a ratio of 1:1 because it may provide for better performance.
Note: Starting from AIX 7.3, this environment variable does not affect the scaling factor of the library
because the M:N thread model is no longer supported.
AIXTHREAD_MUTEX_DEBUG
The AIXTHREAD_MUTEX_DEBUG variable maintains a list of active mutexes for use by the debugger. If the
program contains a large number of active mutexes and frequently creates and destroys mutexes, this can
create higher overhead for maintaining the list of mutexes. Setting the variable to ON makes debugging
threaded applications easier, but may impose additional overhead. Leaving the variable set to OFF
disables the list.
AIXTHREAD_MUTEX_FAST
If the program experiences performance degradation due to heavy mutex contention, then setting
this variable to ON will force the pthread library to use an optimized mutex locking mechanism that
works only on process-private mutexes. These process-private mutexes must be initialized using the
pthread_mutex_init routine and must be destroyed using the pthread_mutex_destroy routine. Leaving the
variable set to OFF forces the pthread library to use the default mutex locking mechanism.
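As a sketch, a mutex that is eligible for this optimized path can be created and destroyed as follows (error handling is omitted; a mutex initialized with a NULL attributes pointer is process-private by default):
#include <pthread.h>

pthread_mutex_t lock;

void setup(void)
{
    /* Default attributes yield a process-private mutex, as required
       by the optimized locking mechanism. */
    pthread_mutex_init(&lock, NULL);
}

void cleanup(void)
{
    pthread_mutex_destroy(&lock);
}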
AIXTHREAD_READ_GUARDPAGES
The AIXTHREAD_READ_GUARDPAGES variable enables or disables read access to the guardpages that are
added to the end of the pthread stack. For more information about guardpages that are created by the
pthread, see “AIXTHREAD_GUARDPAGES=n” on page 64.
AIXTHREAD_RWLOCK_DEBUG
The AIXTHREAD_RWLOCK_DEBUG variable maintains a list of read-write locks for use by the debugger.
If the program contains a large number of active read-write locks and frequently creates and destroys
read-write locks, this may create higher overhead for maintaining the list of read-write locks. Setting the
variable to OFF will disable the list.
AIXTHREAD_SUSPENDIBLE={ON|OFF}
Setting the AIXTHREAD_SUSPENDIBLE variable to ON prevents deadlock in applications that use the
following routines with the pthread_suspend_np routine or the pthread_suspend_others_np routine:
• pthread_getrusage_np
• pthread_cancel
• pthread_detach
• pthread_join
• pthread_getunique_np
• pthread_join_np
• pthread_setschedparam
• pthread_getschedparam
• pthread_kill
There is a small performance penalty associated with this variable.
AIXTHREAD_SCOPE={S|P}
The S option signifies a system-wide contention scope (1:1), and the P option signifies a process-wide
contention scope (M:N). One of these options must be specified; the default value is S.
Use of the AIXTHREAD_SCOPE environment variable impacts only those threads created with the
default attribute. The default attribute is employed when the attr parameter of the pthread_create()
subroutine is NULL.
If a user thread is created with system-wide scope, it is bound to a kernel thread and it is scheduled by
the kernel. The underlying kernel thread is not shared with any other user thread.
If a user thread is created with process-wide scope, it is subject to the user scheduler, which means the following:
• It does not have a dedicated kernel thread.
• It sleeps in user mode.
• It is placed on the user run queue when it is waiting for a processor.
• It is subjected to time slicing by the user scheduler.
Tests show that some applications can perform better with the 1:1 model.
Note: Starting from AIX 7.3, this environment variable does not affect the contention scope because
only system-wide contention scope (1:1 model) is supported.
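A program can also request system-wide contention scope for an individual thread through its attributes, which has the same effect for that thread as AIXTHREAD_SCOPE=S. A minimal sketch (the worker function is hypothetical):
#include <pthread.h>

extern void *worker(void *arg);    /* hypothetical thread function */

void start_worker(void)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    /* Bind the user thread 1:1 to a kernel thread. */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
}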
AIXTHREAD_SLPRATIO
The AIXTHREAD_SLPRATIO thread tuning variable controls the number of kernel threads that should be
held in reserve for sleeping threads. In general, fewer kernel threads are required to support sleeping
pthreads because they are generally woken one at a time. This conserves kernel resources.
Note: Starting from AIX 7.3, this environment variable does not affect the number of kernel threads
that must be reserved for the sleeping threads because the M:N thread model is no longer supported.
AIXTHREAD_STK=n
The AIXTHREAD_STK=n thread tuning variable controls the decimal number of bytes that should be
allocated for each pthread. This value may be overridden by pthread_attr_setstacksize.
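For example, a single thread can be given a larger stack than either the system default or the AIXTHREAD_STK value (the 256 KB figure is only an illustration):
#include <pthread.h>

extern void *worker(void *arg);    /* hypothetical thread function */

void start_deep_thread(void)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    /* Request a 256 KB stack for this thread only. */
    pthread_attr_setstacksize(&attr, 256 * 1024);
    pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
}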
AIXTHREAD_AFFINITY={default|strict|first-touch}
The AIXTHREAD_AFFINITY variable controls the placement of pthread structures, stacks, and thread-local storage on an enhanced-affinity-enabled system.
• The default option does not attempt any special placement of this data, balancing it over the memory regions used by the process as determined by the system settings.
• The strict option always places this data in memory that is local to the pthread.
• The first-touch option places this data in memory that is local to the processor on which the pthread first runs.
AIXTHREAD_PREALLOC=n
The AIXTHREAD_PREALLOC variable designates the number of bytes to pre-allocate and free during
thread creation. Some multi-threaded applications may benefit from this by avoiding calls to sbrk() from
multiple threads simultaneously.
The default is 0 and n must be a positive value.
AIXTHREAD_HRT
The AIXTHREAD_HRT=true variable allows high-resolution time-outs for the application's pthreads. You must have root authority or the CAP_NUMA_ATTACH capability to enable high-resolution time-outs. This environment variable is ignored if you do not have the required authority or capabilities.
MALLOCBUCKETS
Malloc buckets provides an optional buckets-based extension of the default allocator. It is intended
to improve malloc performance for applications that issue large numbers of small allocation requests.
When malloc buckets is enabled, allocation requests that fall within a predefined range of block sizes
are processed by malloc buckets. All other requests are processed in the usual manner by the default
allocator.
Malloc buckets is not enabled by default. It is enabled and configured prior to process startup by setting
the MALLOCTYPE and MALLOCBUCKETS environment variables.
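For example, a minimal sketch of enabling malloc buckets is as follows (the bucket options and values shown are illustrative, not recommendations):
# export MALLOCTYPE=buckets
# export MALLOCBUCKETS=number_of_buckets:16,bucket_sizing_factor:32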
For more information on malloc buckets, see General Programming Concepts: Writing and Debugging
Programs.
MALLOCMULTIHEAP={considersize,heaps:n}
Multiple heaps are required so that a threaded application can have more than one thread issuing
malloc(), free(), and realloc() subroutine calls. With a single heap, all threads trying to do a malloc(),
free(), or realloc() call would be serialized (that is, only one call at a time). The result is a serious impact
on multi-processor machines. With multiple heaps, each thread gets its own heap. If all heaps are being
used, then any new threads trying to do a call will have to wait until one or more of the heaps is available.
Serialization still exists, but the likelihood of its occurrence and its impact when it does occur are greatly
reduced.
The thread-safe locking has been changed to handle this approach. Each heap has its own lock, and the
locking routine "intelligently" selects a heap to try to prevent serialization. If the considersize option is
set in the MALLOCMULTIHEAP environment variable, then the selection will also try to select any available
heap that has enough free space to handle the request instead of just selecting the next unlocked heap.
More than one option can be specified (and in any order) as long as they are comma-separated, for
example:
MALLOCMULTIHEAP=considersize,heaps:3
heaps:n
Use this option to change the number of heaps. The valid range for n is 1 to 32. If you set n to a
number outside of this range (that is, if n<=0 or n>32), n will be set to 32.
The default for MALLOCMULTIHEAP is NOT SET (only the first heap is used). If the environment variable
MALLOCMULTIHEAP is set (for example, MALLOCMULTIHEAP=1) then the threaded application will be able
to use all of the 32 heaps. Setting MALLOCMULTIHEAP=heaps:n will limit the number of heaps to n
instead of the 32 heaps.
For more information, see the Malloc Multiheap section in General Programming Concepts: Writing and
Debugging Programs.
SPINLOOPTIME=n
The SPINLOOPTIME variable controls the number of times the system tries to get a busy mutex or spin
lock without taking a secondary action such as calling the kernel to yield the process. This control is
intended for MP systems, where it is hoped that the lock being held by another actively running pthread
will be released. The parameter works only within libpthreads (user threads). If locks are usually available
within a short amount of time, you may want to increase the spin time by setting this environment
variable. The number of times to retry a busy lock before yielding to another pthread is n. The default is 40
and n must be a positive value.
The MAXSPIN kernel parameter affects spinning in the kernel lock routines (see “Using the schedo
command to modify the MAXSPIN parameter” on page 71).
YIELDLOOPTIME=n
The YIELDLOOPTIME variable controls the number of times that the system yields the processor when
trying to acquire a busy mutex or spin lock before actually going to sleep on the lock. The processor is
yielded to another kernel thread, assuming there is another executable one with sufficient priority. This
variable has been shown to be effective in complex applications, where multiple locks are in use. The
number of times to yield the processor before blocking on a busy lock is n. The default is 0 and n must be
a positive value.
Note: Starting from AIX 7.3, the process-wide contention scope is no longer supported. The following
environment variables do not affect the scheduling of pthreads.
AIXTHREAD_MNRATIO=p:k
where k is the number of kernel threads that should be employed to handle p runnable pthreads. This
environment variable controls the scaling factor of the library. This ratio is used when creating and
terminating pthreads. The variable is only valid with process-wide scope; with system-wide scope,
this environment variable is ignored. The default setting is 8:1.
AIXTHREAD_SLPRATIO=k:p
where k is the number of kernel threads that should be held in reserve for p sleeping pthreads. The
sleep ratio is the number of kernel threads to keep on the side in support of sleeping pthreads. In
general, fewer kernel threads are required to support sleeping pthreads, since they are generally
woken one at a time. This conserves kernel resources. Any positive integer value may be specified for
p and k. If k>p, then the ratio is treated as 1:1. The default is 1:12.
AIXTHREAD_MINKTHREADS=n
where n is the minimum number of kernel threads that should be used. The library scheduler will
not reclaim kernel threads below this figure. A kernel thread may be reclaimed at virtually any point.
Generally, a kernel thread is targeted for reclaim as a result of a pthread terminating. The default is 8.
Each of these thread tuning variables can be set in the environment before the application is started, for example:
# export variable_name=ON
SMP tools
All performance tools of the operating system support SMP machines.
Some performance tools provide individual processor utilization statistics. Other performance tools
average out the utilization statistics for all processors and display only the averages.
This section describes the tools that are only supported on SMP. For details on all other performance
tools, see the appropriate sections.
To display the available processors, run the following command:
# bindprocessor -q
The available processors are: 0 1 2 3
The output shows the logical processor numbers for the available processors, which are used with the bindprocessor command as shown in the following examples.
To bind a process whose PID is 14596 to processor 1, run the following:
# bindprocessor 14596 1
No return message is given if the command was successful. To verify whether a process is bound to a processor, use the ps -mo THREAD command as explained in “Using the ps command” on page 102:
# ps -mo THREAD
USER PID PPID TID ST CP PRI SC WCHAN F TT BND COMMAND
root 3292 7130 - A 1 60 1 - 240001 pts/0 - -ksh
- - - 14309 S 1 60 1 - 400 - - -
root 14596 3292 - A 73 100 1 - 200001 pts/0 1 /tmp/cpubound
- - - 15629 R 73 100 1 - 0 - 1 -
root 15606 3292 - A 74 101 1 - 200001 pts/0 - /tmp/cpubound
- - - 16895 R 74 101 1 - 0 - - -
root 16634 3292 - A 73 100 1 - 200001 pts/0 - /tmp/cpubound
- - - 15107 R 73 100 1 - 0 - - -
root 18048 3292 - A 14 67 1 - 200001 pts/0 - ps -mo THREAD
- - - 17801 R 14 67 1 - 0 - - -
The BND column shows the number of the processor that the process is bound to or a dash (-) if the
process is not bound at all.
To unbind a process whose PID is 14596, use the following command:
# bindprocessor -u 14596
# ps -mo THREAD
USER PID PPID TID ST CP PRI SC WCHAN F TT BND COMMAND
root 3292 7130 - A 2 61 1 - 240001 pts/0 - -ksh
- - - 14309 S 2 61 1 - 400 - - -
root 14596 3292 - A 120 124 1 - 200001 pts/0 - /tmp/cpubound
- - - 15629 R 120 124 1 - 0 - - -
root 15606 3292 - A 120 124 1 - 200001 pts/0 - /tmp/cpubound
- - - 16895 R 120 124 1 - 0 - - -
root 16634 3292 - A 120 124 0 - 200001 pts/0 - /tmp/cpubound
- - - 15107 R 120 124 0 - 0 - - -
root 18052 3292 - A 12 66 1 - 200001 pts/0 - ps -mo THREAD
- - - 17805 R 12 66 1 - 0 - - -
When the bindprocessor command is used on a process, all of its threads will then be bound to
one processor and unbound from their former processor. Unbinding the process will also unbind all its
threads. You cannot bind or unbind an individual thread using the bindprocessor command.
However, within a program, you can use the bindprocessor() function call to bind individual threads. If
the bindprocessor() function is used within a piece of code to bind threads to processors, the threads
remain with these processors and cannot be unbound. If the bindprocessor command is used on that
process, all of its threads will then be bound to one processor and unbound from their respective former
processors. An unbinding of the whole process will also unbind all the threads.
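A minimal sketch of binding the calling thread from within a program follows (error handling is reduced to a message):
#include <stdio.h>
#include <sys/processor.h>
#include <sys/thread.h>

/* Bind the calling kernel thread to logical processor 1. */
void bind_me(void)
{
    if (bindprocessor(BINDTHREAD, thread_self(), 1) != 0)
        perror("bindprocessor");
}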
A process cannot be bound until it is started; that is, it must exist in order to be bound. When a process
does not exist, the following error displays:
# bindprocessor 7359 1
1730-002: Process 7359 does not match an existing process
Similarly, when the specified processor does not exist, an error message is displayed:
# bindprocessor 7358 4
1730-001: Processor 4 is not available
Note: Do not use the bindprocessor command on the wait kproc processes.
Binding considerations
There are several issues to consider before you use the process binding.
Binding can be useful for CPU-intensive programs that experience few interrupts. It can sometimes be
counterproductive for ordinary programs because it might delay the redispatch of a thread after an I/O
until the processor to which the thread is bound becomes available. If the thread is blocked during an
I/O operation, it is unlikely that much of its processing context remains in the caches of the processor to
which it is bound. The thread is better served if it is dispatched to the next available processor.
Binding does not prevent other processes from being dispatched on the processor on which you bound
your process. Binding is different from partitioning. Using rsets or exclusive rsets allows a set of
logical processors to be dedicated for a specific workload. Therefore, a higher priority process might be
dispatched on the processor where you bind your process. In this case, your process is not dispatched on other processors, and therefore the performance of the bound process is not increased. Better results might be achieved if you increase the priority of the bound process.
If you bind a process on a heavily loaded system, you might decrease its performance because when a
processor becomes idle, the process is not able to run on the idle processor if it is not the processor on
which the process is bound.
If the process is multithreaded, binding the process binds all its threads to the same processor. Therefore,
the process does not take advantage of the multiprocessing, and performance is not improved.
The MAXSPIN kernel parameter, which affects spinning in the kernel lock routines, can be changed with the schedo command, as follows:
# schedo -o maxspin=8192
Workload component identification
Whether the program is new or purchased, small or large, the developers, the installers, and the
prospective users have assumptions about the use of the program.
Some of these assumptions may be:
• Who will be using the program
• Situations in which the program will be run
• How often those situations will arise and at what times of the hour, day, month, or year
• Whether those situations will also require additional uses of existing programs
• Which systems the program will run on
• How much data will be handled, and from where
• Whether data created by or for the program will be used in other ways
Unless these ideas are elicited as part of the design process, they will probably be vague, and the
programmers will almost certainly have different assumptions than the prospective users. Even in the
apparently trivial case in which the programmer is also the user, leaving the assumptions unarticulated
makes it impossible to compare design to assumptions in any rigorous way. Worse, it is impossible to
identify performance requirements without a complete understanding of the work being performed.
In estimating resources, we are primarily interested in four dimensions (in no particular order):
CPU time
Processor cost of the workload
Disk accesses
Rate at which the workload generates disk reads or writes
LAN traffic
Number of packets the workload generates and the number of bytes of data exchanged
Real memory
Amount of RAM the workload requires
The following sections discuss how to determine these values in various situations.
To measure overall system activity during the run, use the vmstat command:
# vmstat 5 >vmstat.output
This gives us a picture of the state of the system every 5 seconds during the measurement run. The
first set of vmstat output contains the cumulative data from the last boot to the start of the vmstat
command. The remaining sets are the results for the preceding interval, in this case 5 seconds. For a typical set of vmstat output and an explanation of its columns, see “vmstat command” on page 92.
Similarly, use the iostat command to capture disk and CPU activity:
# iostat 5 >iostat.output
This gives us a picture of the state of the system every 5 seconds during the measurement run. The
first set of iostat output contains the cumulative data from the last boot to the start of the iostat
command. The remaining sets are the results for the preceding interval, in this case 5 seconds. For a typical set of iostat output, see the example in “Continuous performance monitoring with the iostat command”.
To obtain an overall picture of memory use, use the svmon command:
# svmon -G
In this example, the machine's 256 MB memory is fully used. About 83 percent of RAM is in use for
working segments, the read/write memory of running programs (the rest is for caching files). If there are
long-running processes in which we are interested, we can review their memory requirements in detail.
The following example determines the memory used by a process of user hoetzel.
# ps -fu hoetzel
UID PID PPID C STIME TTY TIME CMD
hoetzel 24896 33604 0 09:27:35 pts/3 0:00 /usr/bin/ksh
hoetzel 32496 25350 6 15:16:34 pts/5 0:00 ps -fu hoetzel
# svmon -P 24896
------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd LPage
24896 ksh 7547 4045 1186 7486 N N N
The working segment (5176), with 4 pages in use, is the cost of this instance of the ksh program. The
2619-page cost of the shared library and the 58-page cost of the ksh program are spread across all of the
running programs and all instances of the ksh program, respectively.
If we believe that our 256 MB system is larger than necessary, use the rmss command to reduce the
effective size of the machine and remeasure the workload. If paging increases significantly or response
time deteriorates, we have reduced memory too much. This technique can be continued until we find a
size that runs our workload without degradation. See “Memory requirements assessment with the rmss
command ” on page 128 for more information on this technique.
The primary command for measuring network usage is the netstat program. The following example
shows the activity of a specific Token-Ring interface:
# netstat -I tr0 5
input (tr0) output input (Total) output
packets errs packets errs colls packets errs packets errs colls
35552822 213488 30283693 0 0 35608011 213488 30338882 0 0
300 0 426 0 0 300 0 426 0 0
272 2 190 0 0 272 2 190 0 0
231 0 192 0 0 231 0 192 0 0
143 0 113 0 0 143 0 113 0 0
408 1 176 0 0 408 1 176 0 0
The first line of the report shows the cumulative network traffic since the last boot. Each subsequent line
shows the activity for the preceding 5-second interval.
the average cost is relatively low. See “Performance-Limiting Resource identification” on page 30 for
more information on using the vmstat command.
We can use the ps command to identify the processes of interest and report the cumulative CPU time consumption of those processes. We can then use the svmon command (judiciously) to assess the memory use of the processes.
Transforming program-level estimates to workload estimates
The best method for estimating peak and typical resource requirements is to use a queuing model such as
BEST/1.
Static models can be used, but you run the risk of overestimating or underestimating the peak resource.
In either case, you need to understand how multiple programs in a workload interact from the standpoint
of resource requirements.
If you are building a static model, use a time interval that is the specified worst-acceptable response time
for the most frequent or demanding program (usually they are the same). Determine which programs will
typically be running during each interval, based on your projected number of users, their think time, their
key entry rate, and the anticipated mix of operations.
Use the following guidelines:
• CPU time
– Add together the CPU requirements for all of the programs that are running during the interval.
Include the CPU requirements of the disk and communications I/O the programs will be doing.
– If this number is greater than 75 percent of the available CPU time during the interval, consider fewer
users or more CPUs.
• Real Memory
– The operating system memory requirement scales with the amount of physical memory. Start with 6
to 8 MB for the operating system itself. The lower figure is for a standalone system. The latter figure is
for a system that is LAN-connected and uses TCP/IP and NFS.
– Add together the working segment requirements of all of the instances of the programs that will be
running during the interval, including the space estimated for the program's data structures.
– Add to that total the memory requirement of the text segment of each distinct program that will be
running (one copy of the program text serves all instances of that program). Remember that any (and
only) subroutines that are from unshared libraries will be part of the executable program, but the
libraries themselves will not be in memory.
– Add to the total the amount of space consumed by each of the shared libraries that will be used by
any program in the workload. Again, one copy serves all.
– To allow adequate space for some file caching and the free list, your total memory projection should
not exceed 80 percent of the size of the machine to be used.
• Disk I/O
– Add the number of I/Os implied by each instance of each program. Keep separate totals for I/Os to
small files (or randomly to large files) versus purely sequential reading or writing of large files (more
than 32 KB).
– Subtract those I/Os that you believe will be satisfied from memory. Any record that was read or
written in the previous interval is probably still available in the current interval. Beyond that, examine
the size of the proposed machine versus the total RAM requirements of the machine's workload. Any
space remaining after the operating system's requirement and the workload's requirements probably
contains the most recently read or written file pages. If your application's design is such that there is
a high probability that you will reuse recently accessed data, you can calculate an allowance for the
caching effect. Remember that the reuse is at the page level, not at the record level. If the probability
of reuse of a given record is low, but there are a lot of records per page, it is likely that some of the
records needed in any given interval will fall in the same page as other, recently used, records.
– Compare the net I/O requirements (disk I/Os per second per disk) to the approximate capabilities of
current disk drives. If the random or sequential requirement is greater than 75 percent of the total
corresponding capability of the disks that will hold application data, tuning (and possibly expansion)
will be needed when the application is in production.
• Communications I/O
Processor-limited programs
If the program is processor-limited because it consists almost entirely of numerical computation, the
chosen algorithm will have a major effect on the performance of the program.
The maximum speed of a truly processor-limited program is determined by:
• The algorithm used
• The source code and data structures created by the programmer
• The sequence of machine-language instructions generated by the compiler
• The sizes and structures of the processor's caches
• The architecture and clock rate of the processor itself (see “Determining microprocessor speed” on
page 380)
A discussion of alternative algorithms is beyond the scope of this topic collection. It is assumed that
computational efficiency has been considered in choosing the algorithm.
Given an algorithm, the only items in the preceding list that the programmer can affect are the source
code, the compiler options used, and possibly the data structures. The following sections deal with
techniques that can be used to improve the efficiency of an individual program for which the user has
the source code. If the source code is not available, attempt to use tuning or workload-management
techniques.
often make inefficient use of the storage allocated to them, with adverse performance effects in small or
heavily loaded systems.
Taking the storage hierarchy into account means understanding and adapting to the general principles
of efficient programming in a cached or virtual-memory environment. Repackaging techniques can yield
significant improvements without recoding, and any new code should be designed with efficient storage
use in mind.
Two terms are essential to any discussion of the efficient use of hierarchical storage: locality of reference
and working set.
• The locality of reference of a program is the degree to which its instruction-execution addresses and
data references are clustered in a small area of storage during a given time interval.
• The working set of a program during that same interval is the set of storage blocks that are in use, or,
more precisely, the code or data that occupy those blocks.
A program with good locality of reference has a minimal working set, because the blocks that are in use
are tightly packed with executing code or data. A functionally equivalent program with poor locality of
reference has a larger working set, because more blocks are needed to accommodate the wider range of
addresses being accessed.
Because each block takes a significant amount of time to load into a given level of the hierarchy, the
objective of efficient programming for a hierarchical-storage system is to design and package code in such
a way that the working set remains as small as practical.
The following figure illustrates good and bad practice at a subroutine level. The first version of the
program is packaged in the sequence in which it was probably written. The first subroutine PriSub1
contains the entry point of the program. It always uses primary subroutines PriSub2 and PriSub3. Some
infrequently used functions of the program require secondary subroutines SecSub1 and SecSub2. On
rare occasions, the error subroutines ErrSub1 and ErrSub2 are needed.
The initial version of the program has poor locality of reference because it takes three pages of memory to
run in the normal case. The secondary and error subroutines separate the main path of the program into
three, physically distant sections.
The improved version of the program places the primary subroutines adjacent to one another and puts the
low-frequency function after that. The necessary error subroutines (which are rarely used) are left at the
end of the executable program. The most common functions of the program can now be handled with only
one disk read and one page of memory instead of the three previously required.
Remember that locality of reference and working set are defined with respect to time. If a program works
in stages, each of which takes a significant time and uses a different set of subroutines, try to minimize
the working set of each stage.
library, /usr/lib/libblas.a. A larger set of performance-tuned subroutines is the Engineering and
Scientific Subroutine Library (ESSL) licensed program.
The FORTRAN run-time environment must be installed to use the library. Users should generally use this
library for their matrix and vector operations because its subroutines are tuned to a degree that users are
unlikely to achieve by themselves.
If the data structures are controlled by the programmer, other efficiencies are possible. The general
principle is to pack frequently used data together whenever possible. If a structure contains frequently
accessed control information and occasionally accessed detailed data, make sure that the control
information is allocated in consecutive bytes. This will increase the probability that all of the control
information will be loaded into the cache with a single (or at least with the minimum number of) cache
misses.
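The following sketch illustrates the principle; the structures and field names are hypothetical:
/* Frequently accessed control information is packed together so that
   it spans as few cache lines as possible. */
struct control {
    int            state;
    int            flags;
    unsigned int   use_count;
    struct detail *detail;    /* rarely accessed data kept separate */
};

/* Occasionally accessed detail is kept in its own structure so that
   it does not dilute the cache lines holding the control fields. */
struct detail {
    char description[256];
    long history[64];
};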
Optimization levels
The degree to which the compiler will optimize the code it generates is controlled by the -O flag.
No optimization
In the absence of any version of the -O flag, the compiler generates straightforward code with no
instruction reordering or other attempt at performance improvement.
-O or -O2
These equivalent flags cause the compiler to optimize on the basis of conservative assumptions
about code reordering. Only explicit relaxations such as the #pragma directives are used. This level
performs no software pipelining, loop unrolling, or simple predictive commoning. It also constrains
the amount of memory the compiler can use.
-O3
This flag directs the compiler to be aggressive about the optimization techniques used and to use
as much memory as necessary for maximum optimization. This level of optimization may result
in functional changes to the program if the program is sensitive to floating-point exceptions, the
sign of zero, or precision effects of reordering calculations. These side effects can be avoided, at
some performance cost, by using the -qstrict option in combination with -O3. The -qhot option, in combination with -O3, enables additional loop optimizations.
When calling the string functions, include the string.h header file so that the compiler can substitute its optimized, inline versions of those functions:
#include <string.h>
• When it is necessary to access a global variable (that is not shared with other threads), copy the value
into a local variable and use the copy.
Unless the global variable is accessed only once, it is more efficient to use the local copy.
• Use binary codes rather than strings to record and test for situations. Strings consume both data and
instruction space. For example, the sequence:
#define situation_1 1
#define situation_2 2
#define situation_3 3
int situation_val;
situation_val = situation_2;
. . .
if (situation_val == situation_1)
. . .
is more efficient than the following string-based equivalent:
char situation_val[20];
strcpy(situation_val,"situation_2");
. . .
if ((strcmp(situation_val,"situation_1"))==0)
. . .
• When strings are necessary, use fixed-length strings rather than null-terminated variable-length strings
wherever possible.
The mem*() family of routines, such as memcpy(), is faster than the corresponding str*() routines, such
as strcpy(), because the str*() routines must check each byte for null and the mem*() routines do not.
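For example, with a fixed-length field, both copying and comparison can use the mem*() routines (the field and function names are hypothetical):
#include <string.h>

#define CODE_LEN 8

char code[CODE_LEN];

/* No scan for a terminating null is needed with fixed-length fields. */
void set_code(const char *newcode)
{
    memcpy(code, newcode, CODE_LEN);
}

int code_matches(const char *candidate)
{
    return memcmp(code, candidate, CODE_LEN) == 0;
}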
Memory-limited programs
To programmers accustomed to struggling with the addressing limitations of, for instance, the DOS
environment, 256 MB virtual memory segments seem effectively infinite. The programmer is tempted to
ignore storage constraints and code for minimum path length and maximum simplicity. Unfortunately,
there is a drawback to this attitude.
Virtual memory is large, but it is variable-speed. The more memory used, the slower it becomes, and
the relationship is not linear. As long as the total amount of virtual storage actually being touched by
all programs (that is, the sum of the working sets) is slightly less than the amount of unpinned real
memory in the machine, virtual memory performs at about the speed of real memory. As the sum of the
working sets of all executing programs passes the number of available page frames, memory performance
degrades rapidly (if VMM memory load control is turned off) by up to two orders of magnitude. When this point is reached, the system spends most of its time paging, and little useful work is done, because each program's working pages are continually being stolen to satisfy the demands of the other programs.
Misuse of pinned storage
To avoid circularities and time-outs, a small fraction of the system must be pinned in real memory.
For this code and data, the concept of working set is meaningless, because all of the pinned information
is in real storage all the time, whether or not it is used. Any program (such as a user-written device driver)
that pins code or data must be carefully designed (or scrutinized, if ported) to ensure that only minimal
amounts of pinned storage are used. Some cautionary examples are as follows:
• Code is pinned on a load-module (executable file) basis. If a component has some object modules that
must be pinned and others that can be pageable, package the pinned object modules in a separate load
module.
• Pinning a module or a data structure because there might be a problem is irresponsible. The designer
should understand the conditions under which the information could be required and whether a page
fault could be tolerated at that point.
• Pinned structures whose required size is load-dependent, such as buffer pools, should be tunable by
the system administrator.
– Large files that are heavily used and are normally accessed randomly, such as databases, must be
spread across two or more physical volumes.
Related concepts
Logical volume and disk I/O performance
This topic focuses on the performance of logical volumes and locally attached disk drives.
However, with Deferred Page Space Allocation, this guideline may tie up more disk space than required.
See “Page space allocation” on page 140 for more information.
Ideally, there should be several paging spaces of roughly equal size, each on a different physical disk
drive. If you decide to create additional paging spaces, create them on physical volumes that are more
lightly loaded than the physical volume in rootvg. When allocating paging space blocks, the VMM allocates
four blocks, in turn, from each of the active paging spaces that has space available. While the system is
booting, only the primary paging space (hd6) is active. Consequently, all paging-space blocks allocated
during boot are on the primary paging space. This means that the primary paging space should be
somewhat larger than the secondary paging spaces. The secondary paging spaces should all be of the
same size to ensure that the algorithm performed in turn can work effectively.
The lsps -a command gives a snapshot of the current utilization level of all the paging spaces on a system. You can also use the psdanger() subroutine to determine how closely paging-space utilization
is approaching critical levels. As an example, the following program uses the psdanger() subroutine to
provide a warning message when a threshold is exceeded:
/* psmonitor.c
Monitors system for paging space low conditions. When the condition is
detected, writes a message to stderr.
Usage: psmonitor [Interval [Count]]
Default: psmonitor 1 1000000
*/
#include <stdio.h>
#include <signal.h>
#include <stdlib.h>     /* atoi */
#include <unistd.h>     /* sleep */
main(int argc,char **argv)
{
int interval = 1; /* seconds */
int count = 1000000; /* intervals */
int current; /* interval */
int last; /* check */
int kill_offset; /* returned by psdanger() */
int danger_offset; /* returned by psdanger() */
    /* Minimal sketch of the remaining logic: parse the optional
       Interval and Count arguments, then poll psdanger() once
       per interval. */
    if (argc > 1) interval = atoi(argv[1]);
    if (argc > 2) count = atoi(argv[2]);
    last = count - 1;
    for (current = 0; current < count; current++) {
        kill_offset = psdanger(SIGKILL);     /* free blocks above the kill threshold */
        danger_offset = psdanger(SIGDANGER); /* free blocks above the danger threshold */
        if (kill_offset < 0)
            fprintf(stderr, "OUT OF PAGING SPACE! %d blocks beyond SIGKILL threshold.\n",
                -kill_offset);
        else if (danger_offset < 0)
            fprintf(stderr, "WARNING: paging space low. %d blocks beyond SIGDANGER threshold.\n",
                -danger_offset);
        if (current < last)
            sleep(interval);
    }
}
POWER4-based systems
There are several performance issues related to POWER4-based servers.
For related information, see “File system performance” on page 214, “Resource management” on page
35, and IBM Redbooks® publication The POWER4 Processor Introduction and Tuning Guide.
Microprocessor comparison
The following table compares key aspects of different IBM microprocessors.
64-bit kernel
The AIX operating system provides a 64-bit kernel that addresses bottlenecks that could have limited
throughput on 32-way systems.
As of AIX 7.1, the 64-bit kernel is the only kernel available. POWER4 systems are optimized for the 64-bit
kernel, which is intended to increase scalability of RS/6000 System p systems. It is optimized for running
64-bit applications on POWER4 systems. The 64-bit kernel also improves scalability by allowing larger
amounts of physical memory.
Additionally, JFS2 is the default file system for AIX 7.1. You can choose to use either JFS or Enhanced
JFS. For more information on Enhanced JFS, see File system performance.
Enhanced Journaled File System
Enhanced JFS, or JFS2, is another native AIX journaling file system. This is the default file system for AIX
6.1 and later.
For more information on Enhanced JFS, see “File system performance” on page 214.
Microprocessor performance
This topic includes information on techniques for detecting runaway or processor-intensive programs and minimizing their adverse effects on system performance.
If you are not familiar with microprocessor scheduling, you may want to refer to the “Processor scheduler
performance” on page 36 topic before continuing.
vmstat command
The first tool to use is the vmstat command, which quickly provides compact information about various
system resources and their related performance problems.
The vmstat command reports statistics about kernel threads in the run and wait queue, memory,
paging, disks, interrupts, system calls, context switches, and CPU activity. The reported CPU activity is
a percentage breakdown of user mode, system mode, idle time, and waits for disk I/O.
Note: If the vmstat command is used without any interval, then it generates a single report. The single
report is an average report from when the system was started. You can specify only the Count parameter
with the Interval parameter. If the Interval parameter is specified without the Count parameter, then the
reports are generated continuously.
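For example:
# vmstat          (a single report, averaged since system startup)
# vmstat 2        (a report every 2 seconds, continuing until stopped)
# vmstat 2 5      (five reports at 2-second intervals)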
As a CPU monitor, the vmstat command is superior to the iostat command in that its one-line-per-
report output is easier to scan as it scrolls and there is less overhead involved if there are many disks
attached to the system. The following example can help you identify situations in which a program has run
away or is too CPU-intensive to run in a multiuser environment.
# vmstat 2
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
1 0 22478 1677 0 0 0 0 0 0 188 1380 157 57 32 0 10
1 0 22506 1609 0 0 0 0 0 0 214 1476 186 48 37 0 16
0 0 22498 1582 0 0 0 0 0 0 248 1470 226 55 36 0 9
This output shows the effect of introducing a program in a tight loop to a busy multiuser system. The
first three reports (the summary has been removed) show the system balanced at 50-55 percent user,
30-35 percent system, and 10-15 percent I/O wait. When the looping program begins, all available CPU
cycles are consumed. The vmstat report columns that are most relevant to CPU analysis are as follows:
• kthr
Information about kernel thread states. The kthr columns are as follows:
– r
Average number of kernel threads that are runnable, which includes threads that are running and
threads that are waiting for the CPU. If this number is greater than the number of CPUs, there is at
least one thread waiting for a CPU and the more threads there are waiting for CPUs, the greater the
likelihood of a performance impact.
– b
Average number of kernel threads in the VMM wait queue per second. This includes threads that are
waiting on filesystem I/O or threads that have been suspended due to memory load control.
If processes are suspended due to memory load control, the blocked column (b) in the vmstat
report indicates the increase in the number of threads rather than the run queue.
– p
For vmstat -I, the number of threads per second waiting on I/O to raw devices. Threads waiting on I/O to file systems are not included here.
• faults
Information about process control, such as trap and interrupt rate. The faults columns are as follows:
– in
Number of device interrupts per second observed in the interval. Additional information can be found
in “Assessing disk performance with the vmstat command ” on page 166.
– sy
The number of system calls per second observed in the interval. Resources are available to user
processes through well-defined system calls. These calls instruct the kernel to perform operations for
the calling process and exchange data between the kernel and the process. Because workloads and
applications vary widely, and different calls perform different functions, it is impossible to define how
many system calls per second are too many. But typically, when the sy column rises above 10000 calls per second on a uniprocessor, further investigation is called for (on an SMP system, the number is 10000 calls per second per processor). One reason could be "polling" subroutines like the select()
subroutine. For this column, it is advisable to have a baseline measurement that gives a count for a
normal sy value.
– cs
Number of context switches per second observed in the interval. The physical CPU resource is
subdivided into logical time slices of 10 milliseconds each. Assuming a thread is scheduled for
execution, it will run until its time slice expires, until it is preempted, or until it voluntarily gives
up control of the CPU. When another thread is given control of the CPU, the context or working
environment of the previous thread must be saved and the context of the current thread must be
loaded. The operating system has a very efficient context switching procedure, so each switch is
inexpensive in terms of resources. Any significant increase in context switches, such as when cs is a
lot higher than the disk I/O and network packet rate, should be cause for further investigation.
# iostat -t 2 6
tty: tin tout avg-cpu: % user % sys % idle % iowait
0.0 0.8 8.4 2.6 88.5 0.5
0.0 80.2 4.5 3.0 92.1 0.5
0.0 40.5 7.0 4.0 89.0 0.0
0.0 40.5 9.0 2.5 88.5 0.0
The CPU statistics columns (% user, % sys, % idle, and % iowait) provide a breakdown of CPU usage. This
information is also reported in the vmstat command output in the columns labeled us, sy, id, and wa. For
a detailed explanation for the values, see “vmstat command” on page 92. Also note the change made to
%iowait described in “Wait I/O time reporting ” on page 162.
Related tasks
Assessing disk performance with the iostat command
Begin the assessment by running the iostat command with an interval parameter during your system's
peak workload period or while running a critical application for which you need to minimize I/O delays.
Related reference
Continuous performance monitoring with the iostat command
The iostat command is useful for determining disk and CPU usage.
# sar -u 2 5
Average 0 1 0 99 1.00
This example is from a single user workstation and shows the CPU utilization.
Display previously captured data
The -o and -f options (write and read to/from user given data files) allow you to visualize the behavior of
your machine in two independent steps. This consumes less resources during the problem-reproduction
period.
You can use a separate machine to analyze the data by transferring the file because the collected binary
file keeps all data the sar command needs.
# sar -o /tmp/sar.out 2 5 > /dev/null
The above command runs the sar command in the background, collects system activity data at 2-second
intervals for 5 intervals, and stores the (unformatted) sar data in the /tmp/sar.out file. The redirection
of standard output is used to avoid a screen output.
The following command extracts CPU information from the file and outputs a formatted report to standard
output:
# sar -f/tmp/sar.out
Average 0 1 0 99 1.00
The captured binary data file keeps all information needed for the reports. Every possible sar report
could therefore be investigated. This also makes it possible to display the processor-specific information of an SMP system on a single-processor system.
#=================================================================
# SYSTEM ACTIVITY REPORTS
# 8am-5pm activity reports every 20 mins during weekdays.
# activity reports every hour on Saturday and Sunday.
# 6pm-7am activity reports every hour during weekdays.
# Daily summary prepared at 18:05.
#=================================================================
0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 3600 -ubcwyaqvm &
#=================================================================
Collection of data in this manner is useful to characterize system usage over a period of time and to
determine peak usage hours.
# sar -P ALL 2 2
Average 0 9 91 0 0 1.00
1 0 12 0 88 0.00
2 0 2 1 98 0.51
3 0 0 0 100 0.48
- 5 46 0 49 1.99
The last line of every stanza, which starts with a dash (-) in the cpu column, is the average for all
processors. An average (-) line displays only if the -P ALL option is used. It is removed if processors are
specified. The last stanza, labeled with the word Average instead of a time stamp, keeps the averages
for the processor-specific rows over all stanzas.
In vmstat output captured during the same period, the first numbered line is the summary since startup of the system, the second line reflects the start of the sar command, and from the third row onward the reports are comparable. The vmstat command can display only the average microprocessor utilization over all processors. This is comparable to the dashed (-) rows in the microprocessor utilization output of the sar command.
When run inside a WPAR environment, the same command produces the following output:
Average *0 9 91 0 0 1.00
*1 0 12 0 88 0.00
2 0 2 1 98 0.51
3 0 0 0 100 0.48
R 4 51 0 44 1.00
- 5 46 0 49 1.99
The WPAR has an associated RSET registry. Processors 0 and 1 are attached to the RSET. The R line
displays the use by the RSET associated with the WPAR. The processors present in the RSET are
prefixed by the asterisk (*) symbol.
• sar -P RST is used to display the use metrics of the processors present in the RSET. If there is no RSET associated with the WPAR environment, the metrics of all of the processors are displayed.
The following example shows sar -P RST run inside a WPAR environment:
Average 0 20 80 0 0 1.00
1 9 0 0 91 0.00
R 14 40 0 46 1.00
• sar -u
This displays the microprocessor utilization. It is the default if no other flag is specified. It shows the
same information as the microprocessor statistics of the vmstat or iostat commands.
During the following example, a copy command was started:
# sar -u 3 3
AIX wpar1 1 6 00CBA6FE4C00 04/01/08
wpar1 configuration: lcpu=2 memlim=204MB cpulim=0.06 rset=Regular
05:02:57 cpu %usr %sys %wio %idle physc
05:02:59 0 20 80 0 0 1.00
1 10 0 0 90 0.00
R 15 40 0 45 1.00
05:03:01 0 20 80 0 0 1.00
1 8 0 0 92 0.00
R 14 40 0 46 1.00
Average 0 20 80 0 0 1.00
1 9 0 0 91 0.00
R 14 40 0 46 1.00
When run inside a workload partition, the same command also displays the %resc information for workload partitions that have processor resource limits enforced. This metric indicates the percentage of processor resource consumed by the workload partition.
• sar -c
# sar -c 1 3
19:28:25 scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s
19:28:26 134 36 1 0.00 0.00 2691306 1517
19:28:27 46 34 1 0.00 0.00 2716922 1531
19:28:28 46 34 1 0.00 0.00 2716922 1531
While the vmstat command shows system call rates as well, the sar command can also show if
these system calls are read(), write(), fork(), exec(), and others. Pay particular attention to the fork/s
column. If this is high, then further investigation might be needed using the accounting utilities, the
trace command, or the tprof command.
• sar -q
The -q option shows the run-queue size and the swap-queue size.
# sar -q 5 3
runq-sz
The average number of threads that are runnable per second and the percentage of time that the
run queue was occupied (the % field is subject to error).
swpq-sz
The average number of threads in the VMM wait queue and the % of time that the swap queue was
occupied. (The % field is subject to error.)
The -q option can indicate whether you have too many jobs running (runq-sz) or have a potential
paging bottleneck. In a highly transactional system, for example Enterprise Resource Planning (ERP),
the run queue can be in the hundreds, because each transaction uses small amounts of microprocessor
time. If paging is the problem, run the vmstat command. High I/O wait indicates that there is
significant competing disk activity or excessive paging due to insufficient memory.
Note: If static power-saving mode is enabled, VPM performs energy aware core selection even though
the load balancing for the resource sets is enabled.
is the sum of all of the factors that can delay the program, plus the program's own unattributed costs. The following example shows the time command output for a CPU-bound program, looper, on an otherwise-idle system:
# time looper
real 0m3.58s
user 0m3.16s
sys 0m0.04s
In the next example, we run the program at a less favorable priority by adding 10 to its nice value. It takes almost twice as long to run, but other programs are also getting a chance to do their work:
# time nice -n 10 looper
Note that we placed the nice command within the time command, rather than the reverse. If we had entered the commands in the reverse order, we would have run the standalone /usr/bin/time command, which reports with lower precision, rather than the version of the time command that is built into the ksh shell.
The time command is also useful for multithreaded programs. The following output shows a four-thread program run on a single processor:
# time 4threadedprog
real 0m11.70s
user 0m11.09s
sys 0m0.08s
Running it on a 4-way SMP system could show that the real time is only about 1/4 of the user time. The
following output shows that the multithreaded process distributed its workload on several processors
and improved its real execution time. The throughput of the system was therefore increased.
# time 4threadedprog
real 0m3.40s
user 0m9.81s
sys 0m0.09s
CPU intensive
The following shell script:
# ps -ef | egrep -v "STIME|$LOGNAME" | sort +3 -r | head -n 15
is a tool for focusing on the most CPU-intensive user processes in the system (the header line is reinserted for clarity).
The column (C) indicates the recently used CPU. The process of the looping program leads the list. The C value can understate the CPU usage of the looping process, because the scheduler stops counting at 120. For a multithreaded process, this field indicates the sum of the CP values listed for all the threads within that process.
The following example shows a simple five-thread program with all the threads in an infinite looping
program:
In the CP column, the value 720 is the sum of the values for the individual threads listed below it, that is, (5 * 120) + 120: 120 for each of the five created threads, plus 120 for the initial thread.
# ps au
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 19048 24.6 0.0 28 44 pts/1 A 13:53:00 2:16 /tmp/cpubound
root 19388 0.0 0.0 372 460 pts/1 A Feb 20 0:02 -ksh
root 15348 0.0 0.0 372 460 pts/4 A Feb 20 0:01 -ksh
root 20418 0.0 0.0 368 452 pts/3 A Feb 20 0:01 -ksh
root 16178 0.0 0.0 292 364 0 A Feb 19 0:00 /usr/sbin/getty
root 16780 0.0 0.0 364 392 pts/2 A Feb 19 0:00 -ksh
root 18516 0.0 0.0 360 412 pts/0 A Feb 20 0:00 -ksh
root 15746 0.0 0.0 212 268 pts/1 A 13:55:18 0:00 ps au
The %CPU is the percentage of CPU time that has been allocated to that process since the process was started. It is calculated as follows:
%CPU = (process CPU time / elapsed time since the process started) * 100
Imagine two processes: The first starts and runs five seconds, but does not finish; then the second
starts and runs five-seconds but does not finish. The ps command would now show 50 percent %CPU
for the first process (five-seconds CPU for 10 seconds of elapsed time) and 100 percent for the second
(five-seconds CPU for five seconds of elapsed time).
On an SMP, this value is divided by the number of available CPUs on the system. Looking back at the
previous example, this is the reason why the %CPU value for the cpubound process never exceeds 25,
because the example is run on a four-way processor system. The cpubound process uses 100 percent of
one processor, but the %CPU value is divided by the number of available CPUs.
# ps -mo THREAD
USER PID PPID TID ST CP PRI SC WCHAN F TT BND COMMAND
root 20918 20660 - A 0 60 1 - 240001 pts/1 - -ksh
- - - 20005 S 0 60 1 - 400 - - -
The TID column shows the thread ID, and the BND column shows which processes and threads are bound to a
processor.
It is normal to see a process named kproc (PID 516 in operating system version 4) using CPU time.
When no threads are runnable during a time slice, the scheduler assigns the CPU time for that
time slice to this kernel process (kproc), which is known as the idle or wait kproc. SMP systems have an
idle kproc for each processor.
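To list the kernel processes on a running system, including the wait kprocs, you can use the ps
command's -k flag (output varies by system and release):
# ps -k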
1. Create an empty accounting file:
# touch acctfile
2. Turn on accounting:
# /usr/sbin/acct/accton acctfile
3. Allow accounting to run for a while and then turn off accounting:
# /usr/sbin/acct/accton
4. Display what you have recorded:
# /usr/sbin/acct/acctcom acctfile
COMMAND START END REAL CPU MEAN
NAME USER TTYNAME TIME TIME (SECS) (SECS) SIZE(K)
#accton root pts/2 19:57:18 19:57:18 0.02 0.02 184.00
#ps root pts/2 19:57:19 19:57:19 0.19 0.17 35.00
#ls root pts/2 19:57:20 19:57:20 0.09 0.03 109.00
#ps root pts/2 19:57:22 19:57:22 0.19 0.17 34.00
#accton root pts/2 20:04:17 20:04:17 0.00 0.00 0.00
#who root pts/2 20:04:19 20:04:19 0.02 0.02 0.00
If you reuse the same file, you can see when the newer processes were started by looking for the accton
process (this was the process used to turn off accounting the first time).
E = Exec'd F = Forked
X = Exited A = Alive (when trace started or stopped)
C = Thread Created
Instructions that are not common on all platforms must be removed from code written in assembler,
because recompilation is effective only for high-level source code; it has no effect on assembler
routines. Such routines must therefore be changed so that they do not use the missing instructions.
The first step is to determine if instruction emulation is occurring by using the emstat tool.
To determine whether the emstat program is installed and available, run the following command:
# lslpp -lI bos.perf.tools
The following example reports the number of emulated instructions once per second:
# emstat 1
Emulation Emulation
SinceBoot Delta
0 0
0 0
0 0
Once emulation has been detected, the next step is to determine which application is emulating
instructions. This is much harder to determine. One way is to run only one application at a time
and monitor it with the emstat program. Sometimes certain emulations cause a trace hook to be
encountered. This can be viewed in the ASCII trace report file with the words PROGRAM CHECK. The
process or thread associated with this trace event is emulating instructions, either because its own code
emulates them or because it calls library code or other modules that do.
The alstat tool reports alignment exceptions; with the -e flag, it reports emulation statistics as well:
# alstat -e 1
Alignment Alignment Emulation Emulation
SinceBoot Delta SinceBoot Delta
0 0 0 0
0 0 0 0
0 0 0 0
The fdpr command is a performance-tuning utility that can improve both performance and real memory
utilization of user-level application programs. The source code is not necessary as input to the fdpr
program. However, stripped executable programs are not supported. If source code is available, programs
built with the -qfdpr compiler flag contain information to assist the fdpr program in producing reordered
programs.
Large applications (larger than 5 MB) that are CPU-bound can improve execution time up to 23 percent,
but typically the performance is improved between 5 and 20 percent. The reduction of real memory
requirements for text pages for this type of program can reach 70 percent. The average is between 20 and
50 percent. The numbers depend on the application's behavior and the optimization options issued when
using the fdpr program.
The fdpr processing takes place in three stages:
1. The executable module to be optimized is instrumented to allow detailed performance-data collection.
2. The instrumented executable module is run in a workload provided by the user, and performance data
from that run is recorded.
3. The performance data is used to drive a performance-optimization process that results in a
restructured executable module that should perform the workload that exercised the instrumented
executable program more efficiently. It is critically important that the workload used to drive the
fdpr program closely match the actual use of the program. The performance of the restructured
executable program with workloads that differ substantially from that used to drive the fdpr program
is unpredictable, but can be worse than that of the original executable program.
As an example, the fdpr -p ProgramName -R3 -x test.sh command would use the test case
test.sh to run an instrumented form of program ProgramName. The output of that run would be used
to perform the most aggressive optimization (-R3) of the program to form a new module called, by
default, ProgramName.fdpr. The degree to which the optimized executable program performed better
in production than its predecessor would depend largely on the degree to which the test case test.sh
successfully imitated the production workload.
Note: The fdpr program incorporates advanced optimization algorithms that sometimes result in
optimized executable programs that do not function in the same way as the original executable module. It
is absolutely essential that any optimized executable program be thoroughly tested before being used in
any production situation; that is, before its output is trusted.
In summary, users of the fdpr program should adhere to the following:
• Take pains to use a workload to drive the fdpr program that is representative of the intended use.
• Thoroughly test the functioning of the resulting restructured executable program.
If you use the root login, the vmstat command can be run at a more favorable priority with the following:
# nice -n -5 vmstat 10 3 > vmstat.out
If you were not using root login and issued the preceding example nice command, the vmstat command
would still be run but at the standard nice value of 20, and the nice command would not issue any error
message.
For example, the subroutine call:
retcode = setpri(0,59);
would give the current process a fixed priority of 59. If the setpri() subroutine fails, it returns -1.
The following program accepts a priority value and a list of process IDs and sets the priority of all of the
processes to the specified value.
/*
fixprocpri.c
Usage: fixprocpri priority PID . . .
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/sched.h>   /* setpri() is declared here on AIX */

int
main(int argc, char **argv)
{
    pid_t ProcessID;
    int Priority, ReturnP;

    if (argc < 3) {
        printf(" Usage: fixprocpri priority PID . . . \n");
        exit(1);
    }

    argv++;
    /* The first argument is the fixed priority to set. */
    Priority = atoi(*argv++);
    if ( Priority < 50 ) {
        printf(" Priority must be >= 50 \n");
        exit(1);
    }

    /* The remaining arguments are process IDs. */
    while (*argv) {
        ProcessID = atoi(*argv++);
        ReturnP = setpri(ProcessID, Priority);
        if ( ReturnP > 0 )
            printf("pid=%d new pri=%d old pri=%d\n",
                (int)ProcessID, Priority, ReturnP);
        else {
            perror(" setpri failed ");
            exit(1);
        }
    }
    return 0;
}
# ps -lu user1
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
241801 S 200 7032 7286 0 60 20 1b4c 108 pts/2 0:00 ksh
200801 S 200 7568 7032 0 70 25 2310 88 5910a58 pts/2 0:00 vmstat
241801 S 200 8544 6494 0 60 20 154b 108 pts/0 0:00 ksh
The output shows the result of the nice -n 5 command described previously. Process 7568 has an
inferior priority of 70. (The ps command was run by a separate session in superuser mode, hence the
presence of two TTYs.)
If one of the processes had used the setpri(10758, 59) subroutine to give itself a fixed priority, the
ps -l output for that process would show a PRI value of 59 and a NI value of --. The following example
uses the renice command to give the vmstat process a more favorable priority:
# renice -n -5 7568
# ps -lu user1
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
241801 S 200 7032 7286 0 60 20 1b4c 108 pts/2 0:00 ksh
200801 S 200 7568 7032 0 60 20 2310 92 5910a58 pts/2 0:00 vmstat
241801 S 200 8544 6494 0 60 20 154b 108 pts/0 0:00 ksh
Now the process is running at a more favorable priority that is equal to the other foreground processes. To
undo the effects of this, you could issue the following:
# renice -n 5 7568
# ps -lu user1
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
241801 S 200 7032 7286 0 60 20 1b4c 108 pts/2 0:00 ksh
200801 S 200 7568 7032 1 70 25 2310 92 5910a58 pts/2 0:00 vmstat
241801 S 200 8544 6494 0 60 20 154b 108 pts/0 0:00 ksh
In these examples, the renice command was run by the root user. When run by an ordinary user ID,
there are two major limitations to the use of the renice command:
• Only processes owned by that user ID can be specified.
• The nice value of the process cannot be decreased, not even to return the process to the default
priority after making its priority less favorable with the renice command.
Nice Command    Renice Command    Resulting nice Value    Best Priority Value
nice -n 5       renice -n 5       25                      70
nice -n +5      renice -n +5      25                      70
nice -n -5      renice -n -5      15                      55
Thread-Priority-Value calculation
This section discusses tuning using priority calculation and the schedo command.
The schedo command (see “schedo command ” on page 112) allows you to change some of the CPU scheduler
parameters used to calculate the priority value for each thread. See “Process and thread priority” on page
37 for background information on priority.
To determine whether the schedo program is installed and available, run the following command:
# lslpp -lI bos.perf.tune
Priority calculation
The kernel maintains a priority value (sometimes termed the scheduling priority) for each thread. The
priority value is a positive integer and varies inversely with the importance of the associated thread. That
is, a smaller priority value indicates a more important thread. When the scheduler is looking for a thread
to dispatch, it chooses the dispatchable thread with the smallest priority value.
The formula for calculating the priority value is:
priority value = base priority + nice penalty + (CPU penalty based on recent CPU usage)
The recent CPU usage value of a thread (the C column in ps command output) is multiplied by a factor R
(the sched_R parameter, default 16) to give the CPU penalty:
CPU_penalty = C * R/32
Once a second, the default algorithm divides the recent CPU usage value of every thread by 2. The
recent-CPU-usage-decay factor is therefore 0.5. This factor is controlled by a value called D (the default is
16). The formula is as follows:
C = C * D/32
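For example, with the default D of 16, a thread that has accumulated C = 100 is decayed once a second to
100 * 16/32 = 50. With D = 31, as in the trace example later in this section, the same count decays only
to 100 * 31/32 = 96 (with integer truncation).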
The algorithm for calculating priority value uses the nice value of the process to determine the priority of
the threads in the process. As the units of CPU time increase, the priority decreases with the nice effect.
Setting new values for R and D with the schedo -o sched_R and schedo -o sched_D options gives additional
control over the priority calculation. See “schedo command ” on page 112 for further information.
Begin with the following equation:
p_nice = base priority + nice value
Now use the following rule: if p_nice > 60, then x_nice = (p_nice * 2) - 60; otherwise, x_nice = p_nice.
If the nice value is greater than 20, then it has double the impact on the priority value than if it was
less than or equal to 20. The new priority calculation (ignoring integer truncation) is as follows:
priority value = x_nice + [(x_nice + 4)/64 * C * (R/32)]
schedo command
Tuning is accomplished through two options of the schedo command: sched_R and sched_D.
Each option specifies a parameter that is an integer from 0 through 32. The parameters are applied by
multiplying by the parameter's value and then dividing by 32. The default R and D values are 16, which
yields the same behavior as the original algorithm [(D=R=16)/32=0.5]. The new range of values permits a
far wider spectrum of behaviors. For example:
# schedo -o sched_R=0
[(R=0)/32=0, (D=16)/32=0.5] would mean that the CPU penalty was always 0, making priority a function
of the nice value only. No background process would get any CPU time unless there were no dispatchable
foreground processes at all. The priority values of the threads would effectively be constant, although
they would not technically be fixed-priority threads.
# schedo -o sched_R=5
[(R=5)/32=0.15625, (D=16)/32=0.5] would mean that a foreground process would never have to
compete with a background process started with the command nice -n 10. The limit of 120 CPU time
slices accumulated would mean that the maximum CPU penalty for the foreground process would be 18.
# schedo -o sched_R=6
[(R=6)/32=0.1875, (D=16)/32=0.5] would mean that, if the background process were started with the
command nice -n 10, it would be at least one second before the background process began to
receive any CPU time. Foreground processes, however, would still be distinguishable on the basis of CPU
usage.
# schedo -o sched_R=32 -o sched_D=32
[(R=32)/32=1, (D=32)/32=1] would mean that long-running threads would reach a C value of 120 and
remain there, contending on the basis of their nice values. New threads would have priority, regardless
of their nice value, until they had accumulated enough time slices to bring them within the priority value
range of the existing threads.
Here are some guidelines for R and D:
• Smaller values of R narrow the priority range and make the nice value have more of an impact on the
priority.
• Larger values of R widen the priority range and make the nice value have less of an impact on the
priority.
• Smaller values of D decay CPU usage at a faster rate and can cause CPU-intensive threads to be
scheduled sooner.
• Larger values of D decay CPU usage at a slower rate and penalize CPU-intensive threads more (thus
favoring interactive-type threads).
current_effective_priority
| base process priority
| | nice value
| | | count (time slices consumed)
| | | | (schedo -o sched_R)
| | | | |
time 0 p = 40 + 20 + (0 * 4/32) = 60
time 10 ms p = 40 + 20 + (1 * 4/32) = 60
time 20 ms p = 40 + 20 + (2 * 4/32) = 60
time 30 ms p = 40 + 20 + (3 * 4/32) = 60
time 40 ms p = 40 + 20 + (4 * 4/32) = 60
time 50 ms p = 40 + 20 + (5 * 4/32) = 60
time 60 ms p = 40 + 20 + (6 * 4/32) = 60
time 70 ms p = 40 + 20 + (7 * 4/32) = 60
time 80 ms p = 40 + 20 + (8 * 4/32) = 61
time 90 ms p = 40 + 20 + (9 * 4/32) = 61
time 100ms p = 40 + 20 + (10 * 4/32) = 61
.
(skipping forward to 1000msec or 1 second)
.
time 1000ms p = 40 + 20 + (100 * 4/32) = 72
time 1000ms swapper recalculates the accumulated CPU usage counts of
all processes. For the above process:
new_CPU_usage = 100 * 31/32 = 96 (if d=31)
after decaying by the swapper: p = 40 + 20 + ( 96 * 4/32) = 72
(if d=16, then p = 40 + 20 + (100/2 * 4/32) = 66)
time 1010ms p = 40 + 20 + ( 97 * 4/32) = 72
time 1020ms p = 40 + 20 + ( 98 * 4/32) = 72
time 1030ms p = 40 + 20 + ( 99 * 4/32) = 72
..
time 1230ms p = 40 + 20 + (119 * 4/32) = 74
time 1240ms p = 40 + 20 + (120 * 4/32) = 75 count <= 120
time 1250ms p = 40 + 20 + (120 * 4/32) = 75
time 1260ms p = 40 + 20 + (120 * 4/32) = 75
..
time 2000ms p = 40 + 20 + (120 * 4/32) = 75
time 2000ms swapper recalculates the counts of all processes.
For above process 120 * 31/32 = 116
time 2010ms p = 40 + 20 + (117 * 4/32) = 74
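For reference, the parameter values used in this trace could be set (until changed again or until reboot)
as follows; sched_R=4 and sched_D=31 match the example above:
# schedo -o sched_R=4 -o sched_D=31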
Memory performance
This section describes how memory use can be measured and modified.
The memory of a system is almost constantly filled to capacity. Even if currently running programs do
not consume all available memory, the operating system retains in memory the text pages of programs
that ran earlier and the files that they used. There is no cost associated with this retention, because the
memory would have been unused anyway. In many cases, the programs or files will be used again, which
reduces disk I/O.
Memory usage
Several performance tools provide memory usage reports.
The reports of most interest are from the vmstat, ps, and svmon commands.
# vmstat 2 10
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
1 3 113726 124 0 14 6 151 600 0 521 5533 816 23 13 7 57
0 3 113643 346 0 2 14 208 690 0 585 2201 866 16 9 2 73
0 3 113659 135 0 2 2 108 323 0 516 1563 797 25 7 2 66
0 2 113661 122 0 3 2 120 375 0 527 1622 871 13 7 2 79
0 3 113662 128 0 10 3 134 432 0 644 1434 948 22 7 4 67
1 5 113858 238 0 35 1 146 422 0 599 5103 903 40 16 0 44
0 3 113969 127 0 5 10 153 529 0 565 2006 823 19 8 3 70
0 3 113983 125 0 33 5 153 424 0 559 2165 921 25 8 4 63
0 3 113682 121 0 20 9 154 470 0 608 1569 1007 15 8 0 77
0 4 113701 124 0 3 29 228 635 0 674 1730 1086 18 9 0 73
In the example output above, notice the high I/O wait in the output and also the number of threads on the
blocked queue. Other I/O activities might cause I/O wait, but in this particular case, the I/O wait is most
likely due to the paging in and out from paging space.
To see if the system has performance problems with its VMM, examine the columns under memory and
page:
• memory
Provides information about the real and virtual memory.
– avm
The Active Virtual Memory, avm, column represents the number of active virtual memory pages
present at the time the vmstat sample was collected. The deferred page space policy is the default
policy. Under this policy, the value for avm might be higher than the number of paging space pages
used. The avm statistics do not include file pages.
– fre
The fre column shows the average number of free memory pages. A page is a 4 KB area of real
memory. The system maintains a buffer of memory pages, called the free list, that will be readily
accessible when the VMM needs space. The minimum number of pages that the VMM keeps on the
free list is determined by the minfree parameter of the vmo command. For more details, see “VMM
page replacement tuning” on page 137.
When an application terminates, all of its working pages are immediately returned to the free list. Its
persistent pages (file pages), however, remain in RAM and are not added back to the free list until they
are stolen by the VMM for use by other programs.
# vmstat -s
3231543 total address trans. faults
63623 page ins
383540 page outs
149 paging space page ins
832 paging space page outs
0 total reclaims
807729 zero filled pages faults
4450 executable filled pages faults
429258 pages examined by clock
8 revolutions of the clock hand
175846 pages freed by the clock
18975 backtracks
0 lock misses
40 free frame waits
0 extend XPT waits
16984 pending I/O waits
186443 start I/Os
186443 iodones
141695229 cpu context switches
317690215 device interrupts
0 software interrupts
0 traps
55102397 syscalls
The page-in and page-out numbers in the summary represent virtual memory activity to page in or out
pages from page space and file space. The paging space ins and outs are representative of only page
space.
# ps v
PID TTY STAT TIME PGIN SIZE RSS LIM TSIZ TRS %CPU %MEM COMMAND
36626 pts/3 A 0:00 0 316 408 32768 51 60 0.0 0.0 ps v
The most important columns on the resulting ps report are described as follows:
PGIN
Number of page-ins caused by page faults. Since all I/O is classified as page faults, this is basically a
measure of I/O volume.
SIZE
Virtual size (in paging space) in kilobytes of the data section of the process (displayed as SZ by
other flags). This number is equal to the number of working segment pages of the process that have
been touched times 4. If some working segment pages are currently paged out, this number is larger
than the amount of real memory being used. SIZE includes pages in the private segment and the
shared-library data segment of the process.
RSS
Real-memory (resident set) size in kilobytes of the process. This number is equal to the sum of
the number of working segment and code segment pages in memory times 4. Remember that code
segment pages are shared among all of the currently running instances of the program. If 26 ksh
processes are running, only one copy of any given page of the ksh executable program would be
in memory, but the ps command would report that code segment size as part of the RSS of each
instance of the ksh program.
TSIZ
Size of text (shared-program) image. This is the size of the text section of the executable file. Pages
of the text section of the executable program are only brought into memory when they are touched,
that is, branched to or loaded from. This number represents only an upper bound on the amount of
text that could be loaded. The TSIZ value does not reflect actual memory usage. This TSIZ value can
also be seen by executing the dump -ov command against an executable program (for example, dump
-ov /usr/bin/ls).
TRS
Size of the resident set (real memory) of text. This is the number of code segment pages times 4. This
number exaggerates memory use for programs of which multiple instances are running. The TRS value
can be higher than the TSIZ value because other pages may be included in the code segment such as
the XCOFF header and the loader section.
%MEM
Calculated as the sum of the number of working segment and code segment pages in memory times
4 (that is, the RSS value), divided by the size of the real memory in use, in the machine in KB, times
100, rounded to the nearest full percentage point. This value attempts to convey the percentage of
real memory being used by the process. Unfortunately, like RSS, it tends to exaggerate the cost
of a process that is sharing program text with other processes. Further, the rounding to the nearest
percentage point causes all of the processes in the system that have RSS values under 0.005 times the
real memory size to show a %MEM of 0.0.
To determine whether svmon is installed and available, run the following command:
# lslpp -lI bos.perf.tools
The following example shows the global statistics report, repeated twice at a one-second interval:
# svmon -G -i 1 2
Notice that if only 4 KB pages are available on the system, the section that breaks down the information
per page size is not displayed.
The columns on the resulting svmon report are described as follows:
memory
Statistics describing the use of real memory, shown in 4 KB pages.
size
Total size of memory in 4 KB pages.
inuse
Number of pages in RAM that are in use by a process plus the number of persistent pages that
belonged to a terminated process and are still resident in RAM. This value is the total size of
memory minus the number of pages on the free list.
free
Number of pages on the free list.
pin
Number of pages pinned in RAM (a pinned page is a page that is always resident in RAM and
cannot be paged out).
virtual
Number of pages allocated in the process virtual space.
pg space
Statistics describing the use of paging space, shown in 4 KB pages. The value reported is the actual
number of paging-space pages used, which indicates that these pages were paged out to the paging
space. This differs from the vmstat command's avm column, which shows the virtual memory that has been
accessed but not necessarily paged out.
size
Total size of paging space in 4 KB pages.
inuse
Total number of allocated pages.
pin
Detailed statistics on the subset of real memory containing pinned pages, shown in 4 KB frames.
# svmon -P
--------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd 16MB
16264 IBM.ServiceRM 10075 3345 3064 13310 N Y N
--------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd 16MB
17032 IBM.CSMAgentR 9791 3347 3167 12944 N Y N
The command output details both the global memory use per process and also detailed memory use per
segment used by each reported process. The default sorting rule is a decreasing sort based on the Inuse
page count. You can change the sorting rule using the svmon command with either the -u, -p, -g, or -v
flags.
For a summary of the top 15 processes using memory on the system, use the following command:
# svmon -Pt 15
The Pid 16264 is the process ID that has the highest memory consumption. The Command indicates the
command name, in this case IBM.ServiceRM. The Inuse column, which is the total number of pages
in real memory from segments that are used by the process, shows 10075 pages. Each page is 4 KB.
The Pin column, which is the total number of pages pinned from segments that are used by the process,
shows 3345 pages. The Pgsp column, which is the total number of paging-space pages that are used by
the process, shows 3064 pages. The Virtual column (total number of pages in the process virtual space)
shows 13310.
The detailed section displays information about each segment for each process that is shown in the
summary section. This includes the virtual, Vsid, and effective, Esid, segment identifiers. The Esid
reflects the segment register that is used to access the corresponding pages. The type of the segment is
also displayed, along with a textual description of the segment that includes, for persistent segments,
the volume name and i-node of the file. The report also details the size of the pages backing the
segment and the numbers of pages in RAM, pinned in RAM, used in paging space, and allocated in the
virtual space.
The Address Range specifies one range for a persistent or client segment and two ranges for a working
segment. The range for a persistent or a client segment takes the form '0..x,' where x is the maximum
number of virtual pages that have been used. The range field for a working segment can be '0..x :
y..65535', where 0..x contains global data and grows upward, and y..65535 contains stack area and grows
downward. For the address range, in a working segment, space is allocated starting from both ends and
working towards the middle. If the working segment is non-private (kernel or shared library), space is
allocated differently.
In the above example, the segment ID 400 is a private working segment; its address range is 0..969 :
65305..65535. The segment ID f001e is a shared library text working segment; its address range is
0..60123.
A segment can be used by multiple processes. Each page in real memory from such a segment is
accounted for in the Inuse field for each process using that segment. Thus, the total for Inuse may
exceed the total number of pages in real memory. The same is true for the Pgsp and Pin fields. The
values displayed in the summary section consist of the sums of the Inuse, Pin, Pgsp, and Virtual
counters of all segments used by the process.
In the above example, the e83dd segment is used by several processes whose PIDs are 17552, 17290,
17032, 16264, 14968 and 9620.
# svmon -D 38287 -b
Segid: 38287
Type: working
PSize: s (4 KB)
Address Range: 0..484
Size of page space allocation: 2 pages ( 0,0 MB)
Virtual: 18 frames ( 0,1 MB)
Inuse: 16 frames ( 0,1 MB)
The -b flag shows the status of the reference and modified bits of all the displayed frames. After it is
shown, the reference bit of the frame is reset. When used with the -i flag, it detects which frames are
accessed between each interval.
Note: Due to the performance impacts, use the -b flag with caution.
To display the top 10 segments, sorted by real memory usage, use the following command:
# svmon -Sut 10
The following global report can be compared with the output of the vmstat command:
# svmon -G
size inuse free pin virtual
memory 1048576 417374 631202 66533 151468
pg space 262144 31993
The vmstat command was run in a separate window while the svmon command was running. The
vmstat report follows:
# vmstat 3
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
1 5 205031 749504 0 0 0 0 0 0 1240 248 318 0 0 99 0
2 2 151360 631310 0 0 3 3 32 0 1187 1718 641 1 1 98 0
1 0 151366 631304 0 0 0 0 0 0 1335 2240 535 0 1 99 0
1 0 151366 631304 0 0 0 0 0 0 1303 2434 528 1 4 95 0
1 0 151367 631303 0 0 0 0 0 0 1331 2202 528 0 0 99 0
The global svmon report shows related numbers. The fre column of the vmstat command relates to
the memory free column of the svmon command. The Active Virtual Memory, avm, value of the vmstat
command reports is similar to the virtual memory value that the svmon command reports.
Example 1
The following is an example for the svmon command and ps command output:
# ps v 405528
PID TTY STAT TIME PGIN SIZE RSS LIM TSIZ TRS %CPU %MEM COMMAND
405528 pts/0 A 43:11 1 168 172 32768 1 4 99.5 0.0 yes
...............................................................................
EXCLUSIVE segments Inuse Pin Pgsp Virtual
172 16 0 168
The ps command output above displays the SIZE as 168 and RSS as 172. The svmon command output above
provides the values from which both can be derived:
SIZE = Work Process Private Memory Usage in KB + Work Shared Library Data Memory Usage in KB
RSS = SIZE + Text Code Size (Type=clnt, Description=code,)
Using the values in the example above you get the following:
SIZE = 92 + 76 = 168
RSS = 168 + 4 = 172
Example 2
The following is an example for the svmon command and ps command output:
# ps v 282844
PID TTY STAT TIME PGIN SIZE RSS LIM TSIZ TRS %CPU %MEM COMMAND
282844 - A 15:49 322 24604 25280 xx 787 676 0.0 3.0 /opt/rsct/b
...............................................................................
EXCLUSIVE segments Inuse Pin Pgsp Virtual
25308 16 0 24604
The ps command output above displays the SIZE as 24604 and RSS as 25280.
You can use the output values from the svmon command displayed above with the following equations to
calculate the SIZE and RSS:
SIZE = Work Process Private Memory Usage in KB + Work Shared Library Data Memory Usage in KB
RSS = SIZE + Text Code Size (Type=clnt, Description=code,)
Using the values in the example above you get the following:
SIZE = Work Process Private Memory Usage in KB + Work Shared Library Data Memory Usage in KB = 24604
RSS = 24604 + 676 = 25280
The minimum memory requirement of a program can be estimated as follows:
Total memory pages (4 KB units) = T + N * (PD + LD)
where:
T
= Number of pages for text (shared by all users)
N
= Number of copies of this program running simultaneously
PD
= Number of working segment pages in process private segment
LD
= Number of shared library data pages used by the process
Memory-leaking programs
A memory leak is a program error that consists of repeatedly allocating memory, using it, and then
neglecting to free it.
A memory leak in a long-running program, such as an interactive application, is a serious problem,
because it can result in memory fragmentation and the accumulation of large numbers of mostly garbage-
filled pages in real memory and page space. Systems have been known to run out of page space because
of a memory leak in a single program.
A memory leak can be detected with the svmon command, by looking for processes whose working
segment continually grows. A leak in a kernel segment can be caused by an mbuf leak or by a device
driver, kernel extension, or even the kernel. To determine if a segment is growing, use the svmon
command with the -i option to look at a process or a group of processes and see if any segment continues
to grow.
Identifying the offending subroutine or line of code is more difficult, especially in AIXwindows
applications, which generate large numbers of malloc() and free() calls. C++ provides a HeapView
Debugger for analyzing/tuning memory usage and leaks. Some third-party programs exist for analyzing
memory leaks, but they require access to the program source code.
Some uses of the realloc() subroutine, while not actually programming errors, can have the same effect
as a memory leak. If a program frequently uses the realloc() subroutine to increase the size of a data
area, the working segment of the process can become increasingly fragmented if the storage released by
the realloc() subroutine cannot be reused for anything else.
Use the disclaim() system call and free() call to release memory that is no longer required. The disclaim()
system call must be called before the free() call. It wastes CPU time to free memory after the last
malloc() call, if the program will finish soon. When the program terminates, its working segment is
destroyed and the real memory page frames that contained working segment data are added to the
free list. The following example is a memory-leaking program where the Inuse, Pgspace, and Address
Range values of the private working segment are continually growing:
# svmon -P 13548 -i 1 3
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd LPage
13548 pacman 8535 2178 847 8533 N N N
Whenever the rmss command changes memory size, the minperm and maxperm are not adjusted to the
new parameters and the number of lruable pages is not changed to fit the simulated memory size. This
can lead to an unexpected behavior where the buffer cache will grow out of proportion. As a consequence,
the system can run out of memory.
It is important to keep in mind that the memory size simulated by the rmss command is the total size of
the machine's real memory, including the memory used by the operating system and any other programs
that may be running. It is not the amount of memory used specifically by the application itself. Because
of the performance degradation it can cause, the rmss command can be used only by a root user or a
member of the system group.
rmss command
You can use the rmss command to change the memory size and exit or as a driver program that executes
a specified application multiple times over a range of memory sizes and displays important statistics that
describe the application's performance at each memory size.
The first method is useful when you want to get the look and feel of how your application performs at a
given system memory size, when your application is too complex to be expressed as a single command, or
when you want to run multiple instances of the application. The second method is appropriate when you
have an application that can be invoked as an executable program or shell script file.
The calculation of the (after - before) paging I/O numbers can be automated by using the vmstatit
script described in “Disk or memory-related problem” on page 32.
For example, to change the effective memory size to 128 MB, enter:
# rmss -c 128
The memory size is an integer or decimal fraction number of megabytes (for example, 128.25).
Additionally, the size must be between 8 MB and the amount of physical real memory in your machine.
Depending on the hardware and software configuration, the rmss command may not be able to change
the memory size to small sizes, because of the size of inherent system structures such as the kernel.
When the rmss command is unable to change to a given memory size, it displays an error message.
The rmss command reduces the effective memory size of a system by stealing free page frames from
the list of free frames that is maintained by the VMM. The stolen frames are kept in a pool of unusable
frames and are returned to the free frame list when the effective memory size is to be increased. Also,
the rmss command dynamically adjusts certain system variables and data structures that must be kept
proportional to the effective size of memory.
To display the current memory size, use the -p flag, as follows:
# rmss -p
Finally, if you want to reset the memory size to the actual memory size of the machine, use the -r flag, as
follows:
# rmss -r
No matter what the current simulated memory size, using the -r flag sets the memory size to be the
physical real memory size of the machine.
Because this example was run on a 256 MB machine, the rmss command reported a simulated memory size of
256 MB.
Note: The rmss command reports usable real memory. On machines that contain bad memory or memory
that is in use, the rmss command reports the amount of real memory as the amount of physical real
memory minus the memory that is bad or in use by the system. For example, the rmss -r command might
report a simulated memory size smaller than the physical memory installed. This could be a result of some
pages being marked bad or a result of a device that is reserving some pages for its own use, and thus not
available to the user.
Application execution over a range of memory sizes with the rmss command
As a driver program, the rmss command executes a specified application over a range of memory sizes
and displays statistics describing the application's performance at each memory size.
The -s, -f, -d, -n, and -o flags of the rmss command are used in combination to invoke the rmss
command as a driver program. The syntax for this invocation style of the rmss command is as follows:
rmss [ -s smemsize ] [ -f fmemsize ] [ -d memdelta ] [ -n numiterations ] [ -o outputfile ] command
Each of the following flags is discussed in detail below. The -s, -f, and -d flags are used to specify the
range of memory sizes.
range of memory sizes.
-n
This flag is used to specify the number of times to run and measure the command at each memory
size.
-o
This flag is used to specify the file into which to write the rmss report, while command is the
application that you wish to run and measure at each memory size.
-s
This flag specifies the starting size.
-f
This flag specifies the final size.
For example, if you wanted to run and measure a command at 256, 224, 192, 160, and 128 MB, you would use
the following combination:
-s 256 -f 128 -d 32
Likewise, if you wanted to run and measure a command at 128, 160, 192, 224, and 256 MB, you would
use the following combination:
-s 128 -f 256 -d 32
If the -s flag is omitted, the rmss command starts at the actual memory size of the machine. If the -f
flag is omitted, the rmss command finishes at 8 MB. If the -d flag is omitted, there is a default of 8 MB
between memory sizes.
What values should you choose for the -s, -f, and -d flags? A simple choice would be to cover the memory
sizes of systems that are being considered to run the application you are measuring. However, increments
of less than 8 MB can be useful, because you can get an estimate of how much space you will have
when you settle on a given size. For instance, if a given application thrashes at 120 MB but runs without
page-ins at 128 MB, it would be useful to know where within the 120 to 128 MB range the application
starts thrashing. If it starts at 127 MB, you may want to consider configuring the system with more than
128 MB of memory, or you may want to try to modify the application so that there is more space. On
the other hand, if the thrashing starts at 121 MB, you know that you have enough space with a 128 MB
machine.
The -n flag is used to specify how many times to run and measure the command at each memory size.
After running and measuring the command the specified number of times, the rmss command displays
statistics describing the average performance of the application at that memory size. To run the command
3 times at each memory size, you would use the following:
-n 3
If the -n flag is omitted, the rmss command determines during initialization how many times your
application must be run to accumulate a total run time of 10 seconds. The rmss command does this
to ensure that the performance statistics for short-running programs will not be significantly skewed by
outside influences, such as daemons.
Note: If you are measuring a very brief program, the number of iterations required to accumulate 10
seconds of CPU time can be very large. Because each execution of the program takes a minimum of about
2 elapsed seconds of rmss overhead, specify the -n parameter explicitly for short programs.
What are good values to use for the -n flag? If you know that your application takes much more than 10
seconds to run, you can specify -n 1 so that the command is run twice, but measured only once at each
memory size. The advantage of using the -n flag is that the rmss command will finish sooner because it
will not have to spend time during initialization to determine how many times to run your program. This
can be particularly valuable when the command being measured is long-running and interactive.
It is important to note that the rmss command always runs the command once at each memory size as
a warm-up before running and measuring the command. The warm-up is needed to avoid the I/O that
occurs when the application is not already in memory. Although such I/O does affect performance, it is
not necessarily due to a lack of real memory. The warm-up run is not included in the number of iterations
specified by the -n flag.
The -o flag is used to specify a file into which to write the rmss report. If the -o flag is omitted, the report
is written into the file rmss.out.
Finally, command is used to specify the application to be measured. It can be an executable program
or shell script, with or without command-line arguments. There are some limitations on the form of the
command however. First, it cannot contain the redirection of input or output (for example, foo > output or
foo < input). This is because the rmss command treats everything to the right of the command name as
an argument to the command. To redirect, place the command in a shell script file.
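Putting the flags together, an invocation such as the following, with foo standing for the application
being measured, would produce a report of the kind shown below, measuring foo once at each size from 16
MB down to 8 MB in 1 MB steps:
# rmss -s 16 -f 8 -d 1 -n 1 -o rmss.out foo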
Hostname: aixhost1.austin.ibm.com
Real memory size: 16.00 Mb
Time of day: Thu Mar 18 19:04:04 2004
Command: foo
Memory size Avg. Pageins Avg. Response Time Avg. Pagein Rate
(megabytes) (sec.) (pageins / sec.)
-----------------------------------------------------------------
16.00 115.0 123.9 0.9
15.00 112.0 125.1 0.9
14.00 179.0 126.2 1.4
13.00 81.0 125.7 0.6
12.00 403.0 132.0 3.1
11.00 855.0 141.5 6.0
10.00 1161.0 146.8 7.9
9.00 1529.0 161.3 9.5
8.00 2931.0 202.5 14.5
The report consists of four columns. The leftmost column gives the memory size, while the Avg. Pageins
column gives the average number of page-ins that occurred when the application was run at that memory
size. It is important to note that the Avg. Pageins column refers to all page-in operations, including code,
data, and file reads, from all programs, that completed while the application ran. The Avg. Response Time
column gives the average amount of time it took the application to complete, while the Avg. Pagein Rate
column gives the average rate of page-ins.
Concentrate on the Avg. Pagein Rate column. From 16 MB to 13 MB, the page-in rate is relatively small
(< 1.5 page-ins per second). However, from 13 MB to 8 MB, the page-in rate grows gradually at first, and
then rapidly as 8 MB is reached. The Avg. Response Time column has a similar shape: relatively flat at first,
then increasing gradually, and finally increasing rapidly as the memory size is decreased to 8 MB.
Here, the page-in rate actually decreases when the memory size changes from 14 MB (1.4 page-ins
per second) to 13 MB (0.6 page-ins per second). This is not cause for alarm. In an actual system, it is
impossible to expect the results to be perfectly smooth. The important point is that the page-in rate is
relatively low at both 14 MB and 13 MB.
Finally, you can make a couple of deductions from the report. First, if the performance of the application
is deemed unacceptable at 8 MB (as it probably would be), then adding memory would enhance
performance significantly. Note that the response time rises from approximately 124 seconds at 16 MB
to 202 seconds at 8 MB, an increase of 63 percent. On the other hand, if the performance is deemed
unacceptable at 16 MB, adding memory will not enhance performance much, because page-ins do not
slow the program appreciably at 16 MB.
Hostname: aixhost2.austin.ibm.com
Real memory size: 48.00 Mb
Time of day: Mon Mar 22 18:16:42 2004
Command: cp /mnt/a16Mfile /dev/null
Memory size Avg. Pageins Avg. Response Time Avg. Pagein Rate
(megabytes) (sec.) (pageins / sec.)
-----------------------------------------------------------------
48.00 0.0 2.7 0.0
40.00 0.0 2.7 0.0
32.00 0.0 2.7 0.0
24.00 1520.8 26.9 56.6
16.00 4104.2 67.5 60.8
8.00 4106.8 66.9 61.4
The response time and page-in rate in this report start relatively low, rapidly increase at a memory size of
24 MB, and then reach a plateau at 16 and 8 MB. This report shows the importance of choosing a wide
range of memory sizes when you use the rmss command. If this user had only looked at memory sizes
from 24 MB to 8 MB, he or she might have missed an opportunity to configure the system with enough
memory to accommodate the application without page-ins.
Hints for usage of the -s, -f, -d, -n, and -o flags
One helpful feature of the rmss command, when used in this way, is that it can be terminated with the
interrupt key (Ctrl + C by default) without destroying the report that has been written to the output file. In
addition to writing the report to the output file, this causes the rmss command to reset the memory size
to the physical memory size of the machine.
You can run the rmss command in the background, even after you have logged out, by using the nohup
command. To do this, precede the rmss command by the nohup command, and follow the entire
command with an & (ampersand), as follows:
# nohup rmss -s 48 -f 8 -o rmss.out foo &
# schedo -a
v_repage_hi = 0
v_repage_proc = 4
v_sec_wait = 1
v_min_process = 2
v_exempt_secs = 2
pacefork = 10
sched_D = 16
sched_R = 16
timeslice = 1
maxspin = 1
%usDelta = 100
affinity_lim = n/a
idle_migration_barrier = n/a
fixed_pri_global = n/a
big_tick_size = 1
force_grq = n/a
The first five parameters specify the thresholds for the memory load control algorithm; they set its
rates and thresholds. If the algorithm shows that RAM is overcommitted, the v_repage_proc,
v_min_process, v_sec_wait, and v_exempt_secs values are used; otherwise, they are ignored. If memory
load control is disabled (v_repage_hi=0), these latter values are not used.
After a tuning experiment, memory load control can be reset to its default characteristics by executing the
command schedo -D.
The schedo -o v_repage_hi=0 command effectively disables memory load control. If a system has at
least 128 MB of memory, the default value is 0, otherwise the default value is 6. With at least 128 MB
of RAM, the normal VMM algorithms usually correct thrashing conditions on the average more efficiently
than by using memory load control.
In some specialized situations, it might be appropriate to disable memory load control from the outset.
For example, if you are using a terminal emulator with a time-out feature to simulate a multiuser
workload, memory load control intervention may result in some responses being delayed long enough for
the process to be killed by the time-out feature. Another example: if you are using the rmss command
to investigate the effects of reduced memory sizes, disable memory load control to avoid interference
with your measurement.
If disabling memory load control results in more, rather than fewer, thrashing situations (with
correspondingly poorer responsiveness), then memory load control is playing an active and supportive
role in your system. Tuning the memory load control parameters then may result in improved
performance or you may need to add RAM.
A lower value of v_repage_hi raises the thrashing detection threshold; that is, the system is allowed to
come closer to thrashing before processes are suspended. Regardless of the system configuration, when
the po/fr fraction (the ratio of page-outs to page steals) is low, thrashing is unlikely.
To alter the threshold to 4, enter the following:
# schedo -o v_repage_hi=4
In this way, you permit the system to come closer to thrashing before the algorithm starts suspending
processes.
The default value of v_repage_proc is 4, meaning that a process is considered to be thrashing (and a
candidate for suspension) when the fraction of repages to page faults over the last second is greater than
25 percent. A low value of v_repage_proc results in a higher degree of individual process thrashing being
allowed before a process is eligible for suspension.
To disable processes from being suspended by the memory load control, do the following:
# schedo -o v_repage_proc=0
Note that fixed-priority processes and kernel processes are exempt from being suspended.
# schedo -o v_min_process=4
On these systems, setting the v_min_process parameter to 4 or 6 may result in the best performance.
Lower values of v_min_process, while allowed, mean that at times as few as one user process may be
active.
When the memory requirements of the thrashing application are known, the v_min_process value can be
suitably chosen. Suppose thrashing is caused by numerous instances of one application of size M. Given
the system memory size N, the v_min_process parameter should be set to a value close to N/M. Setting the
v_min_process value too low would unnecessarily limit the number of processes that could be active at
the same time.
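As an illustration (the numbers are hypothetical), on a system with N = 512 MB of real memory running
many instances of a 100 MB application, the v_min_process parameter would be set near 512/100, that is,
to 5.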
The v_sec_wait parameter controls the number of seconds to wait, after thrashing ends, before
reactivating suspended processes. To change it to 2 seconds, enter:
# schedo -o v_sec_wait=2
The v_exempt_secs parameter controls the number of seconds during which a recently resumed process is
exempt from suspension. To change it to 1 second, enter:
# schedo -o v_exempt_secs=1
Suppose thrashing is caused occasionally by an application that uses lots of memory but runs for about T
seconds. The default system setting of 2 seconds for the v_exempt_secs parameter probably causes this
application to be swapped in and out about T/2 times on a busy system. In this case, resetting the
v_exempt_secs parameter to a longer time helps this application progress. System performance improves
when this offending application is pushed through quickly.
Executing the vmo command with the -a option displays the current parameter settings.
Note: The vmo command is a self documenting command. You might get different output than the sample
output provided here.
# vmo -a
ame_cpus_per_pool = n/a
ame_maxfree_mem = n/a
ame_min_ucpool_size = n/a
ame_minfree_mem = n/a
ams_loan_policy = n/a
enhanced_affinity_affin_time = 1
enhanced_affinity_vmpool_limit = 10
esid_allocator = 1
force_relalias_lite = 0
kernel_heap_psize = 65536
lgpg_regions = 0
lgpg_size = 0
low_ps_handling = 1
maxfree = 1088
maxperm = 843105
maxpin = 953840
maxpin% = 90
memory_frames = 1048576
memplace_data = 0
memplace_mapped_file = 0
memplace_shm_anonymous = 0
memplace_shm_named = 0
memplace_stack = 0
memplace_text = 0
memplace_unmapped_file = 0
minfree = 960
minperm = 28103
minperm% = 3
msem_nlocks = 0
nokilluid = 0
npskill = 1024
npswarn = 4096
num_locks_per_semid = 1
numpsblks = 131072
pgz_lpgrow = 2
pgz_mode = 2
# vmstat 1
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 70668 414 0 0 0 0 0 0 178 7364 257 35 14 0 51
1 0 70669 755 0 0 0 0 0 0 196 19119 272 40 20 0 41
1 0 70704 707 0 0 0 0 0 0 190 8506 272 37 8 0 55
1 0 70670 725 0 0 0 0 0 0 205 8821 313 41 10 0 49
6 4 73362 123 0 5 36 313 1646 0 361 16256 863 47 53 0 0
5 3 73547 126 0 6 26 152 614 0 324 18243 1248 39 61 0 0
4 4 73591 124 0 3 11 90 372 0 307 19741 1287 39 61 0 0
6 4 73540 127 0 4 30 122 358 0 340 20097 970 44 56 0 0
8 3 73825 116 0 18 22 220 781 0 324 16012 934 51 49 0 0
8 4 74309 26 0 45 62 291 1079 0 352 14674 972 44 56 0 0
2 9 75322 0 0 41 87 283 943 0 403 16950 1071 44 56 0 0
5 7 75020 74 0 23 119 410 1611 0 353 15908 854 49 51 0 0
The default value of the minfree parameter is increased to 960 per memory pool and the default value of
the maxfree parameter is increased to 1088 per memory pool.
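For example, these thresholds can be set explicitly and persistently with the vmo command (values as
cited above; appropriate values depend on the number of memory pools):
# vmo -p -o minfree=960 -o maxfree=1088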
List-based LRU
The LRU algorithm uses lists. In earlier versions of AIX, the page frame table method was also available.
The list-based algorithm provides a list of pages to scan for each type of segment.
The following is a list of the types of segments:
• Working
• Persistent
• Client
• Compressed
If WLM is enabled, there are lists for classes as well.
# vmstat -v
1048576 memory pages
936784 lruable pages
683159 free pages
1 memory pools
267588 pinned pages
90.0 maxpin percentage
3.0 minperm percentage
90.0 maxperm percentage
5.6 numperm percentage
52533 file pages
0.0 compressed percentage
0 compressed pages
5.6 numclient percentage
90.0 maxclient percentage
52533 client pages
0 remote pageouts scheduled
0 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
31 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf
29.8 percentage of memory used for computational pages
The numperm percentage value shows the percentage of real memory used by file segments. The value
5.6% corresponds to 52533 file pages in memory.
The npskill value must be greater than zero and less than the total number of paging space pages
on the system. It can be changed with the command vmo -o npskill=value.
nokilluid
By setting the nokilluid option to a nonzero value with the vmo -o nokilluid command, user IDs lower
than this value will be exempt from being killed because of low page-space conditions. For example,
if nokilluid is set to 1, processes owned by root will be exempt from being killed when the npskill
threshold is reached.
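For example, to exempt processes owned by root (user ID 0), set the threshold to 1:
# vmo -o nokilluid=1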
When fork() calls fail because paging space is low, the retry behavior can be tuned with the pacefork
parameter, which specifies the number of clock ticks to wait before retrying a failed fork() call. For
example, to increase it to 15 clock ticks, enter:
# schedo -o pacefork=15
In this way, when the system retries the fork() call, there is a higher chance of success because some
processes might have finished their execution and, consequently, released pages from paging space.
Shared memory
By using the shmat() or mmap() subroutines, files can be explicitly mapped into memory. This process
avoids buffering and avoids system-call overhead.
The memory areas are known as shared memory segments or regions. For the 32-bit applications that were
affected, segment 14 was released, providing 11 shared memory segments (segments 3 through 12 and
segment 14) that do not include the shared library data or shared library text segments. Each of these
segments is 256 MB in size. Applications can read or write the file by reading or writing in the
segment, and can avoid the overhead of read and write system calls by manipulating pointers in these
mapped segments.
Files or data can also be shared among multiple processes or threads. However, this requires
synchronization between those processes or threads, and the handling of such requests is up to the
application. A typical use of shared memory is by database applications, which use it as a large
database buffer cache.
Paging space is allocated for shared memory regions similar to the process private segment. The paging
space is used when the pages are accessed, if deferred page space allocation policy is turned off.
1 TB Segment Aliasing
1 TB segment aliasing improves performance by using 1 TB segment translations on Shared Memory
Regions with 256 MB segment size. This support is provided for all 64-bit applications that use Shared
Memory Regions. Both directed and undirected shared memory attachments are eligible for 1 TB segment
aliasing.
If an application qualifies to have its Shared Memory Regions to use 1 TB aliases, the AIX operating
system uses 1 TB segments translations without changing the application. This requires using the
shm_1tb_shared VMO tunable, shm_1tb_unshared VMO tunable, and esid_allocator VMO
tunable.
The shm_1tb_shared VMO tunable can be set on a per-process basis using the "SHM_1TB_SHARED="
VMM_CNTRL environment variable. The default value is set dynamically at boot time, based on the
capabilities of the processor. If a single Shared Memory Region has the required number of ESIDs,
it is automatically changed to a shared alias. The acceptable values are in the range of 0 to 4 KB
(approximately the number of 256 MB ESIDs in a 1 TB range).
The shm_1tb_unshared VMO tunable can be set on a per-process basis using
the "SHM_1TB_UNSHARED=" VMM_CNTRL environment variable. The default value is set to 256. The
acceptable values are in a range of 0 to 4 KB. This tunable sets the threshold number of 256 MB
segments at which a group of shared memory regions is promoted to use an unshared 1 TB alias. The
default of 256 is set cautiously, requiring the population of up to 64 GB of address space before
moving to an unshared 1 TB alias. Lower values promote shared memory regions to 1 TB aliases more
aggressively. This can lower the segment look-aside buffer (SLB) misses but can also increase the page
table entry (PTE) misses, if many shared memory regions that are not used across processes are aliased.
The esid_allocator VMO tunable can be set on a per-process basis using the "ESID_ALLOCATOR="
VMM_CNTRL environment variable. The default value is set to 0 for AIX Version 6.1 and 1 for AIX Version
7.0. Values can be either 0 or 1. When set to 0, the old allocator for undirected attachments is enabled.
Otherwise, a new address space allocation policy is used for undirected attachments. This new address
space allocator attaches any undirected allocation (such as SHM, and MMAP) to a new address range of
0x0A00000000000000 - 0x0AFFFFFFFFFFFFFF in the address space of the application. The allocator
optimizes the allocations in order to provide the best possible chances of 1 TB alias promotion. Such
optimization can result in address space "holes", which are considered normal when using undirected
attachments. Directed attachments is done for 0x0700000000000000 - 0x07FFFFFFFFFFFFFF range,
thus preserving compatibility with earlier version. In certain cases where this new allocation policy
creates a binary compatibility issue, the legacy allocator behavior can be restored by setting the tunable
to 0.
Shared memory regions that were not qualified for shared alias promotion are grouped into 1 TB regions.
In a group of shared memory regions in a 1 TB region of the application's address space, if the application
exceeds the threshold value of 256 MB segments, they are promoted to use an unshared 1 TB alias. In
applications where the shared memory is frequently attached and detached, lower values of the unshared
alias threshold result in performance degradation.
Multiple tunables can be specified in the VMM_CNTRL environment variable by separating them with the at
sign (@), as follows:
VMM_CNTRL=SHM_1TB_UNSHARED=32@SHM_1TB_SHARED=5
All environment variable settings are inherited by the child on a fork(), and initialized to the system
default values at exec(). 32-bit applications are not affected by either the VMO tunables or the
environment variable tunables.
All VMO tunables and environment variables have analogous vm_pattr commands. The exception is the
esid_allocator tunable, which is not present in the vm_pattr options, to avoid situations where
portions of the shared memory address space are allocated before running the command.
To request memory-affinity placement, set the MEMORY_AFFINITY environment variable, as follows:
MEMORY_AFFINITY=MCM
This behavior is propagated across a fork. However, for this behavior to be retained across a call to the
exec function, the variable must be contained in the environment string passed to the exec function call.
Related information
The vmo command and “VMM page replacement tuning” on page 137.
The bindprocessor command or subroutine.
WLM Class and Resource Set Attributes.
Value Behavior
MCM Private memory is local and shared memory is local.
SHM=RR Both System V and Posix Real-Time shared memory are striped across the MCMs.
Applies to 4 KB and large-page-backed shared memory objects. This value is only
valid for the 64-bit kernel and if the MCM value is also defined.
LRU=EARLY The LRU daemon starts on local memory as soon as low thresholds, such as the
minfree parameter, are reached. It does not wait for all the system pools to reach the
low thresholds. This value is only valid if the MCM value is also defined.
You can set multiple values for the MEMORY_AFFINITY environment variable by separating each value
with the at sign, (@).
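For example, combining the values from the table above:
MEMORY_AFFINITY=MCM@SHM=RR@LRU=EARLY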
Large pages
The main purpose for large page usage is to improve system performance for high performance
computing (HPC) applications or any memory-access-intensive application that uses large amounts
of virtual memory. The improvement in system performance stems from the reduction of translation
lookaside buffer (TLB) misses due to the ability of the TLB to map to a larger virtual memory range.
Large pages also improve memory prefetching by eliminating the need to restart prefetch operations on 4
KB boundaries. AIX supports large page usage by both 32-bit and 64-bit applications.
The POWER4 large page architecture requires all the virtual pages in a 256 MB segment to be the same
size. AIX supports this architecture by using a mixed mode process model such that some segments in
a process are backed with 4 KB pages, while other segments are backed with 16 MB pages. Applications
can request that their heap segments or memory segments be backed with large pages. For detailed
information, refer to “Application configuration for large pages” on page 150.
AIX maintains separate 4 KB and 16 MB physical memory pools. You can specify the amount of physical
memory in the 16 MB memory pool using the vmo command. The large page pool is dynamic, so the
amount of physical memory that you specify takes effect immediately and does not require a system
reboot. The remaining physical memory backs the 4 KB virtual pages.
AIX treats large pages as pinned memory. AIX does not provide paging support for large pages. The
data of an application that is backed by large pages remains in physical memory until the application
completes. A security access control mechanism prevents unauthorized applications from using large
pages or large page physical memory. The security access control mechanism also prevents unauthorized
users from using large pages for their applications. For non-root user ids, you must enable the
CAP_BYPASS_RAC_VMM capability with the chuser command in order to use large pages. The following
example demonstrates how to grant the CAP_BYPASS_RAC_VMM capability as the superuser:
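# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user id>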
If you decide to no longer use large pages to back the data and heap segments, use the following
command to clear the large page flag:
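# ldedit -bnolpdata <file name>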
You can also set the blpdata option when linking and binding with the cc command.
Mandatory mode
In mandatory mode, if an application requests a heap segment and there are not enough large pages to
satisfy the request, the allocation request fails, which causes most applications to terminate with an error.
If you use the mandatory mode, you must monitor the size of the large page pool and ensure that the pool
does not run out of large pages. Otherwise, your mandatory mode large page applications fail.
To use large pages for shared memory, you must enable the SHM_PIN shmget() system call with the
following command, which persists across system reboots:
# vmo -p -o v_pinshm=1
To see how many large pages are in use on your system, use the vmstat -l command as in the following
example:
# vmstat -l
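Output similar to the following is displayed (the values shown here are illustrative; the alp and flp
columns at the right report active and free large pages):
kthr   memory           page             faults        cpu       large-page
----- ----------- -------------------- ------------ ----------- -----------
 r  b   avm   fre re pi po fr  sr  cy  in   sy  cs us sy id wa  alp  flp
 1  1 52238 12434  0  0  0  0   0   0 142  41  73  0  1 99  0   16   16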
From the above example, you can see that there are 16 active large pages, alp, and 16 free large pages,
flp.
As with all previous versions of AIX, 4 KB is the default page size. A process will continue to use 4 KB
pages unless a user specifically requests another page size be used.
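For example, a command such as the following configures a pool of 16 MB large pages
(lgpg_regions=64 is an illustrative value):
# vmo -r -o lgpg_regions=64 -o lgpg_size=16777216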
In the example, the large page configuration changes will not take effect until you run the bosboot
command and reboot the system. On systems that support dynamic logical partitioning (DLPAR), the -r
option can be omitted from the previous command to dynamically configure 16 MB large pages without
rebooting the system. For more information about 16 MB large pages, see “Large pages” on page 149.
Region ld or ldedit option LDR_CNTRL environment variable Description
Data -bdatapsize DATAPSIZE Initialized data, bss, heap
Stack -bstackpsize STACKPSIZE Initial thread stack
Text -btextpsize TEXTPSIZE Main executable text
Shared memory -bshmpsize SHMPSIZE Shared memory allocated by the process
Note: The -bshmpsize flag is supported only for 64-bit processes. In 32-bit process mode, the
-bshmpsize flag is ignored and a warning message is printed.
You can specify a different page size to use for each of the four regions of a process's address space. For
both interfaces, a page size should be specified in bytes. The specified page size may be qualified with a
suffix to indicate the unit of the size. The supported suffixes are K (kilobytes), M (megabytes), G
(gigabytes), and T (terabytes).
Setting the preferred page sizes of an application with the ldedit or ld commands
You can set an application's preferred page sizes in its XCOFF/XCOFF64 binary with the ldedit or ld
commands.
The ld or cc commands can be used to set these page size options when you are linking an executable
file:
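For example (the file names and page sizes are illustrative):
cc -o mpsize.out -btextpsize:4K -bdatapsize:64K -bstackpsize:64K sample.c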
The ldedit command can be used to set these page size options in an existing executable file:
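For example (the file name and page sizes are illustrative):
ldedit -btextpsize=4K -bdatapsize=64K -bstackpsize=64K mpsize.out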
Note: The ldedit command requires that the value for a page size option be specified with an equal sign
(=), but the ld and cc commands require the value for a page size option be specified with a colon (:).
$ LDR_CNTRL=DATAPSIZE=4K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K mpsize.out
The page size environment variables override any page size settings in an executable's XCOFF header.
Also, the DATAPSIZE environment variable overrides any LARGE_PAGE_DATA environment variable
setting.
# ps -Z
PID TTY TIME DPGSZ SPGSZ TPGSZ CMD
311342 pts/4 0:00 4K 4K 4K ksh
397526 pts/4 0:00 4K 4K 4K ps
487558 pts/4 0:00 64K 64K 4K sleep
# vmstat -P all
System configuration: mem=4096MB
pgsz memory page
----- -------------------------- ------------------------------------
siz avm fre re pi po fr sr cy
4K 542846 202832 329649 0 0 0 0 0 0
64K 31379 961 30484 0 0 0 0 0 0
# svmon -G
size inuse free pin virtual
memory 8208384 5714226 2494158 453170 5674818
pg space 262144 20653
Syntax
CPOagent [-f configuration file]
Flag
Item Description
-f Changes the default configuration file name. If this option is not specified, the
file name is assumed to be /usr/lib/perf/CPOagent.cf.
TCPU=<n1>
TMEM=<n2>
PATI=<n3>
PATM=<n4>
PPTS=<n5>
TOPM=<n6>
PFLR=<c>
Fields Description
TCPU Specifies the CPU usage threshold per process, in percentage
Default: 25
Minimum: 10
Maximum: 100
PATI Specifies the page analysis time interval (PATI), in minutes. It specifies the time
interval at which candidate processes are analyzed to identify the pages that
can be consolidated and promoted to a higher size.
Default: 15
Minimum: 5
Maximum: 60
PATM Specifies the page analysis time monitor (PATM), in seconds. It specifies the
amount of time for which page usage statistics are collected to identify candidate
pages for page consolidation and promotion.
Default: 30
Minimum: 5
Maximum: 180
PPTS Specifies the page promotion trigger samples (PPTS). It specifies the number of
samples to be collected before triggering a page promotion.
Default: 4
Minimum: 4
PFLR Specifies the process filter wildcard (PFLR). Processes whose names match the
wildcard are considered by CPOagent for page consolidation and promotion.
Suppose that the CPOagent is started with the following sample CPOagent.cf file:
--------------------------------------------
TCPU=25
TMEM=50
PATI=15
PATM=30
PPTS=4
TOPM=2
PFLR=Stress*
--------------------------------------------
According to the configuration file, CPOagent cycles every 15 minutes (PATI=15). During each 15-minute
interval, it monitors the CPU and memory usage of the processes that are running. The top 2 processes
(TOPM=2) whose names match Stress (PFLR=Stress*), whose CPU usage exceeds 25% (TCPU=25), and
whose memory usage exceeds 50 MB (TMEM=50) are the candidates for page consolidation and
promotion. CPOagent verifies the candidates by collecting four samples (PPTS=4) before triggering the
algorithm for page consolidation and promotion. Additionally, the page usage statistics are collected for
30 seconds (PATM=30) to identify the candidate pages for page consolidation and promotion.
The AIX operating system maintains a history of disk activity. If the disk I/O history is disabled, the
following message is displayed when you run the iostat command:
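Disk history since boot not available.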
To enable disk I/O history, from the command line enter smit chgsys and then select true from the
Continuously maintain DISK I/O history field.
# cp bigfile /dev/null
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
1.2 9.6 0.6 1.4 98.0 0.0 0.0 2.7 13:26:46
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
0.2 3.6 0.3 13.8 75.1 10.8 0.2 16.8 13:26:51
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
0.0 0.0 0.5 1.5 97.9 0.1 0.0 2.8 13:26:56
Note: If the iostat command is run without specifying a time interval, the output indicates a
summary of the system data since the last system reboot, and not the current values.
• The first and third intervals show that the three disks were mostly idle, along with the CPU utilization,
which is also shown as idle in the tty report.
• The second interval shows the activity that is generated by using the cp command, which was
started for this test. This activity can be seen both in the CPU activity (the tty report, which shows
13.8% sys CPU) and in the disk report. The cp command took 3.14 seconds to run during
this interval. In the report, the second interval shows 62.8% for the hdisk0 disk under the
tm_act metric. This means that the hdisk0 disk was busy for 62.8% of the time interval
(5 seconds). If the cp command is the only process generating disk I/O to hdisk0, then the cp
command took 62.8% of the 5-second interval, or 3.14 seconds, which is the total time the cp
command took to run.
2. This example stores five data samples, with a 2-second interval between samples, on a system that
has three disks, in the io.out2 file. A command such as the following might be used (the -T flag
produces the time stamps shown in the output):
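# iostat -T 2 5 > io.out2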
# cp bigfile /dev/null
System configuration: lcpu=4 drives=4 ent=1.00 paths=3 vdisks=2
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
3.0 24.0 0.4 0.8 98.8 0.0 0.0 1.8 13:29:51
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
0.5 1.0 0.2 8.2 85.5 6.1 0.1 10.1 13:29:53
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
0.0 0.0 0.2 21.5 62.9 15.4 0.3 25.7 13:29:55
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
0.0 8.0 1.3 7.2 87.5 4.0 0.1 10.4 13:29:57
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc time
0.0 0.0 0.2 0.6 99.2 0.0 0.0 1.3 13:29:59
• The first and fifth intervals show that the three disks were mostly idle, along with the CPU utilization,
which is also shown as idle in the tty report.
• The second interval shows the activity that is generated by using the cp command, which was started
for this test. The cp command took 3.14 seconds to run during this interval. In the report, the second
interval shows 39.5% for the hdisk0 disk under the tm_act metric. The third and fourth intervals show
100% and 20.9%, respectively, for the hdisk0 disk under the tm_act metric. This means that the
hdisk0 disk was busy for 100% of the time interval (2 seconds) during the third interval and for 20.9%
of the time interval (2 seconds) during the fourth interval.
Both examples show that the %tm_act variable indicates only that the disk was busy. You cannot use
the %tm_act variable to evaluate a performance problem. To troubleshoot an issue, you might need to
consider other options, such as running the iostat command with the -D flag, which provides real service
times (both read and write) and queuing information for the disks on the system.
Related concepts
The iostat command
The iostat command is the fastest way to get a first impression, whether or not the system has a disk
I/O-bound performance problem.
Related reference
Continuous performance monitoring with the iostat command
The iostat command is useful for determining disk and CPU usage.
TTY report
The two columns of TTY information (tin and tout) in the iostat output show the number of characters
read and written by all TTY devices.
This includes both real and pseudo TTY devices. Real TTY devices are those connected to an
asynchronous port. Some pseudo TTY devices are shells, telnet sessions, and aixterm windows.
Because the processing of input and output characters consumes CPU resources, look for a correlation
between increased TTY activity and CPU utilization. If such a relationship exists, evaluate ways to improve
the performance of the TTY subsystem. Steps that could be taken include changing the application
program, modifying TTY port parameters during file transfer, or perhaps upgrading to a faster or more
efficient asynchronous communications adapter.
Drive report
The drive report contains performance-related information for physical drives.
When you suspect a disk I/O performance problem, use the iostat command. To avoid the information
about the TTY and CPU statistics, use the -d option. In addition, the disk statistics can be limited to the
important disks by specifying the disk names.
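For example, the following command reports statistics for two specific disks (the names are illustrative)
every 5 seconds:
# iostat -d hdisk0 hdisk1 5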
Remember that the first set of data represents all activity since system startup.
Disks:
Shows the names of the physical volumes. They are either hdisk or cd followed by a number. If
physical volume names are specified with the iostat command, only those names specified are
displayed.
% tm_act
Indicates the percentage of time that the physical disk was active (bandwidth utilization for the drive)
or, in other words, the total time disk requests are outstanding. A drive is active during data transfer
and command processing, such as seeking to a new location. The "disk active time" percentage
is directly proportional to resource contention and inversely proportional to performance. As disk
use increases, performance decreases and response time increases. In general, when the utilization
exceeds 70 percent, processes are waiting longer than necessary for I/O to complete because most
UNIX processes block (or sleep) while waiting for their I/O requests to complete. Look for busy versus
idle drives. Moving data from busy to idle drives can help alleviate a disk bottleneck. Paging to and
from disk will contribute to the I/O load.
Kbps
Indicates the amount of data transferred (read or written) to the drive in KB per second. This is the
sum of Kb_read plus Kb_wrtn, divided by the seconds in the reporting interval.
tps
Indicates the number of transfers per second that were issued to the physical disk. A transfer is an I/O
request through the device driver level to the physical disk. Multiple logical requests can be combined
into a single I/O request to the disk. A transfer is of indeterminate size.
Kb_read
Reports the total data (in KB) read from the physical volume during the measured interval.
Kb_wrtn
Shows the amount of data (in KB) written to the physical volume during the measured interval.
Taken alone, there is no unacceptable value for any of the above fields because statistics are too
closely related to application characteristics, system configuration, and type of physical disk drives and
adapters. Therefore, when you are evaluating data, look for patterns and relationships. The most common
relationship is between disk utilization (%tm_act) and data transfer rate (tps).
The disk xfer part provides the number of transfers per second to the specified physical volumes that
occurred in the sample interval. One to four physical volume names can be specified. Transfer statistics
are given for each specified drive in the order specified. This count represents requests to the physical
device. It does not imply an amount of data that was read or written. Several logical requests can be
combined into one physical request.
• The in column of the vmstat output
This column shows the number of hardware or device interrupts (per second) observed over the
measurement interval. Examples of interrupts are disk request completions and the 10 millisecond
clock interrupt. Since the latter occurs 100 times per second, the in field is always greater than 100. But
the vmstat command also provides a more detailed output about the system interrupts.
• The vmstat -i output
# vmstat -i 1 2
priority level type count module(handler)
0 0 hardware 0 i_misc_pwr(a868c)
0 1 hardware 0 i_scu(a8680)
0 2 hardware 0 i_epow(954e0)
0 2 hardware 0 /etc/drivers/ascsiddpin(189acd4)
1 2 hardware 194 /etc/drivers/rsdd(1941354)
3 10 hardware 10589024 /etc/drivers/mpsdd(1977a88)
3 14 hardware 101947 /etc/drivers/ascsiddpin(189ab8c)
5 62 hardware 61336129 clock(952c4)
10 63 hardware 13769 i_softoff(9527c)
priority level type count module(handler)
0 0 hardware 0 i_misc_pwr(a868c)
0 1 hardware 0 i_scu(a8680)
0 2 hardware 0 i_epow(954e0)
0 2 hardware 0 /etc/drivers/ascsiddpin(189acd4)
1 2 hardware 0 /etc/drivers/rsdd(1941354)
3 10 hardware 25 /etc/drivers/mpsdd(1977a88)
3 14 hardware 0 /etc/drivers/ascsiddpin(189ab8c)
5 62 hardware 105 clock(952c4)
10 63 hardware 0 i_softoff(9527c)
Note: The output will differ from system to system, depending on hardware and software configurations
(for example, the clock interrupts may not be displayed in the vmstat -i output although they will be
accounted for under the in column in the normal vmstat output). Check for high numbers in the count
column and investigate why this module has to execute so many interrupts.
# sar -d 3 3
# lslv -l hd2
hd2:/usr
PV COPIES IN BAND DISTRIBUTION
hdisk0 114:000:000 22% 000:042:026:000:046
The output of COPIES shows the logical volume hd2 has only one copy. The IN BAND shows how well
the intrapolicy, an attribute of logical volumes, is followed. The higher the percentage, the better the
allocation efficiency. Each logical volume has its own intrapolicy. If the operating system cannot meet this
requirement, it chooses the best way to meet the requirements. In our example, there are a total of 114
logical partitions (LPs); 42 LPs are located on the middle, 26 LPs on the center, and 46 LPs on the inner
edge. Since the logical volume intrapolicy is center, the in-band is 22 percent (26 / (42+26+46)). The
DISTRIBUTION shows how the physical partitions are placed in each section of the intrapolicy, that is,
edge : middle : center : inner-middle : inner-edge.
See “Position on physical volume ” on page 184 for additional information about physical partitions
placement.
USED USED USED USED USED USED USED USED USED USED 18-27
USED USED USED USED USED USED USED 28-34
USED USED USED USED USED USED USED USED USED USED 35-44
USED USED USED USED USED USED 45-50
USED USED USED USED USED USED USED USED USED USED 51-60
0052 0053 0054 0055 0056 0057 0058 61-67
0059 0060 0061 0062 0063 0064 0065 0066 0067 0068 68-77
0069 0070 0071 0072 0073 0074 0075 78-84
USED USED USED USED USED USED USED USED USED USED 18-27
USED USED USED USED USED USED USED 28-34
USED USED USED USED USED USED USED USED USED USED 35-44
USED USED USED USED USED USED 45-50
0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 51-60
0011 0012 0013 0014 0015 0016 0017 61-67
0018 0019 0020 0021 0022 0023 0024 0025 0026 0027 68-77
0028 0029 0030 0031 0032 0033 0034 78-84
From top to bottom, five blocks represent edge, middle, center, inner-middle, and inner-edge,
respectively.
• A USED indicates that the physical partition at this location is used by a logical volume other than the
one specified. A number indicates the logical partition number of the logical volume specified with the
lslv -p command.
• A FREE indicates that this physical partition is not used by any logical volume. Logical volume
fragmentation occurs if logical partitions are not contiguous across the disk.
• A STALE physical partition is a physical partition that contains data you cannot use. You can also
see the STALE physical partitions with the lspv -m command. Physical partitions marked as STALE
must be updated to contain the same information as valid physical partitions. This process, called
resynchronization with the syncvg command, can be done at vary-on time, or can be started anytime
the system is running. Until the STALE partitions have been rewritten with valid data, they are not used
to satisfy read requests, nor are they written to on write requests.
In the previous example, logical volume hd11 is fragmented within physical volume hdisk1, with its first
logical partitions in the inner-middle and inner regions of hdisk1, while logical partitions 35-51 are in the
outer region. A workload that accessed hd11 randomly would experience unnecessary I/O wait time as
longer seeks might be needed on logical volume hd11. These reports also indicate that there are no free
physical partitions in either hdisk0 or hdisk1.
This example shows that there is very little fragmentation within the file, and the gaps that do exist are
small. We can therefore infer that the disk arrangement of big1 is not significantly affecting its sequential
read time.
# vmstat 5 8
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
0 1 72379 434 0 0 0 0 2 0 376 192 478 9 3 87 1
0 1 72379 391 0 8 0 0 0 0 631 2967 775 10 1 83 6
0 1 72379 391 0 0 0 0 0 0 625 2672 790 5 3 92 0
0 1 72379 175 0 7 0 0 0 0 721 3215 868 8 4 72 16
2 1 71384 877 0 12 13 44 150 0 662 3049 853 7 12 40 41
0 2 71929 127 0 35 30 182 666 0 709 2838 977 15 13 0 71
0 1 71938 122 0 0 8 32 122 0 608 3332 787 10 4 75 11
0 1 71938 122 0 0 0 3 12 0 611 2834 733 5 3 75 17
The fact that more paging-space page-ins than page-outs occurred during the compilation suggests that
we had shrunk the system to the point that thrashing begins. Some pages were being repaged because
their frames were stolen before their use was complete.
# vmstat -s >statout
# testpgm
# sync
# vmstat -s >> statout
# egrep "ins|outs" statout
yields a before and after picture of the cumulative disk activity counts, such as:
During the period when this command (a large C compile) was running, the system read a total of 981
pages (8 from paging space) and wrote a total of 449 pages (193 to paging space).
Tracing is started by the filemon command, optionally suspended with the trcoff subcommand and
resumed with the trcon subcommand. As soon as tracing is terminated, the filemon command writes its
report to stdout.
Note: Only data for those files opened after the filemon command was started will be collected, unless
you specify the -u flag.
The filemon command can read the I/O trace data from a specified file, instead of from the real-time
trace process. In this case, the filemon report summarizes the I/O activity for the system and the period
represented by the trace file. An adjusted trace logfile is then fed into the filemon command to report on
I/O activity captured by a previously recorded trace session, as follows:
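# filemon -i trace.rpt -O all | pg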
In this example, the filemon command reads file system trace events from the input file trace.rpt.
Because the trace data is already captured on a file, the filemon command does not put itself in the
background to allow application programs to be run. After the entire file is read, an I/O activity report for
the virtual memory, logical volume, and physical volume levels is displayed on standard output (which, in
this example, is piped to the pg command).
If the trace command was run with the -C all flag, then run the trcrpt command also with the -C all
flag (see “Formatting a report from trace -C output ” on page 364).
The following sequence of commands gives an example of the filemon command usage:
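One possible sequence (the output file name is illustrative) is:
# filemon -v -o fmon.out -O all ; cp /smit.log /dev/null ; trcstop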
------------------------------------------------------------------------
Detailed File Stats
------------------------------------------------------------------------
FILE: /dev/null
opens: 1
total bytes xfrd: 50600
writes: 13 (0 errs)
write sizes (bytes): avg 3892.3 min 1448 max 4096 sdev 705.6
write times (msec): avg 0.007 min 0.003 max 0.022 sdev 0.006
------------------------------------------------------------------------
Detailed VM Segment Stats (4096 byte pages)
------------------------------------------------------------------------
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
Using the filemon command in systems with real workloads would result in much larger reports and
might require more trace buffer space. Space and CPU time consumption for the filemon command
can degrade system performance to some extent. Use a nonproduction system to experiment with the
filemon command before starting it in a production environment. Also, use offline processing and on
systems with many CPUs use the -C all flag with the trace command.
Note: Although the filemon command reports average, minimum, maximum, and standard deviation in
its detailed-statistics sections, the results should not be used to develop confidence intervals or other
formal statistical inferences. In general, the distribution of data points is neither random nor symmetrical.
or
# ncheck -i 858 /
/:
858 /smit.log
To enable the statistics for the lvmstat command for a specific logical volume, use the following
command:
# lvmstat -l lvname -e
To disable the statistics for the lvmstat command for a specific logical volume, use the following
command:
# lvmstat -l lvname -d
To enable the statistics for the lvmstat command for all logical volumes in a volume group, use the
following command:
# lvmstat -v vgname -e
To disable the statistics for the lvmstat command for all logical volumes in a volume group, use the
following command:
# lvmstat -v vgname -d
When using the lvmstat command, if you do not specify an interval value, the output displays the
statistics for every partition in the logical volume. When you specify an interval value, in seconds, the
lvmstat command output only displays statistics for the particular partitions that have been accessed in
the specified interval. The following is an example of the lvmstat command:
# lvmstat -l lv00 1
You can use the -c flag to limit the number of statistics the lvmstat command displays. The -c flag
specifies the number of partitions with the most I/O activity that you want displayed. The following is an
example of using the lvmstat command with the -c flag:
# lvmstat -l lv00 -c 5
The above command displays the statistics for the 5 partitions with the most I/O activity.
If you do not specify the iterations parameter, the lvmstat command continues to produce output until
you interrupt the command. Otherwise, the lvmstat command displays statistics for the number of
iterations specified.
In using the lvmstat command, if you find that there are only a few partitions that are heavily used,
you might want to separate these partitions over different hard disks. The migratelp command allows
you to migrate individual partitions from one hard disk to another.
Physical partitions are numbered consecutively, starting with number one, from the outer-most edge to
the inner-most edge.
The edge and inner-edge strategies specify allocation of partitions to the edges of the physical volume.
These partitions have the slowest average seek times, which generally result in longer response times for
any application that uses them. Edge on disks produced since the mid-1990s can hold more sectors per
track so that the edge is faster for sequential I/O.
The middle and inner-middle strategies specify that partitions be allocated away from the edges of the
physical volume and outside the center. These strategies allocate reasonably good locations for partitions
with reasonably good average seek times. Most of the partitions on a physical volume are available for
allocation using this strategy.
The center strategy specifies allocation of partitions to the center section of each physical volume. These
partitions have the fastest average seek times, which generally result in the best response time for any
application that uses them. Fewer partitions on a physical volume satisfy the center strategy than any
other general strategy.
The paging space logical volume is a good candidate for allocation at the center of a physical volume
if there is a lot of paging activity. At the other extreme, the dump and boot logical volumes are used
infrequently and, therefore, should be allocated at the beginning or end of the physical volume.
The general rule, then, is that the more I/Os, either absolutely or in the course of running an important
application, the closer to the center of the physical volumes the physical partitions of the logical volume
should be allocated.
The minimum option indicates the number of physical volumes used to allocate the required physical
partitions. This is generally the policy to use to provide the greatest reliability and availability, without
having copies, to a logical volume. Two choices are available when using the minimum option, with copies
and without, as follows:
• Without Copies: The minimum option indicates one physical volume should contain all the physical
partitions of this logical volume. If the allocation program must use two or more physical volumes, it
uses the minimum number possible, remaining consistent with the other parameters.
• With Copies: The minimum option indicates that as many physical volumes as there are copies should
be used. If the allocation program must use two or more physical volumes, the minimum number of
physical volumes possible are used to hold all the physical partitions. At all times, the constraints
imposed by other parameters such as the strict option are observed.
These definitions are applicable when extending or copying an existing logical volume. The existing
allocation is counted to determine the number of physical volumes to use in the minimum with copies
case, for example.
The maximum option indicates the number of physical volumes used to allocate the required physical
partitions. The maximum option intends, considering other constraints, to spread the physical partitions
of this logical volume over as many physical volumes as possible. This is a performance-oriented option
and should be used with copies to improve availability. If an uncopied logical volume is spread across
multiple physical volumes, the loss of any physical volume containing a physical partition from that logical
volume is enough to cause the logical volume to be incomplete.
Stripe size
Strip size in bytes multiplied by the number of disks in the array equals the stripe size. Strip size can be
any power of 2, from 4 KB to 128 MB.
When defining a striped logical volume, at least two physical drives are required. The size of the logical
volume in partitions must be an integral multiple of the number of disk drives used. See “Tuning logical
volume striping ” on page 190 for a detailed discussion.
# lvmo -a
vgname = rootvg
pv_pbuf_count = 256
total_vg_pbufs = 768
max_vg_pbuf_count = 8192
pervg_blocked_io_count = 0
global_pbuf_count = 256
If you want to display the current values for another volume group, use the following command:
lvmo -v <vg_name> -a
To set the value for a tunable with the lvmo command, use the equal sign, as in the following example:
Note: In the following example, the pv_pbuf_count tunable is set to 257 in the redvg volume group.
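# lvmo -v redvg -o pv_pbuf_count=257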
vgname = redvg
pv_pbuf_count = 257
total_vg_pbufs = 257
max_vg_pbuf_count = 263168
pervg_blocked_io_count = 0
global_pbuf_count = 256
global_blocked_io_count = 20
Note: If you increase the pbuf value too much, you might see a degradation in performance or unexpected
system behavior.
Related information
lvmo Command
In an ordinary logical volume, the data addresses correspond to the sequence of blocks in the underlying
physical partitions. In a striped logical volume, the data addresses follow the sequence of stripe units. A
complete stripe consists of one stripe unit on each of the physical devices that contains part of the striped
logical volume. The LVM determines which physical blocks on which physical drives correspond to a block
being read or written. If more than one drive is involved, the necessary I/O operations are scheduled for
all drives simultaneously.
As an example, a hypothetical lvs0 has a stripe-unit size of 64 KB, consists of six 2 MB partitions, and
contains a journaled file system (JFS). If an application is reading a large sequential file and read-ahead
has reached a steady state, each read will result in two or three I/Os being scheduled to each of the disk
drives to read a total of eight pages (assuming that the file is on consecutive blocks in the logical volume).
The read operations are performed in the order determined by the disk device driver. The requested data
is assembled from the various pieces of input and returned to the application.
Although each disk device will have a different initial latency, depending on where its accessor was at the
beginning of the operation, after the process reaches a steady state, all three disks should be reading at
close to their maximum speed.
char *buffer;
buffer = malloc(MAXBLKSIZE + 64);   /* allocate extra space for alignment */
/* Round the address up to a 64-byte boundary. A pointer-sized integer
   type must be used here; casting to int truncates 64-bit pointers. */
buffer = (char *)(((unsigned long)buffer + 64) & ~0x3fUL);
If the striped logical volumes are on raw logical volumes and writes larger than 1.125 MB are being done
to these striped raw logical volumes, increasing the lvm_bufcnt parameter with the ioo command might
increase throughput of the write activity. See “File system buffer tuning” on page 229.
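For example, the following command (the value is illustrative; the best setting depends on the workload)
increases the number of LVM buffers:
# ioo -o lvm_bufcnt=10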
The example above is for a JFS striped logical volume. The same techniques apply to enhanced JFS,
except the ioo parameters used will be the enhanced JFS equivalents.
Also, it is not a good idea to mix striped and non-striped logical volumes in the same physical volume. All
physical volumes should be the same size within a set of striped logical volumes.
You can use SMIT (the fast path is smitty chgdsk) or the chdev command to change these parameters.
For example, if your system contained a non-IBM SCSI disk drive hdisk5, the following command enables
queuing for that device and sets its queue depth to 3:
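# chdev -l hdisk5 -a q_type=simple -a queue_depth=3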
If the disk array is attached through a SCSI-2 Fast/Wide SCSI adapter bus, it may also be necessary to
change the outstanding-request limit for that bus.
Item Descriptor
RAID 0 Striping
RAID 1 Mirroring
RAID 10 or 0+1 Mirroring and striping
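Fast I/O Failure is enabled by setting the fc_err_recov attribute of the fscsi device to fast_fail, as in the
following example:
# chdev -l fscsi0 -a fc_err_recov=fast_fail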
In this example, the fscsi device instance is fscsi0. Fast fail logic is called when the adapter driver
receives an indication from the switch that there is a link event with a remote storage device port by way
of a Registered State Change Notification (RSCN) from the switch.
Fast I/O Failure is useful in situations where multipathing software is used. Setting the fc_err_recov
attribute to fast_fail can decrease the I/O fail times because of link loss between the storage device
and switch. This would support faster failover to alternate paths.
In single-path configurations, especially configurations with a single path to a paging device, the
delayed_fail default setting is recommended.
Fast I/O Failure requires the following:
• A switched environment. It is not supported in arbitrated loop environments, including public loop.
• FC 6227 adapter firmware, level 3.22A1 or higher.
• FC 6228 adapter firmware, level 3.82A1 or higher.
• FC 6239 adapter firmware, all levels.
• All subsequent FC adapter releases support Fast I/O Failure.
If any of these requirements are not met, the fscsi device logs an error log of type INFO indicating that
one of these requirements is not met and that Fast I/O Failure is not enabled.
Some FC devices support enablement and disablement of Fast I/O Failure while the device is in the
Available state. To verify whether a device supports dynamically changing the Fast I/O Failure setting,
use the lsattr command. For supporting devices, Fast I/O Failure can be changed without unconfiguration
and reconfiguration of the device or cycling the link. The changes must be requested when the storage
area network (SAN) fabric is stable. A request fails if error recovery is active in the SAN at the time of the
request.
Related information
lsattr Command
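Dynamic tracking is enabled by setting the dyntrk attribute of the fscsi device to yes, as in the following
example:
# chdev -l fscsi0 -a dyntrk=yes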
In this example, the fscsi device instance is fscsi0. Dynamic tracking logic is called when the adapter
driver receives an indication from the switch that there is a link event with a remote storage device port.
Dynamic tracking support requires the following configuration:
• A switched environment. It is not supported in arbitrated loop environments, including public loop.
• FC 6227 adapter firmware, level 3.22A1 or higher.
• FC 6228 adapter firmware, level 3.82A1 or higher.
• FC 6239 adapter firmware, all levels.
• All subsequent FC adapter releases support dynamic tracking.
• The worldwide name (Port Name) and node name of a device must remain constant, and the worldwide
name must be unique. Changing the worldwide name or node name of an available or
online device can result in I/O failures. In addition, each FC storage device instance must have
world_wide_name and node_name attributes. Updated filesets that contain the sn_location attribute
(see the following bullet) must also be updated to contain both of these attributes.
• The storage device must provide a reliable method to extract a unique serial number for each LUN. The
AIX FC device drivers do not automatically detect the location of the serial number. The method for
serial number extraction must be provided by the storage vendor to support dynamic tracking for the
specific devices. This information is conveyed to the drivers by using the sn_location ODM attribute for
each storage device. If the disk or tape driver detects that the sn_location ODM attribute is missing, an
error log of type INFO is generated and dynamic tracking is not enabled.
Note: When the lsattr command is run on a hdisk, the sn_location attribute might not be displayed.
That is, the attribute name is not shown even though it is present in ODM.
• The FC device drivers can track devices on a SAN fabric, if the N_Port IDs on the fabric stabilize within
15 seconds. The SAN fabric is a fabric as seen from a single host bus adapter. If cables are not reseated
or N_Port IDs continue to change after the initial 15 seconds, I/O failures occur.
• Devices are not tracked across host bus adapters. Devices are tracked if they remain visible from the
same HBA that they are originally connected to.
For example, if device A is moved from one location to another on fabric A that is attached to host
bus adapter A (in other words, its N_Port on fabric A changes), the device is tracked without any user
intervention, and I/O to this device can continue.
However, if a device A is visible from HBA A but not from HBA B, and device A is moved from the fabric
that is attached to HBA A to the fabric attached to HBA B, device A is not accessible on fabric A nor on
fabric B. User intervention would be required to make it available on fabric B by running the cfgmgr
command. The AIX device instance on fabric A is not usable, and a device instance on fabric B must be
created. This device must be added manually to volume groups, multipath device instances, and so on.
This procedure is similar to removing a device from fabric A and adding a device to fabric B.
• No dynamic tracking can be performed for FC dump devices while an AIX system memory dump is in
progress. In addition, dynamic tracking is not supported during system restart or by running the cfgmgr
command. SAN changes cannot be made while any of these operations are in progress.
dyntrk fc_err_recov FC Driver Behavior
no delayed_fail The default setting. This is legacy behavior existing in previous
versions of AIX. The FC drivers do not recover if the SCSI ID of
a device changes, and I/Os take longer to fail when a link loss
occurs between a remote storage port and switch. This might be
preferable in single-path situations if dynamic tracking support is
not a requirement.
no fast_fail If the driver receives a RSCN from the switch, this could indicate a
link loss between a remote storage port and switch. After an initial
15-second delay, the FC drivers query to see if the device is on the
fabric. If not, I/Os are flushed back by the adapter. Future retries or
new I/Os fail immediately if the device is still not on the fabric. If
the FC drivers detect that the device is on the fabric but the SCSI ID
has changed, the FC device drivers do not recover, and the I/Os fail
with PERM errors.
yes delayed_fail If the driver receives a RSCN from the switch, this could indicate a
link loss between a remote storage port and switch. After an initial
15-second delay, the FC drivers query to see if the device is on the
fabric. If not, I/Os are flushed back by the adapter. Future retries
or new I/Os fail immediately if the device is still not on the fabric,
although the storage driver (disk, tape, FastT) drivers might inject
a small delay (2-5 seconds) between I/O retries. If the FC drivers
detect that the device is on the fabric but the SCSI ID has changed,
the FC device drivers reroute traffic to the new SCSI ID.
yes fast_fail If the driver receives a Registered State Change Notification (RSCN)
from the switch, this could indicate a link loss between a remote
storage port and switch. After an initial 15-second delay, the FC
drivers query to see if the device is on the fabric. If not, I/Os
are flushed back by the adapter. Future retries or new I/Os fail
immediately if the device is still not on the fabric. The storage
driver (disk, tape, FastT) will likely not delay between retries. If the
FC drivers detect the device is on the fabric but the SCSI ID has
changed, the FC device drivers reroute traffic to the new SCSI ID.
When dynamic tracking is disabled, there is a marked difference between the delayed_fail and
fast_fail settings of the fc_err_recov attribute. However, with dynamic tracking enabled, the setting of
the fc_err_recov attribute is less significant. This is because there is some overlap in the dynamic tracking
and fast fail error-recovery policies. Therefore, enabling dynamic tracking inherently enables some of the
fast fail logic.
Modular I/O
The Modular I/O (MIO) library allows you to analyze and tune an application's I/O at the application level
for optimal performance.
Applications frequently have very little logic built into them to provide users the opportunity to optimize
the I/O performance of the application. The absence of application-level I/O tuning leaves the end user
at the mercy of the operating system to provide the tuning mechanisms for I/O performance. Typically,
multiple applications are run on a given system that have conflicting needs for high performance I/O
resulting, at best, in a set of tuning parameters that provide moderate performance for the application
mix. The MIO library addresses the need for an application-level method for optimizing I/O.
Benefits
• MIO, because it is so easy to implement, makes it very simple to analyze the I/O of an application.
• MIO allows you to cache the I/O at the application level: you can optimize the I/O system calls and
thereby reduce the number of system interrupts.
• The pf cache can be configured for each file, or for a group of files, making it more configurable than the
OS cache.
• MIO can be used on I/O applications that run simultaneously, linking some of them with MIO and
configuring them to use the pf cache and DIRECT I/O to bypass the normal JFS and JFS2 cache. These
MIO-linked applications will release more OS cache space for the I/O applications that are not linked to
MIO.
• MIO cache is useful for large sequential-access files.
Cautions
• Misuse of the MIO library cache configuration can cause performance degradation. To avoid this, first
analyze the I/O policy of your application, then find the module option parameters that truly apply
to your situation and set the value of these parameters to help improve the performance of your
application. Examples of misuse of MIO:
– For an application that accesses a file smaller than the OS memory size, if you configure the direct
option of the pf module, you can degrade your performance.
– For random access files, a cache may degrade your performance.
• The MIO cache is allocated with the malloc subsystem in the application's address space, so be careful:
if the total MIO cache size is bigger than the available OS memory, the system will use the paging space.
This can cause performance degradation or operating-system failure.
MIO architecture
The Modular I/O library consists of five I/O modules that may be invoked at runtime on a per-file basis.
The modules currently available are:
• The mio module, which is the interface to the user program.
• The pf module, which is a data prefetching module.
• The trace module, which is a statistics-gathering module.
• The recov module, which analyzes failed I/O accesses and retries in case of failure.
• The aix module, which is the MIO interface to the operating system.
MIO implementation
There are three methods available to implement MIO: redirection linking libtkio, redirection including
libmio.h, and explicit calls to MIO routines.
Implementation is easy using any of the three methods; however, redirection linking libtkio is
recommended.
To implement MIO with the redirection method that includes libmio.h, add the following two lines to
your source code:
#define USE_MIO_DEFINES
#include "libmio.h"
MIO_STATS
Use MIO_STATS to point to a log file for diagnostic messages and for output requested from the MIO
modules.
It is interpreted as a file name with 2 special cases. If the file is either stderr or stdout the output will
be directed towards the appropriate file stream. If the file name begins with a plus (+) sign, such as
+filename.txt, then the file will be opened for appending; if there is no plus sign preceding the file name
then the file will be overwritten.
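For example, the following setting (the file name is illustrative) directs MIO diagnostic output to a log file
and appends to it rather than overwriting it:
setenv MIO_STATS +my.stats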
MIO_FILES
MIO_FILES provides the key to determine which modules are called for a given file when MIO_open64 is
called.
The format for MIO_FILES is:
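MIO_FILES= file_name_list [ module list ] file_name_list [ module list ] ...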
When MIO_open64 is called MIO checks for the existence of the MIO_FILES environment variable. If the
environment variable is present MIO parses its data to determine which modules to invoke for which files.
MIO_FILES is parsed from left to right. All characters preceding a left bracket ([) are taken as a
file_name_list. A file_name_list is a list of file_name_template patterns that are separated by colons (:).
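For example, consider the following setting (the patterns and module lists are illustrative):
setenv MIO_FILES " *.dat [ trace ] *.f01 : *.f02 [ trace | pf | trace ] "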
The MIO_open64 subroutine opens the test.dat file and matches its name with the *.dat
file_name_template pattern, resulting in the invocation of the mio, trace, and aix modules.
The MIO_open64 subroutine opens the test.f02 file and matches its name with *.f02, the second
file_name_template pattern in the second file_name_list, resulting in the invocation of the mio, trace, pf,
trace, and aix modules.
Each module has its own hardcoded default options for a default call in the environment variable. You
can override the default options by specifying values in the associated MIO_FILES module list. The
following example turns on statistics for the trace module and redirects the output to the my.stats file:
The options for a module are delimited with a forward slash (/). Some options require an associated
integer value or a string value. For options that require a string value, if the string includes a forward slash
(/), enclose the string in braces {}. For options that require an integer value, you might append the integer
value with a k, m, g, or t to represent kilobytes, megabytes, gigabytes, or terabytes. Integer values can
also be entered in base 10, 8, or 16. If the integer value uses a prefix with 0x, the integer is interpreted as
base 16. If the integer value uses a prefix with 0, the integer is interpreted as base 8. If these two tests
fail, the integer is interpreted as base 10.
MIO_DEFAULTS
The purpose of the MIO_DEFAULTS environment variable is to aid in the readability of the data stored in
the MIO_FILES environment variable.
If the user specifies several modules for multiple file_name_list and module list pairs, then the MIO_FILES
environment variable can become quite long. If the user repeatedly overrides the hard-coded defaults
in the same manner, it is simpler to specify new defaults for a module by using the MIO_DEFAULTS
environment variable, as in the following example:
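MIO_DEFAULTS= trace/events=prob.events , aix/debug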
Now any default invocation of the trace module will have binary event tracing enabled and directed
towards the prob.events file and any default invocation of the aix module will have the debug option
enabled.
MIO_DEBUG
The purpose of the MIO_DEBUG environment variable is to aid in debugging MIO.
MIO searches MIO_DEBUG for keywords and provides debugging output for the option. The available
keywords are:
ALL
Turns on all the MIO_DEBUG environment variable keywords.
ENV
Outputs environment variable matching requests.
OPEN
Outputs open requests made to the MIO_open64 subroutine.
MODULES
Outputs modules invoked for each call to the MIO_open64 subroutine.
TIMESTAMP
Places into a stats file a timestamp preceding each entry.
DEF
Outputs the definition table of each module. This dump is executed for all MIO library modules when
the file opens.
fullwrite
All writes are expected to be full. If there is a write failure due to insufficient space, the module will
retry. This is the default option.
debug
Print debug statements for open and close.
nodebug
Do not print debug statements for open and close. This is the default value.
sector_size
Specific sector size. If not set, the sector size equals the file system sector size.
notrunc
Do not issue trunc system calls. This is needed to avoid problems with JFS O_DIRECT errors.
trunc
Issue trunc system calls. This is the default option.
#!/bin/csh
#
setenv TKIO_ALTLIB "libmio.a(get_mio_ptrs.so)"
#
./example file.dat
#define USE_MIO_DEFINES
#include "libmio.h"
The following script sets the MIO environment variables, compiles and links the application with the MIO
library, and calls it.
#!/bin/csh
#
setenv MIO_STATS example.stats
setenv MIO_FILES " *.dat [ trace/stats ] "
setenv MIO_DEFAULTS " trace/kbytes "
setenv MIO_DEBUG OPEN
#
cc -o example example.c -lmio
#
./example file.dat
Header elements
• Date
• Hostname
• Whether aio is enabled
• Program name
• MIO library version
• Environment variables
Debug elements
• List of all debug options that are set
Sample
MIO statistics file : Tue May 10 14:14:08 2005
hostname=host1 : with Legacy aio available
Program=/mio/example
MIO library libmio.a 3.0.0.60 AIX 32 bit addressing built Apr 19 2005 15:08:17
MIO_INSTALL_PATH=
MIO_STATS =example.stats
MIO_DEBUG =OPEN
MIO_FILES = *.dat [ trace/stats ]
MIO_DEFAULTS = trace/kbytes
MIO_DEBUG OPEN =T
program --> <bytes written into the cache by the parent>/<number of writes from the parent> --> pf --> <bytes written out of the cache to the child>/<number of partial pages written> --> aix
program <-- <bytes read out of the cache by the parent>/<number of reads from the parent> <-- pf <-- <bytes read into the cache from the child>/<number of pages read from the child> <-- aix
Sample
pf close for /home/user1/pthread/258/SM20182_0.SCR300
50 pages of 2097152 bytes 131072 bytes per sector
133/133 pages not preread for write
23 unused prefetches out of 242 : prefetch=2
95 write behinds
mbytes transferred / Number of requests
program --> 257/257 --> pf --> 257/131 --> aix
program <-- 269/269 <-- pf <-- 265/133 <-- aix
15:30:00
recov : command=ls -l file=file.dat errno=28 try=0
recov : failure : new_ret=-1
OS configuration
The alot_buf application accomplishes the following:
• Writes a 14 GB file.
• 140 000 sequential writes with a 100 KB buffer.
• Reads the file sequentially with a 100 KB buffer.
• Reads the file backward sequentially with a 100 KB buffer.
# vmstat
System Configuration: lcpu=2 mem=512MB
# ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 131072
stack(kbytes) 32768
memory(kbytes) 32768
coredump(blocks) 2097151
nofiles(descriptors) 2000
# df -k /mio
Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
/dev/fslv02 15728640 15715508 1% 231 1% /mio
# lslv fslv02
LOGICAL VOLUME: fslv02 VOLUME GROUP: mio_vg
LV IDENTIFIER: 000b998d00004c00000000f17e5f50dd.2 PERMISSION: read/write
VG STATE: active/complete LV STATE: opened/syncd
TYPE: jfs2 WRITE VERIFY: off
MAX LPs: 512 PP SIZE: 32 megabyte(s)
COPIES: 1 SCHED POLICY: parallel
LPs: 480 PPs: 480
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: middle UPPER BOUND: 32
MOUNT POINT: /mio LABEL: /mio
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
time /mio/alot_buf
Note: The output diagnostic file is mio_analyze.stats for the debug data and the trace module data.
All values are in kilobytes.
Note: The time command reports the execution time of the command.
MIO_DEBUG OPEN =T
MIO_DEBUG MODULES =T
MIO_DEBUG TIMESTAMP =T
17:32:22
Opening file test.dat
modules[18]=trace/stats/kbytes
trace/stats={mioout}/noevents/kbytes/nointer
aix/nodebug/trunc/sector_size=0/einprogress=60
============================================================================
18:00:28
Note:
• 140 000 writes of 102 400 bytes.
• 280 000 reads of 102 400 bytes.
• rate of 27 741.92 KB/s.
time /mio/alot_buf
• A good way to analyse the I/O of your application is to use the trace | pf | trace module list. This
way you can get the performance that the application sees from the pf cache and also the performance
that the pf cache sees from the operating system.
• The pf global cache is 100 MB in size. Each page is 2 MB. The number of pages to prefetch is four. The
pf cache does asynchronous direct I/O system calls.
• The output diagnostic file is mio_pf.stats for the debug data, the trace module data, and the pf
module data. All values are in kilobytes.
MIO_DEBUG OPEN =T
MIO_DEBUG MODULES =T
MIO_DEBUG TIMESTAMP =T
17:10:12
Opening file test.dat
modules[79]=trace/stats/kbytes|pf/cache=100m/page=2m/pref=4/stats/direct|trace/stats/kbytes
trace/stats={mioout}/noevents/kbytes/nointer
pf/nopffw/release/global=0/asynchronous/direct/bytes/cache_size=100m/page_size=2m/prefetch=4/st
ride=1/stats={mioout}/nointer/noretain/nolistio/notag/noscratch/passthru={0:0}
trace/stats={mioout}/noevents/kbytes/nointer
aix/nodebug/trunc/sector_size=0/einprogress=60
============================================================================
17:25:53
Trace close : pf <-> aix : test.dat : (41897728/619.76)=67603.08 kbytes/s
demand rate=44527.71 kbytes/s=41897728/(940.95-0.01))
current size=14000000 max_size=14000000
mode =0640 FileSystemType=JFS2 sector size=4096
oflags =0x8000302=RDWR CREAT TRUNC DIRECT
open 1 0.01
ill form 0 mem misaligned 0
write 1 0.21 1920 1920 1966080 1966080
awrite 6835 0.20 13998080 13998080 2097152 2097152
17:25:53
pf close for test.dat
50 pages of 2097152 bytes 4096 bytes per sector
6840/6840 pages not preread for write
7 unused prefetches out of 20459 : prefetch=4
6835 write behinds
bytes transferred / Number of requests
program --> 14336000000/140000 --> pf --> 14336000000/6836 --> aix
program <-- 28672000000/280000 <-- pf <-- 28567273472/13622 <-- aix
17:25:53
pf close for global cache 0
50 pages of 2097152 bytes 4096 bytes per sector
6840/6840 pages not preread for write
7 unused prefetches out of 20459 : prefetch=0
6835 write behinds
bytes transferred / Number of requests
program --> 14336000000/140000 --> pf --> 14336000000/6836 --> aix
program <-- 28672000000/280000 <-- pf <-- 28567273472/13622 <-- aix
17:25:53
Trace close : program <-> pf : test.dat : (42000000/772.63)=54359.71 kbytes/s
demand rate=44636.36 kbytes/s=42000000/(940.95-0.01))
current size=14000000 max_size=14000000
mode =0640 FileSystemType=JFS2 sector size=4096
oflags =0x302=RDWR CREAT TRUNC
open 1 0.01
write 140000 288.88 14000000 14000000 102400 102400
read 280000 483.75 28000000 28000000 102400 102400
seek 140003 13.17 average seek delta=-307192
fcntl 2 0.00
close 1 0.00
size 140000
============================================================================
Note: The program executes 140 000 writes of 102 400 bytes and 280 000 reads of 102 400 bytes, but
the pf module executes 6 836 writes (of which 6 835 are asynchronous writes) of 2 097 152 bytes and
executes 13 622 reads (of which 13 619 are asynchronous reads) of 2 097 152 bytes. The rate is 54
359.71 KB/s.
Enhanced JFS
Enhanced JFS, or JFS2, is another native AIX journaling file system.
Enhanced JFS is the default file system for 64-bit kernel environments. Due to address space limitations
of the 32-bit kernel, Enhanced JFS is not recommended for use in 32-bit kernel environments.
Support for data sets is integrated into JFS2 as part of the AIX operating system. A data set is a unit of
data administration. It consists of a directory tree with at least one single root directory. Administration
might include creating new data sets, creating and maintaining full copies (replicas) of data sets across
a collection of servers, or moving a data set to another server. A data set might exist as a portion of a
mounted file system. That is, a mounted file system instance might contain multiple data sets. Data set
support is enabled in JFS2 by using the mkfs -o dm=on command. By default, data set support is not
enabled. A data set enabled JFS2 instance can then be managed through the Dataset Services Manager
(DSM).
Journaling
Before writing actual data, a journaling file system logs the metadata, thus incurring an overhead penalty
that slows write throughput.
One way of improving performance under JFS is to disable metadata logging by using the nointegrity
mount option. Note that the enhanced performance is achieved at the expense of metadata integrity.
Therefore, use this option with extreme caution because a system crash can make a file system mounted
with this option unrecoverable.
In contrast to JFS, Enhanced JFS does not allow you to disable metadata logging. However, the
implementation of journaling on Enhanced JFS makes it more suitable to handle metadata-intensive
applications. Thus, the performance penalty is not as high under Enhanced JFS as it is under JFS.
Directory organization
An index node, or i-node, is a data structure that stores all file and directory properties. When a program
looks up a file, it searches for the appropriate i-node by looking up a file name in a directory.
Because these operations are performed often, the mechanism used for searching is of particular
importance.
JFS employs a linear organization for its directories, thus making searches linear as well. In contrast,
Enhanced JFS employs a binary tree representation for directories, thus greatly accelerating access to
files.
Scaling
The main advantage of using Enhanced JFS over JFS is scaling.
Enhanced JFS provides the capability to store much larger files than the existing JFS. The maximum size
of a file under JFS is 64 gigabytes. Under Enhanced JFS, AIX currently supports files up to 16 terabytes in
size, although the file system architecture is set up to eventually handle file sizes of up to 4 petabytes.
Another scaling issue relates to accessing a large number of files. The following illustration demonstrates
how Enhanced JFS can improve performance for this type of access.
The above example consists of creating, deleting, and searching directories with unique, 10-byte file
names. The results show that creating and deleting files is much faster under Enhanced JFS than under
JFS. Performance for searches was approximately the same for both file system types.
The example below shows that create, delete, and search operations are generally much faster
on Enhanced JFS than on JFS when using non-unique file names. In this example, file names were chosen
to have the same first 64 bytes followed by unique 10-byte suffixes. The following illustration shows the
results of this test:
# cd /home/op
# find . -print | backup -ivf/tmp/op.backup
This command creates a backup file (in a different file system), containing all of the files in the file
system that is to be reorganized. If disk space on the system is limited, you can use tape to back up the
file system.
2. Run the following commands:
# cd /
# unmount /home/op
If any processes are using /home/op or any of its subdirectories, you must terminate those processes
before the unmount command can complete successfully.
3. Re-create the file system on the /home/op logical volume, as follows:
# mkfs /dev/hd11
You are prompted for confirmation before the old file system is destroyed. The name of the file system
does not change.
4. To restore the original situation (except that /home/op is empty), run the following:
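The restore commands were not preserved here; a minimal sketch, assuming the backup file
/tmp/op.backup created in step 1, is:
# mount /home/op
# cd /home/op
# restore -xvf /tmp/op.backup > /dev/null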
Standard output is redirected to /dev/null to avoid displaying the name of each of the files that were
restored, which is time-consuming.
6. Review the large file inspected earlier (see File placement assessment with the fileplace command), as
follows:
In this example, minpgahead is 2 and maxpgahead is 8 (the defaults). The program is processing the file
sequentially. Only the data references that have significance to the read-ahead mechanism are shown,
designated by A through F. The sequence of steps is:
A
The first access to the file causes the first page (page 0) of the file to be read. At this point, the VMM
makes no assumptions about random or sequential access.
B
When the program accesses the first byte of the next page (page 1), with no intervening accesses to
other pages of the file, the VMM concludes that the program is accessing sequentially. It schedules
minpgahead (2) additional pages (pages 2 and 3) to be read. Thus access B causes a total of 3 pages
to be read.
C
When the program accesses the first byte of the first page that has been read ahead (page 2), the
VMM doubles the page-ahead value to 4 and schedules pages 4 through 7 to be read.
D
When the program accesses the first byte of the first page that has been read ahead (page 4), the
VMM doubles the page-ahead value to 8 and schedules pages 8 through 15 to be read.
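The read-ahead thresholds themselves are tunable with the ioo command. The following is a minimal
sketch using the default values shown in this example; for Enhanced JFS the corresponding tunables are
j2_minPageReadAhead and j2_maxPageReadAhead:
# ioo -o minpgahead=2 -o maxpgahead=8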
# ioo -h j2_syncPageCount
Sets the maximum number of modified pages of a file that is written to disk by the sync system call in a
single operation.
Values: Default: 0 Range: 0-65536
Type: Dynamic
Unit: 4 KB pages
Tuning: When running an application that uses file system caching and does large numbers of random
writes, it might be necessary to adjust this setting to avoid lengthy application delays during sync
operations. The values must be in the range of 256 to 1024. The default value of zero results in the
normal sync behavior of writing all dirty pages in a single call. Small values for this tunable result in
longer sync times and shorter delays in application response time; larger values result in shorter sync
times and longer delays in application response time.
# ioo -h j2_syncPageLimit
Sets the maximum number of times the sync system call uses the j2_syncPageCount, to limit pages
that are written to improve the sync operation performance.
Values: Default: 256 Range: 16-65536
Type: Dynamic
Unit: Numeric
Tuning: Set this tunable when j2_syncPageCount is set, and increase it if the effect of the
j2_syncPageCount change is insufficient. The acceptable values are in the range of 250 to 8000.
j2_syncPageLimit has no effect if j2_syncPageCount is 0.
Increase this tunable if the j2_syncPageCount change does not sufficiently improve the application
response time. The values must be in the range of 1 to 8000. The optimum value for these tunables
depends on the memory size and the I/O bandwidth. A neutral starting point is to set both tunables
to 256.
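For example, to apply the neutral starting point of 256 to both tunables and make the change persist
across reboots, a sketch would be:
# ioo -p -o j2_syncPageCount=256 -o j2_syncPageLimit=256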
...
0 paging space I/Os blocked with no psbuf
2740 filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf
...
The paging space I/Os blocked with no psbuf and the filesystem I/Os blocked with
no fsbuf counters are incremented whenever a bufstruct is unavailable and the VMM puts a thread on
the VMM wait list. The external pager filesystem I/Os blocked with no fsbuf counter is
incremented whenever a bufstruct on an Enhanced JFS file system is unavailable
or
# smitty mklv
3. Modify /etc/filesystems and the logical volume control block (LVCB) as follows:
Tuning the maxpout and minpout parameters might prevent any thread that is doing sequential writes to
a file from dominating system resources.
The following table demonstrates the response time of a session of the vi editor on an IBM eServer™
pSeries model 7039-651, configured as a 4-way system with a 1.7 GHz processor, with various values for
the maxpout and the minpout parameters while writing to disk:
The best range for the maxpout and minpout parameters depends on the CPU speed and the I/O system.
I/O pacing works well if the value of the maxpout parameter is equal to or greater than the value of
the j2_nPagesPerWriteBehindCluster parameter. For example, if the value of the maxpout parameter is
equal to 64 and the minpout parameter is equal to 32, there are at most 64 pages in I/O state and 2 I/Os
before blocking on the next write.
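System-wide I/O pacing is set through the sys0 device. The following is a sketch using the example
values above; valid values and defaults vary by AIX level:
# chdev -l sys0 -a maxpout=64 -a minpout=32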
The default tuning parameters are as follows:
For Enhanced JFS, you can use the ioo -o j2_nPagesPerWriteBehindCluster command to
specify the number of pages to be scheduled at one time. The default number of pages for an Enhanced
JFS cluster is 32, which implies a default size of 128 KB for Enhanced JFS. You can use the ioo -o
Network performance
AIX provides several different communications protocols, as well as tools and methods to monitor and
tune them.
Adapter placement
Network performance is dependent on the hardware you select, like the adapter type, and the adapter
placement in the machine.
To ensure best performance, you must place the network adapters in the I/O bus slots that are best suited
for each adapter.
When attempting to determine which I/O bus slot is most suitable, consider the following factors:
• PCI-X versus PCI adapters
• 64-bit versus 32-bit adapters
• supported bus-slot clock speed (33 MHz, 50/66 MHz, or 133 MHz)
The higher the bandwidth or data rate of the adapter, the more critical the slot placement. For example,
PCI-X adapters perform best when used in PCI-X slots, as they typically run at 133 MHz clock speed on
the bus. You can place PCI-X adapters in PCI slots, but they run slower on the bus, typically at 33 MHz or
66 MHz, and do not perform as well on some workloads.
Similarly, 64-bit adapters work best when installed in 64-bit slots. You can place 64-bit adapters in a
32-bit slot, but they do not perform at optimal rates. Large MTU adapters, like Gigabit Ethernet in jumbo
frame mode, perform much better in 64-bit slots.
Other issues that potentially affect performance are the number of adapters per bus or per PCI host
bridge (PHB). Depending on the system model and the adapter type, the number of high speed adapters
might be limited per PHB. The placement guidelines ensure that the adapters are spread across the
available PCI host bridges.
The newer IBM Power Systems processor-based servers only have PCI-X slots. The PCI-X slots are
backwards-compatible with the PCI adapters.
The following table shows examples of common adapters and the suggested slot types:
# lsslot -c pci
# Slot Description Device(s)
U0.1-P1-I1 PCI-X capable, 64 bit, 133 MHz slot fcs0
U0.1-P1-I2 PCI-X capable, 32 bit, 66 MHz slot Empty
U0.1-P1-I3 PCI-X capable, 32 bit, 66 MHz slot Empty
U0.1-P1-I4 PCI-X capable, 64 bit, 133 MHz slot fcs1
U0.1-P1-I5 PCI-X capable, 64 bit, 133 MHz slot ent0
U0.1-P1-I6 PCI-X capable, 64 bit, 133 MHz slot ent2
For a Gigabit Ethernet adapter, the adapter-specific statistics at the end of the entstat -d
ent[interface-number] command output or the netstat -v command output show the PCI bus type and
bus speed of the adapter. The following is an example output of the netstat -v command:
# netstat -v
System firmware
The system firmware is responsible for configuring several key parameters on each PCI adapter as well as
configuring options in the I/O chips on the various I/O and PCI buses in the system.
In some cases, the firmware sets parameters unique to specific adapters, for example the PCI Latency
Timer and Cache Line Size, and for PCI-X adapters, the Maximum Memory Read Byte Count (MMRBC)
values. These parameters are key to obtaining good performance from the adapters. If these parameters
are not properly set because of down-level firmware, it will be impossible to achieve optimal performance
by software tuning alone. Ensure that you update the firmware on older systems before adding new
adapters to the system.
You can see both the platform and system firmware levels with the lscfg -vp|grep -p " ROM" command,
as in the following example:
...lines omitted...
System Firmware:
ROM Level (alterable).......M2P030828
Version.....................RS6K
System Info Specific.(YL)...U0.1-P1/Y1
Physical Location: U0.1-P1/Y1
SPCN firmware:
ROM Level (alterable).......0000CMD02252
Version.....................RS6K
System Info Specific.(YL)...U0.1-P1/Y3
Physical Location: U0.1-P1/Y3
SPCN firmware:
ROM Level (alterable).......0000CMD02252
Version.....................RS6K
System Info Specific.(YL)...U0.2-P1/Y3
Physical Location: U0.2-P1/Y3
Platform Firmware:
ROM Level (alterable).......MM030829
Version.....................RS6K
System Info Specific.(YL)...U0.1-P1/Y2
Physical Location: U0.1-P1/Y2
Table 5. Maximum network payload speeds versus simplex TCP streaming rates
Network type                                     Raw bit rate (Mbits)          Payload rate (Mb)      Payload rate (MB)
10 Mb Ethernet, Half Duplex                      10                            6                      0.7
10 Mb Ethernet, Full Duplex                      10 (20 Mb full duplex)        9.48                   1.13
100 Mb Ethernet, Half Duplex                     100                           62                     7.3
100 Mb Ethernet, Full Duplex                     100 (200 Mb full duplex)      94.8                   11.3
1000 Mb Ethernet, Full Duplex, MTU 1500          1000 (2000 Mb full duplex)    948                    113.0
1000 Mb Ethernet, Full Duplex, MTU 9000          1000 (2000 Mb full duplex)    989                    117.9
10 Gb Ethernet, Full Duplex, MTU 1500            10000                         7200 (peak 9415)1      858 (peak 1122)1
  (with RFC1323 enabled)
10 Gb Ethernet, Full Duplex, MTU 9000            10000                         9631 (peak 9891)1      1148 (peak 1179)1
  (with RFC1323 enabled)
FDDI, MTU 4352 (default)                         100                           92                     11.0
Asynchronous Transfer Mode (ATM) 155, MTU 1500   155                           125                    14.9
ATM 155, MTU 9180 (default)                      155                           133                    15.9
ATM 622, MTU 1500                                622                           364                    43.4
1The values in the table indicate rates for dedicated adapters on dedicated partitions. Performance for 10
Gigabit Ethernet adapters in virtual Ethernet Adapter (in VIOS) or Shared Ethernet Adapters (SEA) or for
shared partitions (shared LPAR) is not represented in the table because performance is impacted by other
variables and tuning that is outside the scope of this table.
Two direction (duplex) TCP streaming workloads have data streaming in both directions. For example,
running the ftp command from system A to system B and another instance of the ftp command from
system B to A concurrently is considered duplex TCP streaming. These types of workloads take advantage
of full duplex media that can send and receive data concurrently. Some media, like Fibre-Distributed Data
Interface (FDDI) or Ethernet in Half Duplex mode, cannot send and receive data concurrently and do
not perform well when running duplex workloads. Duplex workloads do not scale to twice the rate of a
simplex workload because the TCP acknowledge packets that are coming back from the receiver must
compete with the data packets that are flowing in the same direction. The following table lists the two
direction (duplex) TCP streaming rates:
Table 6. Maximum network payload speeds versus duplex TCP streaming rates
Network type                                Raw bit rate (Mbits)            Payload rate (Mb)        Payload rate (MB)
10 Mb Ethernet, Half Duplex                 10                              5.8                      0.7
10 Mb Ethernet, Full Duplex                 10 (20 Mb full duplex)          18                       2.2
100 Mb Ethernet, Half Duplex                100                             58                       7.0
100 Mb Ethernet, Full Duplex                100 (200 Mb full duplex)        177                      21.1
1000 Mb Ethernet, Full Duplex, MTU 1500     1000 (2000 Mb full duplex)      1811 (1667 peak)1        215 (222 peak)1
1000 Mb Ethernet, Full Duplex, MTU 9000     1000 (2000 Mb full duplex)      1936 (1938 peak)1        231 (231 peak)1
10 Gb Ethernet, Full Duplex, MTU 1500       10000 (20000 Mb full duplex)    14400 (18448 peak)1      1716 (2200 peak)1
10 Gb Ethernet, Full Duplex, MTU 9000       10000 (20000 Mb full duplex)    18000 (19555 peak)1      2162 (2331 peak)1
FDDI, MTU 4352 (default)                    100                             97                       11.6
ATM 155, MTU 1500                           155 (310 Mb full duplex)        180                      21.5
ATM 155, MTU 9180 (default)                 155 (310 Mb full duplex)        236                      28.2
ATM 622, MTU 1500                           622 (1244 Mb full duplex)       476                      56.7
ATM 622, MTU 9180 (default)                 622 (1244 Mb full duplex)       884                      105
[Entry Fields]
Ethernet Adapter ent0
Description 10/100/1000 Base-TX PCI-X Adapter (14106902)
Status Available
Location 1H-08
Receive descriptor queue size [1024] +#
Transmit descriptor queue size [512] +#
Software transmit queue size [8192] +#
Transmit jumbo frames yes +
Enable hardware transmit TCP resegmentation yes +
Enable hardware transmit and receive checksum yes +
Media Speed Auto_Negotiation +
Enable ALTERNATE ETHERNET address no +
ALTERNATE ETHERNET address [0x000000000000] +
Apply change to DATABASE only no +
------------------------------------------------------------------------------------------------
sockthresh            85        85        85        0         100       %_of_thewall   D
------------------------------------------------------------------------------------------------
fasttimo              200       200       200       50        200       millisecond    D
------------------------------------------------------------------------------------------------
inet_stack_size       16        16        16        1                   kbyte          R
------------------------------------------------------------------------------------------------
...lines omitted....
TYPE = parameter type: D (for Dynamic), S (for Static), R (for Reboot), B (for Bosboot), M (for Mount),
I (for Incremental) and C (for Connect)
Some network attributes are run-time attributes that can be changed at any time. Others are load-time
attributes that must be set before the netinet kernel extension is loaded.
Note: When you use the no command to change parameters, dynamic parameters are changed in
memory and the change is in effect only until the next system boot. At that point, all parameters are
set to their reboot settings. To make dynamic parameter changes permanent, use the -r or -p options of
the no command to set the options in the nextboot file. Reboot parameter options require a system
reboot to take effect.
# no -o tcp_fastlo=1
# no -o tcp_fastlo_crosswpar=1
Note: The two options tcp_fastlo and tcp_fastlo_crosswpar are currently disabled (set to 0) by
default. These options are reserved for future AIX releases.
The TCP fastpath loopback traffic is accounted for in separate statistics by the netstat command, when
the TCP connection is open. It is not accounted to the loopback interface. However, the TCP fastpath
loopback does use the TCP/IP and loopback device to establish and terminate the fast path connections,
therefore these packets are accounted for in the normal manner.
Interrupt avoidance
Interrupt handling is expensive in terms of host CPU cycles.
To handle an interrupt, the system must save its prior machine state, determine where the interrupt is
coming from, perform various housekeeping tasks, and call the proper device driver interrupt handler.
# ifconfig en0
en0: flags=5e080863,e0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,THREAD,CHAIN>
inet 192.1.0.1 netmask 0xffffff00 broadcast 192.1.0.255
# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,THREAD,CHAIN>
inet 192.1.0.1 netmask 0xffffff00 broadcast 192.1.0.255
The netstat -s command also displays some counters to show the number of packets processed by
threads and if the thread queues dropped any incoming packets. The following is an example of the
netstat -s command:
For jumbo frame mode, the default ISNO values for tcp_sendspace, tcp_recvspace, and rfc1323 are set
as follows:
# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 192.0.0.1 netmask 0xffffff00 broadcast 192.0.0.255
tcp_sendspace 262144 tcp_recvspace 131072 rfc1323 1
# smitty tcpip
Notice that the ISNO system defaults do not display, even though they are set internally. For this
example, override the default value for tcp_sendspace and lower it to 65536.
Bring the interface back up with smitty tcpip and select Minimum Configuration and Startup.
Then select en0, and take the default values that were set when the interface was first set up.
If you use the ifconfig command to show the ISNO options, you can see that the value of the
tcp_sendspace attribute is now set to 65536. The following is an example:
# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 192.0.0.1 netmask 0xffffff00 broadcast 192.0.0.255
tcp_sendspace 65536 tcp_recvspace 65536
The lsattr command output also shows that the system default has been overridden for this attribute:
# lsattr -E -l en0
alias4 IPv4 Alias including Subnet Mask True
alias6 IPv6 Alias including Prefix Length True
arp on Address Resolution Protocol (ARP) True
authority Authorized Users True
broadcast Broadcast Address True
mtu 1500 Maximum IP Packet Size for This Device True
netaddr 192.0.0.1 Internet Address True
netaddr6 IPv6 Internet Address True
netmask 255.255.255.0 Subnet Mask True
prefixlen Prefix Length for IPv6 Internet Address True
remmtu 576 Maximum IP Packet Size for REMOTE Networks True
rfc1323 Enable/Disable TCP RFC 1323 Window Scaling True
security none Security Level True
state up Current Interface Status True
tcp_mssdflt Set TCP Maximum Segment Size True
tcp_nodelay Enable/Disable TCP_NODELAY Option True
tcp_recvspace Set Socket Buffer Space for Receiving True
tcp_sendspace 65536 Set Socket Buffer Space for Sending True
Modifying the ISNO options with the chdev and ifconfig commands
You can use the following commands to first verify system and interface support and then to set and verify
the new values.
• Make sure the use_isno option is enabled by using the following command:
# no -a | grep isno
use_isno = 1
• Make sure the interface supports the five new ISNOs by using the lsattr -El command:
# lsattr -E -l en0 -H
attribute value description user_settable
:
rfc1323 Enable/Disable TCP RFC 1323 Window Scaling True
tcp_mssdflt Set TCP Maximum Segment Size True
tcp_nodelay Enable/Disable TCP_NODELAY Option True
tcp_recvspace Set Socket Buffer Space for Receiving True
tcp_sendspace Set Socket Buffer Space for Sending True
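• Set the ISNO values. The following is a minimal sketch with illustrative buffer sizes, assuming the
en0 interface. To change the values permanently in the ODM database, use the chdev command:
# chdev -l en0 -a tcp_sendspace=65536 -a tcp_recvspace=65536
or, to change the values for the current runtime only, use the ifconfig command:
# ifconfig en0 tcp_sendspace 65536 tcp_recvspace 65536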
Device       Speed          MTU size   tcp_sendspace   tcp_recvspace   sb_max(1)   rfc1323
Token Ring   4 or 16 Mbit   1492       16384           16384           32768       0
(1) It is suggested to use the default value of 1048576 for the sb_max tunable. The values shown in the
table are acceptable minimum values for the sb_max tunable.
(2) Performance is slightly better when using these options, with rfc1323 enabled, on jumbo frames on
Gigabit Ethernet.
(3) Certain combinations of TCP send and receive space will result in very low throughput (1 Mbit or
less). To avoid this problem, set the tcp_sendspace tunable to at least three times the MTU size, or to a
value greater than or equal to the receiver's tcp_recvspace value.
(4) TCP has only a 16-bit value to use for its window size. This translates to a maximum window
size of 65536 bytes. For adapters that have large MTU sizes (for example 32 KB or 64 KB), TCP
streaming performance might be very poor. For example, on a device with a 64 KB MTU size, and with a
tcp_recvspace set to 64 KB, TCP can only send one packet and then its window closes. It must wait for an
ACK back from the receiver before it can send again. This problem can be solved in one of the following
ways:
• Enable rfc1323, which enhances TCP and allows it to overcome the 16-bit limit so that it can use a
window size larger than 64 KB. You can then set the tcp_recvspace tunable to a large value, such as 10
times the MTU size, which allows TCP to stream data and thus provides good performance.
• Reduce the MTU size of the adapter. For example, use the ifconfig at0 mtu 16384 command to set
the ATM MTU size to 16 KB. This causes TCP to compute a smaller MSS value. With a 16 KB MTU size,
TCP can send four packets for a 64 KB window size.
The following are general guidelines for tuning TCP streaming workloads:
• Set the TCP send and receive space to at least 10 times the MTU size.
• You should enable rfc1323 when MTU sizes are above 8 KB to allow larger TCP receive space values.
• For high speed adapters, larger TCP send and receive space values help performance.
• For high speed adapters, the tcp_sendspace tunable value should be 2 times the value of tcp_recvspace.
• The rfc1323 for the lo0 interface is set by default. The default MTU size for lo0 is higher than 1500, so
the tcp_sendspace and tcp_recvspace tunables are set to 128K.
The ftp and rcp commands are examples of TCP applications that benefit from tuning the tcp_sendspace
and tcp_recvspace tunables.
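The following sketch applies these guidelines with the no command; the values are illustrative, and the
-p flag makes the change persistent across reboots:
# no -p -o rfc1323=1 -o tcp_sendspace=262144 -o tcp_recvspace=131072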
Dividing the capacity value by 8 provides a good estimate of the TCP window size needed to keep the
network pipeline full. The longer the round-trip delay and the faster the network speed, the larger the
bandwidth-delay product value, and thus the larger the TCP window. An example of this is a 100 Mbit
network with a round trip time of 0.2 milliseconds. You can calculate the bandwidth-delay product value
with the formula above:
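Assuming the usual bandwidth-delay formula (bandwidth in bits per second multiplied by the round-trip
time in seconds, divided by 8 to convert bits to bytes), the example works out as follows:
100,000,000 bits/sec x 0.0002 sec = 20,000 bits
20,000 bits / 8 = 2,500 bytes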
Thus, in this example, the TCP window size needs to be at least 2500 bytes. On 100 Mbit and Gigabit
Ethernet on a single LAN, you might want to set the tcp_recvspace and tcp_sendspace tunable values to at
least 2 or 3 times the computed bandwidth-delay product value for best performance.
# pmtu display
-------------------------------------------------------------------------
Unused PMTU entries, which are refcnt entries with a value of 0, are deleted to prevent the PMTU table
from getting too large. The unused entries are deleted pmtu_expire minutes after the refcnt value equals
0. The pmtu_expire network option has a default value of 10 minutes. To prevent PMTU entries from
expiring, you can set the pmtu_expire value to 0.
Route cloning is unnecessary with this implementation of TCP path MTU discovery, which means the
routing table is smaller and more manageable.
# lsattr -E -l entX
Table 7. Adapters and their available options, and system default settings
Adapter type                       Feature code    TCP checksum offload    Default setting    TCP large send    Default setting
GigE, PCI, SX & TX                 2969, 2975      Yes                     OFF                Yes               OFF
GigE, PCI-X, SX and TX             5700, 5701      Yes                     ON                 Yes               ON
GigE dual port PCI-X, TX and SX    5706, 5707      Yes                     ON                 Yes               ON
10 GigE PCI-X LR and SR            5718, 5719      Yes                     ON                 Yes               ON
10/100 Ethernet                    4962            Yes                     ON                 Yes               OFF
ATM 155, UTP & MMF                 4953, 4957      Yes (transmit only)     ON                 No                N/A
ATM 622, MMF                       2946            Yes                     ON                 No                N/A
UDP tuning
User Datagram Protocol (UDP) is a datagram protocol that is used by Network File System (NFS), name
server (named), Trivial File Transfer Protocol (TFTP), and other special purpose protocols.
Since UDP is a datagram protocol, the entire message (datagram) must be copied into the kernel on a
send operation as one atomic operation. The datagram is also received as one complete message on the
recv or recvfrom system call. You must set the udp_sendspace and udp_recvspace parameters to handle
the buffering requirements on a per-socket basis.
The largest UDP datagram that can be sent is 64 KB, minus the UDP header size (8 bytes) and the IP
header size (20 bytes for IPv4 or 40 bytes for IPv6 headers).
The following tunables affect UDP performance:
• udp_sendspace
• udp_recvspace
• UDP packet chaining
For example, an 8 KB NFS datagram sent over Ethernet with a 1500-byte MTU arrives as six IP fragments,
and each fragment typically lands in a 2048-byte receive buffer: 6*2048=12,288 bytes
Thus, you can see that the udp_recvspace must be adjusted higher depending on how efficient the
incoming buffering is. This will vary by datagram size and by device driver. Sending a 64 byte datagram
would consume a 2 KB buffer for each 64 byte datagram.
Then, you must account for the number of datagrams that may be queued onto this one socket. For
example, NFS server receives UDP packets at one well-known socket from all clients. If the queue depth
of this socket could be 30 packets, then you would use 30 * 12,288 = 368,640 for the udp_recvspace if
NFS is using 8 KB datagrams. NFS Version 3 allows up to 32 KB datagrams.
A suggested starting value for udp_recvspace is 10 times the value of udp_sendspace, because UDP may
not be able to pass a packet to the application before another one arrives. Also, several nodes can send
to one node at the same time. To provide some staging space, this size is set to allow 10 packets to be
staged before subsequent packets are discarded. For large parallel applications using UDP, the value may
have to be increased.
Note: The value of sb_max, which specifies the maximum socket buffer size for any socket buffer, should
be at least twice the size of the largest of the UDP and TCP send and receive buffers.
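The following sketch follows the sizing rules above, with illustrative values (udp_sendspace at the 64 KB
datagram limit, udp_recvspace at 10 times that, and sb_max at twice the largest buffer):
# no -p -o udp_sendspace=65536 -o udp_recvspace=655360 -o sb_max=1310720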
# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG>
inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1
# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1
Interrupt coalescing
To avoid flooding the host system with too many interrupts, packets are collected and one single interrupt
is generated for multiple packets. This is called interrupt coalescing.
For receive operations, interrupts typically inform the host CPU that packets have arrived on the device's
input queue. Without some form of interrupt moderation logic on the adapter, this might lead to an
interrupt for each incoming packet. However, as the incoming packet rate increases, the device driver
finishes processing one packet and checks to see if any more packets are on the receive queue before
exiting the driver and clearing the interrupt. The driver then finds that there are more packets to handle
and ends up handling multiple packets per interrupt as the packet rate increases, which means that the
system gets more efficient as the load increases.
However, some adapters provide additional features that can provide even more control on when receive
interrupts are generated. This is often called interrupt coalescing or interrupt moderation logic, which
allows several packets to be received and to generate one interrupt for several packets. A timer starts
when the first packet arrives, and then the interrupt is delayed for n microseconds or until m packets
arrive. The methods vary by adapter and by which of the features the device driver allows the user to
control.
Under light loads, interrupt coalescing adds latency to the packet arrival time. The packet is in host
memory, but the host is not aware of the packet until some time later. However, under higher packet
loads, the system performs more efficiently by using fewer CPU cycles because fewer interrupts are
generated and the host processes several packets per interrupt.
For AIX adapters that include the interrupt moderation feature, you should set the values to a moderate
level to reduce the interrupt overhead without adding large amounts of latency. For applications that
might need minimum latency, you should disable or change the options to allow more interrupts per
second for lower latency.
The Gigabit Ethernet adapters offer the interrupt moderation features. The FC 2969 and FC 2975 GigE
PCI adapters provide a delay value and a buffer count method. The adapter starts a timer when the first
packet arrives and then an interrupt occurs either when the timer expires or when n buffers in the host
have been used.
The FC 5700, FC 5701, FC 5706, and FC 5707 GigE PCI-X adapters use the interrupt throttle rate method,
which generates interrupts at a specified frequency that allows for the bunching of packets based on time.
The default interrupt rate is 10 000 interrupts per second. For lower interrupt overhead, you can set the
interrupt rate to a minimum of 2 000 interrupts per second. For workloads that call for lower latency
and faster response time, you can set the interrupt rate to a maximum of 20 000 interrupts. Setting the
interrupt rate to 0 disables the interrupt throttle completely.
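On adapters that expose the throttle rate as an ODM attribute, a sketch for trading latency against
interrupt overhead follows; the attribute name intr_rate and the device name ent0 are assumptions, so
list the legal values first:
# lsattr -R -l ent0 -a intr_rate
# chdev -l ent0 -a intr_rate=2000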
Transmit queues
For transmit, the device drivers can provide a transmit queue limit.
There can be both hardware queue and software queue limits, depending on the driver and adapter.
Some drivers have only a hardware queue; some have both hardware and software queues. Some drivers
internally control the hardware queue and only allow the software queue limits to be modified. Generally,
the device driver will queue a transmit packet directly to the adapter hardware queue. If the system CPU
is fast relative to the speed of the network, or on an SMP system, the system may produce transmit
packets faster than they can be transmitted over the network. This will cause the hardware queue to fill.
After the hardware queue is full, some drivers provide a software queue and they will then queue to the
software queue. If the software transmit queue limit is reached, then the transmit packets are discarded.
This can affect performance because the upper-level protocols must then time out and retransmit the
packet. At some point, however, the adapter must discard packets as providing too much space can result
in stale packets being sent.
For adapters that provide hardware queue limits, changing these values will cause more real memory
to be consumed on receives because of the control blocks and buffers associated with them. Therefore,
raise these limits only if needed or for larger systems where the increase in memory use is negligible.
For the software transmit queue limits, increasing these limits does not increase memory usage. It only
allows packets to be queued that were already allocated by the higher layer protocols.
Transmit descriptors
Some drivers allow you to tune the size of the transmit ring or the number of transmit descriptors.
The hardware transmit queue controls the maximum number of buffers that can be queued to the adapter
for concurrent transmission. One descriptor typically only points to one buffer and a message might be
sent in multiple buffers. Many drivers do not allow you to change the parameters.
Receive resources
Some adapters allow you to configure the number of resources used for receiving packets from the
network. This might include the number of receive buffers (and even their size) or the number of DMA
receive descriptors.
Some drivers have multiple receive buffer pools with buffers of different sizes that might need to be tuned
for different workloads. Some drivers manage these resources internal to the driver and do not allow you
to change them.
The receive resources might need to be increased to handle peak bursts on the network. The network
interface device driver places incoming packets on a receive queue. If the receive descriptor list or ring is
full, or no buffers are available, packets are dropped, resulting in the sender needing to retransmit. The
receive descriptor queue is tunable using the SMIT tool or the chdev command (see “Changing network
parameters” on page 261). The maximum queue size is specific to each type of communication adapter
and can normally be viewed using the F4 or List key in the SMIT tool.
Adapter type                   Feature code                     ODM attribute    Default value    Range
Gigabit Ethernet PCI           2969, 2975                       rx_queue_size    512              512 (fixed)
  (SX or TX)
Gigabit Ethernet PCI-X         5700, 5701, 5706, 5707, 5717,    rxbuf_pool_sz    2048             512-16384, by 1
  (SX or TX)                   5768, 5271, 5274, 5767, 5281     rxdesc_que_sz    1024             128-3840, by 128
10 Gigabit PCI-X (SR or LR)    5718, 5719                       rxdesc_que_sz    1024             128-1024, by 128
                                                                rxbuf_pool_sz    2048             512-2048
Note:
1. The ATM adapter's rx_buf4k_max attribute is the maximum number of buffers in the receive buffer
pool. When the value is set to 0, the driver assigns a number based on the amount of memory on
the system (rx_buf4k_max= thewall * 6 / 320, for example), but with upper limits of 9500 buffers
for the ATM 155 adapter and 16360 buffers for the ATM 622 adapter. Buffers are released (down to
rx_buf4k_min) when not needed.
2. The ATM adapter's rx_buf4k_min attribute is the minimum number of free buffers in the pool. The
driver tries to keep only this amount of free buffers in the pool. The pool can expand up to the
rx_buf4k_max value.
# lsattr -E -l atm0
adapter_clock 0 Provide SONET Clock True
alt_addr 0x0 ALTERNATE ATM MAC address (12 hex digits) True
busintr 99 Bus Interrupt Level False
interface_type 0 Sonet or SDH interface True
intr_priority 3 Interrupt Priority False
max_vc 1024 Maximum Number of VCs Needed True
min_vc 64 Minimum Guaranteed VCs Supported True
regmem 0xe0008000 Bus Memory address of Adapter Registers False
rx_buf4k_max 0 Maximum 4K-byte pre-mapped receive buffers True
rx_buf4k_min 256 Minimum 4K-byte pre-mapped receive buffers True
rx_checksum yes Enable Hardware Receive Checksum True
rx_dma_mem 0x4000000 Receive bus memory address range False
sw_txq_size 2048 Software Transmit Queue size True
tx_dma_mem 0x2000000 Transmit bus memory address range False
uni_vers auto_detect SVC UNI Version True
use_alt_addr no Enable ALTERNATE ATM MAC address True
virtmem 0xe0000000 Bus Memory address of Adapter Virtual Memory False
Following is an example of the settings of a PCI-X Gigabit Ethernet adapter using the lsattr -E -l
ent0 command. This output shows the tx_que_size set to 8192, the rxbuf_pool_sz set to 2048, and the
rx_que_size set to 1024.
# lsattr -E -l ent0
[Entry Fields]
Ethernet Adapter ent2
Description 10/100/1000 Base-TX PCI-X Adapter (14106902)
Status Available
Location 1V-08
Receive descriptor queue size [1024] +#
An alternative method to change these parameter values is to run the following command:
For example, to change the above tx_que_size on en0 to 128, use the following sequence of commands.
Note that this driver only supports four different sizes, so it is better to use the SMIT command to see
these values.
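The command sequence was not preserved here; the following is a sketch, assuming the en0 interface
over the ent0 adapter (the interface must be detached before the adapter attribute can be changed, and
the address shown is hypothetical):
# ifconfig en0 detach
# chdev -l ent0 -a tx_que_size=128
# ifconfig en0 192.1.6.1 netmask 255.255.255.0 up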
The TCP header size is 20 bytes, the IPv4 header size is 20 bytes, and the IPv6 header size is 40 bytes.
Because this is the largest possible MSS that can be accommodated without IP fragmentation, this value
is inherently optimal, so no MSS-tuning is required for local networks.
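For example, on a local Ethernet with a 1500-byte MTU:
MSS (IPv4) = 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes
MSS (IPv6) = 1500 - 40 (IP header) - 20 (TCP header) = 1440 bytes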
# pmtu display
-------------------------------------------------------------------------
Unused PMTU entries, which are refcnt entries with a value of 0, are deleted to prevent the PMTU table
from getting too large. The unused entries are deleted pmtu_expire minutes after the refcnt value equals
0. The pmtu_expire network option has a default value of 10 minutes. To prevent PMTU entries from
expiring, you can set the pmtu_expire value to 0.
Static routes
You can override the default MSS value of 1460 bytes by specifying a static route to a specific remote
network.
Use the -mtu option of the route command to specify the MTU to that network. In this case, you would
specify the actual minimum MTU of the route, rather than calculating an MSS value. For example, the
following command sets the default MTU size to 1500 for a route to network 192.3.3 and the default host
to get to that gateway is en0host2:
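The command itself was not preserved here; the following is a sketch consistent with the description
(the netmask shown is an assumption):
# route add -net 192.3.3 -netmask 255.255.255.0 en0host2 -mtu 1500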
The netstat -r command displays the route table and shows that the PMTU size is 1500 bytes. TCP
computes the MSS from that MTU size. The following is an example of the netstat -r command:
# netstat -r
Routing tables
Destination Gateway Flags Refs Use If PMTU Exp Groups
Note: The netstat -r command does not display the PMTU value. You can view the PMTU value with the
pmtu display command. When you add a route for a destination with the route add command and
you specify the MTU value, a PMTU entry is created in the PMTU table for that destination.
In a small, stable environment, this method allows precise control of MSS on a network-by-network basis.
The disadvantages of this approach are as follows:
• It does not work with dynamic routing.
• It becomes impractical when the number of remote networks increases.
• Static routes must be set at both ends to ensure that both ends negotiate with a larger-than-default
MSS.
# no -o ipqmaxlen=100
# netstat -m
# netstat -p arp
arp:
6 packets sent
0 packets purged
You can display the ARP table with the arp -a command. The command output shows those addresses
that are in the ARP table and how those addresses are hashed and to what buckets.
...lines omitted...
hosts=value,value,value
where value may be (lowercase only) bind, local, nis, bind4, bind6, local4, local6, nis4, or
nis6 (for /etc/hosts). The order is specified on one line with values separated by commas. White
spaces are permitted between the commas and the equal sign.
The values specified and their ordering is dependent on the network configuration. For example, if
the local network is organized as a flat network, then only the /etc/hosts file is needed. The /etc/
netsvc.conf file would contain the following line:
hosts=local
NSORDER=local
• If the local network is a domain network using a name server for name resolution and an /etc/hosts
file for backup, specify both services. The /etc/netsvc.conf file would contain the following line:
hosts=bind,local
NSORDER=bind,local
The algorithm attempts the first source in the list. The algorithm will then determine whether to try
another specified service based on:
• Current service is not running; therefore, it is unavailable.
• Current service could not find the name and is not authoritative.
ping command
The ping command is useful for determining the status of the network and various foreign hosts, tracking
and isolating hardware and software problems, and testing, measuring, and managing networks.
Some ping command options relevant to performance tuning are as follows:
-c
Specifies the number of packets. This option is useful when you get an IP trace log. You can capture a
minimum of ping packets.
-s
Specifies the length of packets. You can use this option to check fragmentation and reassembly.
-f
Sends the packets at 10 ms intervals or immediately after each response. Only the root user can use
this option.
If you need to load your network or systems, the -f option is convenient. For example, if you suspect that
your problem is caused by a heavy load, load your environment intentionally to confirm your suspicion.
Open several aixterm windows and run the ping -f command in each window. Your Ethernet utilization
quickly gets to around 100 percent. The following is an example:
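(The flood example itself was not preserved here; the following is a representative invocation, with a
hypothetical target address.)
# ping -f -c 1000 192.1.6.1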
Note: The ping command can be very hard on a network and should be used with caution. Flood-pinging
can only be performed by the root user.
In this example, 1000 packets were sent within 1 second. Be aware that this command uses IP and the
Internet Control Message Protocol (ICMP) and therefore no transport protocol (UDP/TCP) or application
activities are involved. The measured data, such as round-trip time, does not reflect the total
performance characteristics.
When you try to send a flood of packets to your destination, consider several points:
• Sending packets puts a load on your system.
• Use the netstat -i command to monitor the status of your network interface during the experiment.
You may find that the system is dropping packets during a send by looking at the Oerrs output.
• You should also monitor other resources, such as mbufs and send/receive queue. It can be difficult
to place a heavy load onto the destination system. Your system might be overloaded before the other
system is.
ftp command
You can use the ftp command to send a very large file by using /dev/zero as input and /dev/null as
output. This allows you to transfer a large file without involving disks (which might be a bottleneck) and
without having to cache the entire file in memory.
Use the following ftp subcommands (change count to increase or decrease the number of blocks read by
the dd command):
> bin
> put "|dd if=/dev/zero bs=32k count=10000" /dev/null
The above command transfers 10000 blocks of data and each block is 32 KB in size. To increase or
decrease the size of the file transferred, change the count of blocks read by the dd command (the count
parameter) or change the block size (the bs parameter). Note that the default file type for the ftp
command is ASCII, which is slower because all bytes have to be scanned. Binary mode, set with the bin
subcommand, should be used for transfers whenever possible.
Make sure that tcp_sendspace and tcp_recvspace are at least 65535 for the Gigabit Ethernet "jumbo
frames" and for the ATM with MTU 9180 or larger to get good performance due to larger MTU size. A
size of 131072 bytes (128 KB) is recommended for optimal performance. If you configure your Gigabit
Ethernet adapters with the SMIT tool, the ISNO system default values should be properly set. The ISNO
options do not get properly set if you use the ifconfig command to bring up the network interfaces.
An example to set the parameters is as follows:
# no -o tcp_sendspace=65535
# no -o tcp_recvspace=65535
ftp> bin
200 Type set to I.
ftp> put "|dd if=/dev/zero bs=32k count=10000" /dev/null
200 PORT command successful.
150 Opening data connection for /dev/null.
10000+0 records in
10000+0 records out
226 Transfer complete.
327680000 bytes sent in 2.789 seconds (1.147e+05 Kbytes/s)
local: |dd if=/dev/zero bs=32k count=10000 remote: /dev/null
ftp> quit
221 Goodbye.
The above data transfer was executed between two Gigabit Ethernet adapters using a 1500-byte MTU,
and the throughput was reported to be 114700 KB/sec, which is the equivalent of 112 MB/sec or 940
Mbps. When the sender and receiver used jumbo frames, with an MTU size of 9000, the reported
throughput was 120700 KB/sec or 117.87 MB/sec or 989 Mbps, as you can see in the following example:
ftp> bin
200 Type set to I.
ftp> put "|dd if=/dev/zero bs=32k count=10000" /dev/null
200 PORT command successful.
150 Opening data connection for /dev/null.
10000+0 records in
10000+0 records out
226 Transfer complete.
327680000 bytes sent in 2.652 seconds (1.207e+05 Kbytes/s)
local: |dd if=/dev/zero bs=32k count=10000 remote: /dev/null
The following is an example of an ftp data transfer between two 10/100 Mbps Ethernet interfaces:
ftp> bin
200 Type set to I.
The throughput of the above data transfer is 11570 KB/sec which is the equivalent of 11.3 MB/sec or 94.7
Mbps.
netstat command
The netstat command is used to show network status.
Traditionally, it is used more for problem determination than for performance measurement. However, the
netstat command can be used to determine the amount of traffic on the network to ascertain whether
performance problems are due to network congestion.
The netstat command displays information regarding traffic on the configured network interfaces, such
as the following:
• The address of any protocol control blocks associated with the sockets and the state of all sockets
• The number of packets received, transmitted, and dropped in the communications subsystem
• Cumulative statistics per interface
• Routes and their status
# netstat -in
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 link#2 0.9.6b.3e.0.55 28800 0 506 0 0
en1 1500 10.3.104 10.3.104.116 28800 0 506 0 0
fc0 65280 link#3 0.0.c9.33.17.46 12 0 11 0 0
fc0 65280 192.6.0 192.6.0.1 12 0 11 0 0
en0 1500 link#4 0.2.55.6a.a5.dc 14 0 20 5 0
en0 1500 192.1.6 192.1.6.1 14 0 20 5 0
lo0 16896 link#1 33339 0 33343 0 0
lo0 16896 127 127.0.0.1 33339 0 33343 0 0
Then increase the send queue size (xmt_que_size) for that interface. The size of the xmt_que_size can
be checked with the following command:
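A sketch, assuming the hypothetical adapter name ent0 (the exact queue attribute name varies by
adapter type):
# lsattr -E -l ent0 | grep que_size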
Then there is a high network utilization, and a reorganization or partitioning may be necessary. Use the
netstat -v or entstat command to determine the collision rate.
netstat -i -Z command
This function of the netstat command clears all the statistic counters for the netstat -i command to
zero.
# netstat -I en0 1
input (en0) output input (Total) output
packets errs packets errs colls packets errs packets errs colls
0 0 27 0 0 799655 0 390669 0 0
0 0 0 0 0 2 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 78 0 254 0 0
0 0 0 0 0 200 0 62 0 0
0 0 1 0 0 0 0 2 0 0
The previous example shows the netstat -I command output for the en0 interface. Two reports are
generated side by side, one for the specified interface and one for all available interfaces (Total). The
fields are similar to the ones in the netstat -i example, input packets = Ipkts, input errs =
Ierrs and so on.
# netstat -a
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 *.daytime *.* LISTEN
tcp 0 0 *.ftp *.* LISTEN
tcp 0 0 *.telnet *.* LISTEN
tcp4 0 0 *.time *.* LISTEN
tcp4 0 0 *.sunrpc *.* LISTEN
tcp 0 0 *.exec *.* LISTEN
tcp 0 0 *.login *.* LISTEN
tcp 0 0 *.shell *.* LISTEN
tcp4 0 0 *.klogin *.* LISTEN
tcp4 0 0 *.kshell *.* LISTEN
tcp 0 0 *.netop *.* LISTEN
tcp 0 0 *.netop64 *.* LISTEN
tcp4 0 1028 brown10.telnet remote_client.mt.1254 ESTABLISHED
tcp4 0 0 *.wsmserve *.* LISTEN
udp4 0 0 *.daytime *.*
udp4 0 0 *.time *.*
udp4 0 0 *.sunrpc *.*
udp4 0 0 *.ntalk *.*
udp4 0 0 *.32780 *.*
Active UNIX domain sockets
SADR/PCB Type Recv-Q Send-Q Inode Conn Refs Nextref Addr
71759200 dgram 0 0 13434d00 0 0 0 /dev/SRC
7051d580
71518a00 dgram 0 0 183c3b80 0 0 0 /dev/.SRC-unix/SRCCwfCEb
You can view detailed information for each socket with the netstat -ao command. In the following
example, the ftp socket runs over a Gigabit Ethernet adapter configured for jumbo frames:
# netstat -ao
so_options: (REUSEADDR|OOBINLINE)
so_state: (ISCONNECTED|PRIV)
timeo:0 uid:0
so_special: (LOCKBALE|MEMCOMPRESS|DISABLE)
so_special2: (PROC)
sndbuf:
hiwat:134220 lowat:33555 mbcnt:0 mbmax:536880
rcvbuf:
hiwat:134220 lowat:1 mbcnt:0 mbmax:536880
sb_flags: (WAIT)
TCP:
mss:8948 flags: (NODELAY|RFC1323|SENT_WS|RCVD_WS|SENT_TS|RCVD_TS)
so_options: (REUSEADDR|KEEPALIVE|OOBINLINE)
so_state: (ISCONNECTED|NBIO)
timeo:0 uid:0
so_special: (NOUAREA|LOCKBALE|EXTPRIV|MEMCOMPRESS|DISABLE)
so_special2: (PROC)
sndbuf:
hiwat:16384 lowat:4125 mbcnt:0 mbmax:65536
sb_flags: (SEL|NOINTR)
rcvbuf:
hiwat:66000 lowat:1 mbcnt:0 mbmax:264000
sb_flags: (SEL|NOINTR)
TCP:
mss:1375
so_options: (REUSEADDR|KEEPALIVE|OOBINLINE)
so_state: (ISCONNECTED|NBIO)
timeo:0 uid:0
so_options: (ACCEPTCONN|REUSEADDR)
q0len:0 qlen:0 qlimit:1000 so_state: (PRIV)
timeo:0 uid:0
so_special: (LOCKBALE|MEMCOMPRESS|DISABLE)
so_special2: (PROC)
sndbuf:
hiwat:16384 lowat:4096 mbcnt:0 mbmax:65536
rcvbuf:
hiwat:16384 lowat:1 mbcnt:0 mbmax:65536
sb_flags: (SEL)
TCP:
mss:512
so_options: (ACCEPTCONN|REUSEADDR)
q0len:0 qlen:0 qlimit:1000 so_state: (PRIV)
timeo:0 uid:0
so_special: (LOCKBALE|MEMCOMPRESS|DISABLE)
so_special2: (PROC)
sndbuf:
hiwat:16384 lowat:4096 mbcnt:0 mbmax:65536
rcvbuf:
hiwat:16384 lowat:1 mbcnt:0 mbmax:65536
sb_flags: (SEL)
TCP:
mss:512
so_options: (REUSEADDR|KEEPALIVE|OOBINLINE)
so_state: (ISCONNECTED|NBIO)
timeo:0 uid:0
so_special: (NOUAREA|LOCKBALE|EXTPRIV|MEMCOMPRESS|DISABLE)
so_special2: (PROC)
sndbuf:
hiwat:16384 lowat:4125 mbcnt:65700 mbmax:65536
sb_flags: (SEL|NOINTR)
rcvbuf:
hiwat:16500 lowat:1 mbcnt:0 mbmax:66000
sb_flags: (SEL|NOINTR)
TCP:
mss:1375
so_options: (REUSEADDR)
so_state: (PRIV)
timeo:0 uid:0
so_special: (LOCKBALE|DISABLE)
so_special2: (PROC)
sndbuf:
hiwat:9216 lowat:4096 mbcnt:0 mbmax:36864
rcvbuf:
hiwat:42080 lowat:1 mbcnt:0 mbmax:168320
sb_flags: (SEL)
[...]
In the above example, the adapter is configured for jumbo frames, which is the reason for the large MSS
value and the reason that rfc1323 is set.
netstat -M command
The netstat -M command displays the network memory's cluster pool statistics.
The following example shows the output of the netstat -M command:
# netstat -M
Cluster pool Statistics:
netstat -v command
The netstat -v command displays the statistics for each Common Data Link Interface (CDLI)-based
device driver that is in operation.
Interface-specific reports can be requested using the tokstat, entstat, fddistat, or atmstat
commands.
Every interface has its own specific information and some general information. The following example
shows the Token-Ring and Ethernet part of the netstat -v command; other interface parts are similar.
With a different adapter, the statistics will differ somewhat. The most important output fields are
highlighted.
# netstat -v
-------------------------------------------------------------
ETHERNET STATISTICS (ent1) :
Device Type: 10/100 Mbps Ethernet PCI Adapter II (1410ff01)
Hardware Address: 00:09:6b:3e:00:55
Elapsed Time: 0 days 17 hours 38 minutes 35 seconds
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 200
Driver Flags: Up Broadcast Running
Simplex AlternateAddress 64BitSupport
ChecksumOffload PrivateSegment DataRateSet
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 2000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
PrivateSegment LargeSend DataRateSet
If the result is greater than 5 percent, reorganize the network to balance the load.
• Another indication for a high network load is (from the command netstat -v):
If the total number of collisions from the netstat -v output (for Ethernet) is greater than 10 percent of
the total transmitted packets, as follows:
netstat -p protocol
The netstat -p protocol shows statistics about the value specified for the protocol variable (udp, tcp,
sctp, ip, icmp), which is either a well-known name for a protocol or an alias for it.
Some protocol names and aliases are listed in the /etc/protocols file. A null response indicates that
there are no numbers to report. If there is no statistics routine for the specified protocol, the program
reports that the value specified for the protocol variable is unknown.
The following example shows the output for the ip protocol:
# netstat -p ip
ip:
# netstat -p udp
udp:
11623 datagrams received
0 incomplete headers
0 bad data length fields
0 bad checksums
620 dropped due to no socket
10989 broadcast/multicast datagrams dropped due to no socket
0 socket buffer overflows
14 delivered
12 datagrams output
# netstat -p tcp
tcp:
576 packets sent
512 data packets (62323 bytes)
0 data packets (0 bytes) retransmitted
55 ack-only packets (28 delayed)
0 URG only packets
0 window probe packets
0 window update packets
9 control packets
0 large sends
0 bytes sent using largesend
0 bytes is the biggest largesend
719 packets received
504 acks (for 62334 bytes)
19 duplicate acks
0 acks for unsent data
449 packets (4291 bytes) received in-sequence
8 completely duplicate packets (8 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
5 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
2 window update packets
0 packets received after close
netstat -s -s
The undocumented -s -s option shows only those lines of the netstat -s output that are not zero,
making it easier to look for error counts.
netstat -s -Z
The netstat -s -Z command clears all the statistic counters for the netstat -s command to zero.
netstat -r
Another option relevant to performance is the display of the discovered Path Maximum Transmission Unit
(PMTU). Use the netstat -r command to display this value.
For two hosts communicating across a path of multiple networks, a transmitted packet will become
fragmented if its size is greater than the smallest MTU of any network in the path. Because packet
fragmentation can result in reduced network performance, it is desirable to avoid fragmentation by
transmitting packets with a size no larger than the smallest MTU in the network path. This size is called
the path MTU.
The following is an example of the netstat -r -f inet command used to display only the routing tables:
# netstat -r -f inet
Routing tables
Destination Gateway Flags Refs Use If PMTU Exp Groups
netstat -D
The -D option allows you to see packets coming into and going out of each layer in the communications
subsystem along with packets dropped at each layer.
# netstat -D
The Devices layer shows the number of packets coming into the adapter, the number going out of the
adapter, and the number of packets dropped on input and output. There are various causes of adapter
errors, and the netstat -v command can be examined for more details.
The Drivers layer shows packet counts handled by the device driver for each adapter. Output of the
netstat -v command is useful here to determine which errors are counted.
The Demuxer values show packet counts at the demux layer, and Idrops here usually indicate that
filtering has caused packets to be rejected (for example, Netware or DecNet packets being rejected
because these are not handled by the system under examination).
Details for the Protocols layer can be seen in the output of the netstat -s command.
Note: In the statistics output, a N/A displayed in a field value indicates the count is not applicable. For the
NFS/RPC statistics, the number of incoming packets that pass through RPC are the same packets which
pass through NFS, so these numbers are not summed in the NFS/RPC Total field, hence the N/A. NFS
has no outgoing packet or outgoing packet drop counters specific to NFS and RPC. Therefore, individual
counts have a field value of N/A, and the cumulative count is stored in the NFS/RPC Total field.
netpmon command
The netpmon command uses the trace facility to obtain a detailed picture of network activity during a
time interval. Because it uses the trace facility, the netpmon command can be run only by a root user or
by a member of the system group.
The netpmon command cannot run together with any of the other trace-based performance commands
such as tprof and filemon. In its usual mode, the netpmon command runs in the background while
one or more application programs or system commands are being executed and monitored.
Tracing is started by the netpmon command, optionally suspended with the trcoff subcommand and
resumed with the trcon subcommand. As soon as tracing is terminated, the netpmon command writes
its report to standard output.
At this point, an adjusted trace logfile is fed into the netpmon command to report on I/O activity captured
by a previously recorded trace session as follows:
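The invocation itself was not preserved here; the following is a sketch consistent with the description
(depending on how the trace was captured, a gennames file supplied with the -n flag may also be
required):
# netpmon -i trace.rpt | pg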
In this example, the netpmon command reads file system trace events from the trace.rpt input file.
Because the trace data is already captured on a file, the netpmon command does not put itself in the
background to allow application programs to be run. After the entire file is read, a network activity report
will be displayed on standard output (which, in this example, is piped to the pg command).
If the trace command was run with the -C all flag, then run the trcrpt command also with the -C all
flag (see “Formatting a report from trace -C output ” on page 364).
The following netpmon command running on an NFS server executes the sleep command and creates a
report after 400 seconds. During the measured interval, a copy to an NFS-mounted file system /nfs_mnt
is taking place.
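A command sequence consistent with that description is the following (the file name netpmon.out
matches the report displayed below):
# netpmon -o netpmon.out -O all; sleep 400; trcstop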
With the -O option, you can specify the report type to be generated. Valid report type values are:
cpu
CPU usage
dd
Network device-driver I/O
so
Internet socket call I/O
nfs
NFS I/O
all
All reports are produced. This is the default value.
# cat netpmon.out
========================================================================
(Most-active process and first-level interrupt handler report sections omitted from this excerpt.)
========================================================================
SLIH: phxentdd32
count: 33256
cpu time (msec): avg 0.074 min 0.018 max 288.374 sdev 1.581
========================================================================
DEVICE: ethernet 4
recv packets: 33003
recv sizes (bytes): avg 73.2 min 60 max 618 sdev 43.8
recv times (msec): avg 0.000 min 0.000 max 0.005 sdev 0.000
demux times (msec): avg 0.060 min 0.004 max 288.360 sdev 1.587
xmit packets: 61837
xmit sizes (bytes): avg 1514.0 min 1349 max 1514 sdev 0.7
xmit times (msec): avg 3.773 min 2.026 max 293.112 sdev 8.947
========================================================================
(Intervening global report sections omitted from this excerpt.)
========================================================================
CLIENT: client_machine
other calls: 2744
other times (msec): avg 0.192 min 0.075 max 0.311 sdev 0.025
The output of the netpmon command is composed of two different types of reports: global and detailed.
The global reports list statistics as follows:
• Most active processes
• First-level interrupt handlers
• Second-level interrupt handlers
• Network device drivers
• Network device-driver transmits
• TCP socket calls
• NFS server or client statistics
The global reports are shown at the beginning of the netpmon output and are the occurrences during the
measured interval. The detailed reports provide additional information for the global reports. By default,
the reports are limited to the 20 most active statistics measured. All information in the reports is listed
from top to bottom as most active to least active.
If the network is congested already, changing the MTU or queue value will not help.
Note:
1. If transmit and receive packet sizes are small on the device driver statistics report, then increasing the
current MTU size will probably result in better network performance.
2. If system wait time due to network calls is high from the network wait time statistics for the NFS client
report, the poor performance is due to the network.
traceroute command
The traceroute command is intended for use in network testing, measurement, and management.
While the ping command confirms IP network reachability, it cannot pinpoint and improve some
isolated problems. Consider the following situations:
• When there are many hops (for example, gateways or routes) between your system and the destination,
and there seems to be a problem somewhere along the path. The destination system may have a
problem, but you need to know where a packet is actually lost.
• The ping command hangs up and does not tell you the reasons for a lost packet.
The traceroute command can inform you where the packet is located and why the route is lost. If your
packets must pass through routers and links, which belong to and are managed by other organizations or
companies, it is difficult to check the related routers through the telnet command. The traceroute
command provides a supplemental role to the ping command.
Note: The traceroute command should be used primarily for manual fault isolation. Because of the
load it imposes on the network, do not use the traceroute command during typical operations or from
automated scripts.
# traceroute aix1
trying to get source for aix1
source should be 10.53.155.187
traceroute to aix1.austin.ibm.com (10.53.153.120) from 10.53.155.187 (10.53.155.187), 30 hops max
outgoing MTU = 1500
1 10.111.154.1 (10.111.154.1) 5 ms 3 ms 2 ms
2 aix1 (10.53.153.120) 5 ms 5 ms 5 ms
# traceroute aix1
trying to get source for aix1
source should be 10.53.155.187
traceroute to aix1.austin.ibm.com (10.53.153.120) from 10.53.155.187 (10.53.155.187), 30 hops max
outgoing MTU = 1500
1 10.111.154.1 (10.111.154.1) 10 ms 2 ms 3 ms
2 aix1 (10.53.153.120) 8 ms 7 ms 5 ms
After the address resolution protocol (ARP) entry expired, the same command was repeated. Note that
the first packet to each gateway or destination took a longer round-trip time. This is due to the overhead
caused by the ARP. If a public-switched network (WAN) is involved in the route, the first packet consumes
a lot of time due to connection establishment and may cause a timeout. The default timeout for each
packet is 3 seconds. You can change it with the -w option.
The first 10 ms is due to the ARP between the source system (10.53.155.187) and the gateway
10.111.154.1. The second 8 ms is due to the ARP between the gateway and the final destination (aix1). In
this case, you are using DNS, and every time before the traceroute command sends a packet, the DNS
server is searched.
# traceroute lamar
trying to get source for lamar
source should be 9.53.155.187
traceroute to lamar.austin.ibm.com (9.3.200.141) from 9.53.155.187 (9.53.155.187), 30 hops max
outgoing MTU = 1500
1 9.111.154.1 (9.111.154.1) 12 ms 3 ms 2 ms
2 9.111.154.1 (9.111.154.1) 3 ms !H * 6 ms !H
If an ICMP error message, excluding Time Exceeded and Port Unreachable, is received, it is
displayed as follows:
!H
Host Unreachable
!N
Network Unreachable
!P
Protocol Unreachable
!S
Source route failed
!F
Fragmentation needed
# traceroute chuys
trying to get source for chuys
source should be 9.53.155.187
traceroute to chuys.austin.ibm.com (9.53.155.188) from 9.53.155.187 (9.53.155.187), 30 hops max
outgoing MTU = 1500
1 * * *
2 * * *
3 * * *
^C#
If you think that the problem is due to a communication link, use a longer timeout period with the -w flag.
Although rare, all the ports queried might have been used. You can change the ports and try again.
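iptrace daemon
The iptrace daemon records Internet packets received from configured interfaces and is typically
started under the System Resource Controller. An invocation consistent with the description that
follows would be (the interface name and log file path are taken from that description):
# startsrc -s iptrace -a "-i en0 /home/user/iptrace/log1"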
This command starts the iptrace daemon with instructions to trace all activity on the Gigabit Ethernet
interface, en0, and place the trace data in /home/user/iptrace/log1. To stop the daemon, use the
following:
# stopsrc -s iptrace
If you did not start the iptrace daemon with the startsrc command, you must use the ps command
to find its process ID and terminate it with the kill command.
The ipreport command is a formatter for the log file. Its output is written to standard output. Options
allow recognition and formatting of RPC packets (-r), identifying each packet with a number (-n),
and prefixing each line with a 3-character string that identifies the protocol (-s). A typical ipreport
command to format the log1 file just created (which is owned by the root user) would be as follows:
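Given the options above, the invocation likely takes the following form (the output file name
log1_formatted is reused later in this section):
# ipreport -ns log1 > log1_formatted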
This would result in a sequence of packet reports similar to the following examples. The first packet is the
first half of a ping packet. The fields of most interest are as follows:
• The source (SRC) and destination (DST) host address, both in dotted decimal and in ASCII
• The IP packet length (ip_len)
• The indication of the higher-level protocol in use (ip_p)
Packet Number 7
ETH: ====( 98 bytes transmitted on interface en0 )==== 10:28:16.516070112
ETH: [ 00:02:55:6a:a5:dc -> 00:02:55:af:20:2b ] type 800 (IP)
IP: < SRC = 192.1.6.1 > (en6host1)
IP: < DST = 192.1.6.2 > (en6host2)
IP: ip_v=4, ip_hl=20, ip_tos=0, ip_len=84, ip_id=1789, ip_off=0
IP: ip_ttl=255, ip_sum=28a6, ip_p = 1 (ICMP)
ICMP: icmp_type=8 (ECHO_REQUEST) icmp_id=18058 icmp_seq=3
Packet Number 8
ETH: ====( 98 bytes received on interface en0 )==== 10:28:16.516251667
ETH: [ 00:02:55:af:20:2b -> 00:02:55:6a:a5:dc ] type 800 (IP)
The next example is a frame from an ftp operation. Note the IP packet length (ip_len=1163 bytes)
relative to the interface MTU of 1500 bytes on this LAN.
Packet Number 20
ETH: ====( 1177 bytes transmitted on interface en0 )==== 10:35:45.432353167
ETH: [ 00:02:55:6a:a5:dc -> 00:02:55:af:20:2b ] type 800 (IP)
IP: < SRC = 192.1.6.1 > (en6host1)
IP: < DST = 192.1.6.2 > (en6host2)
IP: ip_v=4, ip_hl=20, ip_tos=8, ip_len=1163, ip_id=1983, ip_off=0
IP: ip_ttl=60, ip_sum=e6a0, ip_p = 6 (TCP)
TCP: <source port=32873, destination port=20(ftp-data) >
TCP: th_seq=623eabdc, th_ack=973dcd95
TCP: th_off=5, flags<PUSH | ACK>
TCP: th_win=17520, th_sum=0, th_urp=0
TCP: 00000000 69707472 61636520 322e3000 00008240 |iptrace 2.0....@|
TCP: 00000010 2e4c9d00 00000065 6e000065 74000053 |.L.....en..et..S|
TCP: 00000020 59535841 49584906 01000040 2e4c9d1e |[email protected]..|
TCP: 00000030 c0523400 0255af20 2b000255 6aa5dc08 |.R4..U. +..Uj...|
TCP: 00000040 00450000 5406f700 00ff0128 acc00106 |.E..T......(....|
TCP: 00000050 01c00106 0208005a 78468a00 00402e4c |[email protected]|
TCP: 00000060 9d0007df 2708090d 0a0b0c0d 0e0f1011 |....'...........|
TCP: 00000070 12131415 16171819 1a1b1c1d 1e1f2021 |.............. !|
TCP: 00000080 22232425 26272829 2a2b2c2d 2e2f3031 |"#$%&'()*+,-./01|
TCP: 00000090 32333435 36370000 0082402e 4c9d0000 |[email protected]...|
--------- Lots of uninteresting data omitted -----------
TCP: 00000440 15161718 191a1b1c 1d1e1f20 21222324 |........... !"#$|
TCP: 00000450 25262728 292a2b2c 2d2e2f30 31323334 |%&'()*+,-./01234|
TCP: 00000460 353637 |567 |
The ipfilter command extracts different operation headers from an ipreport output file and displays
them in a table. Some customized NFS information regarding requests and replies is also provided.
To determine whether the ipfilter command is installed and available, run the following command:
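The usual check is the lslpp command against the fileset that ships the tool (perfagent.tools is an
assumption based on the related performance tools):
# lslpp -lI perfagent.tools
To run the ipfilter command against the formatted log created earlier: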
# ipfilter log1_formatted
The operation headers currently recognized are: udp, nfs, tcp, ipx, icmp. The ipfilter command has
three different types of reports, as follows:
• A single file (ipfilter.all) that displays a list of all the selected operations. The table displays
packet number, Time, Source & Destination, Length, Sequence #, Ack #, Source Port, Destination Port,
Network Interface, and Operation Type.
• Individual files for each selected header (ipfilter.udp, ipfilter.nfs, ipfilter.tcp,
ipfilter.ipx, ipfilter.icmp). The information contained is the same as ipfilter.all.
• A file nfs.rpt that reports on NFS requests and replies. The table contains: Transaction ID #, Type of
Request, Status of Request, Call Packet Number, Time of Call, Size of Call, Reply Packet Number, Time
of Reply, Size of Reply, and Elapsed milliseconds between call and reply.
Adapter statistics
The commands in this section provide output comparable to the netstat -v command. They allow you
to reset adapter statistics (-r) and to get more detailed output (-d) than the netstat -v command output
provides.
# entstat ent0
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902)
Hardware Address: 00:02:55:6a:a5:dc
Elapsed Time: 1 days 18 hours 47 minutes 34 seconds
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 2000
Driver Flags: Up Broadcast Running
Simplex 64BitSupport ChecksumOffload
PrivateSegment LargeSend DataRateSet
no command
Use the no command and its flags to display current network values and to change options.
-a
Prints all options and current values
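For example, to list all options and then change one (the tcp_sendspace value shown is illustrative
only, not a recommendation):
# no -a
# no -o tcp_sendspace=262144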
NFS performance
AIX provides tools and methods for Network File System (NFS) monitoring and tuning on both the server
and the client.
Related tasks
Improving NFS client large file writing performance
Writing large, sequential files over an NFS-mounted file system can cause a severe decrease in the file
transfer rate to the NFS server. In this scenario, you identify whether this situation exists and use the
steps to remedy the problem.
NFS uses Remote Procedure Calls (RPC) to communicate. RPCs are built on top of the External Data
Representation (XDR) protocol, which transforms data to a generic format before transmission, allowing
machines with different architectures to exchange information. The RPC library is a library of
procedures that allows a local (client) process to direct a remote (server) process to execute a procedure
call as if the local (client) process had executed the procedure call in its own address space. Because the
client and server are two separate processes, they no longer have to exist on the same physical system.
The portmap daemon, portmapper, is a network service daemon that provides clients with a standard
way of looking up a port number associated with a specific program. When services on a server are
requested, they register with portmap daemon as an available server. The portmap daemon then
maintains a table of program-to-port pairs.
When the client initiates a request to the server, it first contacts the portmap daemon to see where the
service resides. The portmap daemon listens on a well-known port so the client does not have to look for
it. The portmap daemon responds to the client with the port of the service that the client is requesting.
The client, upon receipt of the port number, is able to make all of its future requests directly to the
application.
The mountd daemon is a server daemon that answers a client request to mount a server's exported file
system or directory. The mountd daemon determines which file system is available by reading the
/etc/xtab file. The mount process takes place as follows:
1. Client mount makes call to server's portmap daemon to find the port number assigned to the mountd
daemon.
TCP requirement
The NFS version 4 protocol mandates the use of a transport protocol that includes congestion control for
better performance in WAN environments.
AIX does not support the use of UDP with NFS version 4.
NFS version 3
NFS version 3 is highly recommended over NFS version 2 due to inherent protocol features that can
enhance performance.
Write throughput
Applications running on client systems may periodically write data to a file, changing the file's contents.
The amount of data an application can write to stable storage on the server over a period of time is
a measurement of the write throughput of a distributed file system. Write throughput is therefore an
important aspect of performance. All distributed file systems, including NFS, must ensure that data is
safely written to the destination file while at the same time minimizing the impact of server latency on
write throughput.
The NFS version 3 protocol offers a better alternative to increasing write throughput by eliminating the
synchronous write requirement of NFS version 2 while retaining the benefits of close-to-open semantics.
The NFS version 3 client significantly reduces the latency of write operations to the server by writing the
data to the server's file cache (main memory), but not necessarily to disk. Subsequently, the NFS
client issues a commit operation request to the server that ensures that the server has written all the data
to stable storage. This feature, referred to as safe asynchronous writes, can vastly reduce the number of
disk I/O requests on the server, thus significantly improving write throughput.
The writes are considered "safe" because status information on the data is maintained, indicating whether
it has been stored successfully. Therefore, if the server crashes before a commit operation, the client will
know by looking at the status indication whether to resubmit a write request when the server comes back
up.
nfsstat command
The nfsstat command displays statistical information about the NFS and the RPC interface to the kernel
for clients and servers.
This command could also be used to re-initialize the counters for these statistics (nfsstat -z). For
performance issues, the RPC statistics (-r option) are the first place to look. The NFS statistics show you
how the applications use NFS.
RPC statistics
The nfsstat command displays statistical information about RPC calls.
The types of statistics displayed are:
• Total number of RPC calls received or rejected
• Total number of RPC calls sent or rejected by a server
• Number of times no RPC packet was available when trying to receive
• Number of packets that were too short or had malformed headers
• Number of times a call had to be transmitted again
• Number of times a reply did not match the call
• Number of times a call timed out
• Number of times a call had to wait on a busy client handle
• Number of times authentication information had to be refreshed
The NFS part of the nfsstat command output is divided into Version 2 and Version 3 statistics of NFS.
The RPC part is divided into Connection oriented (TCP) and Connectionless (UDP) statistics.
nfso command
You can use the nfso command to configure NFS attributes.
The nfso command sets or displays NFS-related options associated with the currently running kernel and
NFS kernel extension.
Note: The nfso command performs no range-checking. If it is used incorrectly, the nfso command can
make your system inoperable.
The nfso parameters and their values can be displayed by using the nfso -a command, as follows:
# nfso -a
portcheck = 0
udpchecksum = 1
nfs_socketsize = 60000
nfs_tcp_socketsize = 60000
nfs_setattr_error = 0
nfs_gather_threshold = 4096
nfs_repeat_messages = 0
nfs_udp_duplicate_cache_size = 5000
nfs_tcp_duplicate_cache_size = 5000
nfs_server_base_priority = 0
nfs_dynamic_retrans = 1
nfs_iopace_pages = 0
nfs_max_connections = 0
nfs_max_threads = 3891
nfs_use_reserved_ports = 0
nfs_device_specific_bufs = 1
nfs_server_clread = 1
nfs_rfc1323 = 1
nfs_max_write_size = 65536
nfs_max_read_size = 65536
nfs_allow_all_signals = 0
nfs_v2_pdts = 1
nfs_v3_pdts = 1
nfs_v2_vm_bufs = 1000
nfs_v3_vm_bufs = 1000
nfs_securenfs_authtimeout = 0
nfs_v3_server_readdirplus = 1
lockd_debug_level = 0
statd_debug_level = 0
statd_max_threads = 50
utf8_validation = 1
nfs_v4_pdts = 1
nfs_v4_vm_bufs = 1000
Most NFS attributes are run-time attributes that can be changed at any time. Load time attributes, such
as nfs_socketsize, need NFS to be stopped first and restarted afterwards. The nfso -L command provides
more detailed information about each of these attributes, including the current value, default value, and
the restrictions regarding when the value changes actually take effect:
# nfso -L
Parameter types:
S = Static: cannot be changed
D = Dynamic: can be freely changed
B = Bosboot: can only be changed using bosboot and reboot
R = Reboot: can only be changed during reboot
C = Connect: changes are only effective for future socket connections
M = Mount: changes are only effective for future mountings
I = Incremental: can only be incremented
Value conventions:
K = Kilo: 2^10 G = Giga: 2^30 P = Peta: 2^50
M = Mega: 2^20 T = Tera: 2^40 E = Exa: 2^60
To display or change a specific parameter, use the nfso -o command. For example:
# nfso -o portcheck
portcheck= 0
# nfso -o portcheck=1
# nfso -d portcheck
# nfso -o portcheck
portcheck= 0
Dropped packets
Although dropped packets are typically first detected on an NFS client, the real challenge is to find out
where they are being lost. Packets can be dropped at the client, the server, or anywhere on the network.
The following example shows the server part of the nfsstat command output specified by the -s option:
# nfsstat -s
Server rpc:
Connection oriented:
calls badcalls nullrecv badlen xdrcall dupchecks dupreqs
Server nfs:
calls badcalls public_v2 public_v3
15835 0 0 0
Version 2: (0 calls)
null getattr setattr root lookup readlink read
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
wrcache write create remove rename link symlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0%
Version 3: (15835 calls)
null getattr setattr lookup access readlink read
7 0% 3033 19% 55 0% 1008 6% 1542 9% 20 0% 9000 56%
write create mkdir symlink mknod remove rmdir
175 1% 185 1% 0 0% 0 0% 0 0% 120 0% 0 0%
rename link readdir readdir+ fsstat fsinfo pathconf
87 0% 0 0% 1 0% 150 0% 348 2% 7 0% 0 0%
commit
97 0%
On a server exporting JFS file systems, the maxperm parameter controls the maximum percentage of
memory occupied by file pages and can be raised to cache as much file data as possible. For example:
# vmo -o maxperm%=100
On a server exporting Enhanced JFS file systems, both the maxclient and maxperm parameters must be
set. The maxclient parameter controls the maximum percentage of memory occupied by client-segment
pages which is where Enhanced JFS file data is cached. Note that the maxclient value cannot exceed the
maxperm value. For example:
# vmo -o maxclient%=100
Under certain conditions, too much file data cached in memory might actually be undesirable. See
“File system performance” on page 214 for an explanation of how you can use a mechanism called
release-behind to flush file data that is not likely to be reused by applications.
The following example shows the client part of the nfsstat command output, specified by the -c option:
# nfsstat -c
Client rpc:
Connection oriented
calls badcalls badxids timeouts newcreds badverfs timers
0 0 0 0 0 0 0
nomem cantconn interrupts
0 0 0
Connectionless
calls badcalls retrans badxids timeouts newcreds badverfs
6553 0 0 0 0 0 0
timers nomem cantsend
0 0 0
Client nfs:
calls badcalls clgets cltoomany
6541 0 0 0
Version 2: (6541 calls)
null getattr setattr root lookup readlink read
0 0% 590 9% 414 6% 0 0% 2308 35% 0 0% 0 0%
wrcache write create remove rename link symlink
0 0% 2482 37% 276 4% 277 4% 147 2% 0 0% 0 0%
mkdir rmdir readdir statfs
6 0% 6 0% 30 0% 5 0%
Version 3: (0 calls)
null getattr setattr lookup access readlink read
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
write create mkdir symlink mknod remove rmdir
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
rename link readdir readdir+ fsstat fsinfo pathconf
0 0% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
commit
0 0%
The nfsstat -m command displays statistics for each NFS-mounted file system on the client, as follows:
# nfsstat -m
/SAVE from /SAVE:aixhost.ibm.com
Flags: vers=2,proto=udp,auth=unix,soft,intr,dynamic,rsize=8192,wsize=8192,retrans=5
Lookups: srtt=27 (67ms), dev=17 (85ms), cur=11 (220ms)
Reads: srtt=16 (40ms), dev=7 (35ms), cur=5 (100ms)
Writes: srtt=42 (105ms), dev=14 (70ms), cur=12 (240ms)
All: srtt=27 (67ms), dev=17 (85ms), cur=11 (220ms)
The numbers in parentheses in the example output are the actual times in milliseconds. The other values
are unscaled values kept by the operating system kernel. You can ignore the unscaled values. Response
times are shown for lookups, reads, writes, and a combination of all of these operations, All. Other
definitions used in this output are as follows:
srtt
Smoothed round-trip time
dev
Estimated deviation
cur
Current backed-off timeout value
Unnecessary retransmits
Related to the hard-versus-soft mount question is the question of the appropriate timeout duration for a
given network configuration.
If the server is heavily loaded, is separated from the client by one or more bridges or gateways, or is
connected to the client by a WAN, the default timeout criterion may be unrealistic. If so, both server and
client are burdened with unnecessary retransmits. For example, if the following command:
# nfsstat -c
reports a significant number, like greater than five percent of the total, of both timeouts and badxids,
you could increase the timeo parameter with the mount command.
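For example, the following mount command raises the timeout to 2 seconds (the host name, export
path, and values are placeholders):
# mount -o timeo=20,retrans=5 server:/export /mnt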
Alternatively, use the SMIT fast path smitty chnfsmnt: identify the directory you want to change, and
enter a new value, in tenths of a second, on the NFS TIMEOUT line.
The default time is 0.7 second, timeo=7, but this value is manipulated in the NFS kernel extension
depending on the type of call. For read calls, for instance, the value is doubled to 1.4 seconds.
To achieve control over the timeo value for operating system version 4 clients, you must set the
nfs_dynamic_retrans option of the nfso command to 0. There are two directions in which you can
change the timeo value (longer or shorter), and the correct choice depends on why the packets are not
arriving in the allotted time.
options=noacl
Set this option as part of the client's /etc/filesystems stanza for that file system.
Configuring CacheFS
CacheFS is not implemented by default or prompted at the time of the creation of an NFS file system. You
must specify explicitly which file systems are to be mounted in the cache.
To specify which file systems are to be mounted in the cache, do the following:
1. Create the local cache file system by using the cfsadmin command:
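The command has this general form (parameters and cache-directory are described below):
# cfsadmin -c -o parameters cache-directory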
where parameters specify the resource parameters and cache-directory is the name of the directory
where the cache should be created.
2. Mount the back file system onto the cache:
# mount -V cachefs -o backfstype=nfs,cachedir=/cache-directory remhost:/rem-directory local-mount-point
where rem-directory is the name of the remote host and file system where the data resides, and
local-mount-point is the mount point on the client where the remote file system should be mounted.
3. Alternatively, you can administer CacheFS using SMIT (use the smitty cachefs fast
path).
Several parameters can be set at creation time, as follows:
maxblocks
Sets the maximum number of blocks that CacheFS is allowed to claim within the front file system.
Default = 90 percent.
minblocks
Sets the minimum number of blocks that CacheFS is allowed to claim within the front file system.
Default = 0 percent.
threshblocks
Sets the number of blocks that must be available in the JFS file system on the client side before
CacheFS can claim more than the blocks specified by minblocks. Default = 85 percent.
maxfiles
Maximum number of files that CacheFS can use, expressed as a percentage of the total number of
i-nodes in the front file system. Default = 90 percent.
minfiles
Minimum number of files that CacheFS is always allowed to use, expressed as a percentage of the
total number of i-nodes in the front file system. Default = 0 percent.
maxfilesize
Largest file size, expressed in megabytes, that CacheFS is allowed to cache. Default = 3.
NFS references
There are many files, commands, daemons, and subroutines associated with NFS.
LPAR performance
This topic provides insights and guidelines for considering, monitoring, and tuning AIX performance in
partitions running on POWER4-based systems.
For more information about partitions and their implementation, see AIX 5L Version 5.3 AIX Installation in
a Partitioned Environment or Hardware Management Console Installation and Operations Guide.
System components
Several system components must work together to implement and support the LPAR environment.
The relationship between processors, firmware, and operating system requires that specific functions
need to be supported by each of these components. Therefore, an LPAR implementation is not based
solely on software, hardware, or firmware, but on the relationship between the three components. The
POWER4 microprocessor supports an enhanced form of system call, known as Hypervisor mode, that
allows a privileged program access to certain hardware facilities. The support also includes protection
for those facilities in the processor. This mode allows the processor to access information about systems
located outside the boundaries of the partition where the processor is located. The Hypervisor does use
a small percentage of the system CPU and memory resources, so comparing a workload running with the
Hypervisor to one running without the Hypervisor will typically show some minor impacts.
A POWER4-based system can be booted in a variety of partition configurations, including the following:
• Dedicated hardware system with no LPAR support running so the Hypervisor is not running. This is
called a Full System Partition.
• Partitions running on the system with the Hypervisor running.
Assigned microprocessors
To view a list of microprocessors that are assigned to an LPAR, select the Managed System (CEC) object
on the HMC and view its properties.
There is a tab that displays the current allocation state of all processors that are assigned to running
partitions. AIX uses the firmware-provided numbers, which allows you to see from within a partition the
processors that are used by looking at the microprocessor numbers and AIX location codes.
You can verify the status of the microprocessors assigned to a two-processor partition from within the
partition. Virtual processor management (the processor folding feature discussed below) is controlled by
the vpm_xvcpus tunable of the schedo command. To display its current value:
# schedo -o vpm_xvcpus
To increase the number of virtual processors in use by 1, you can use the following command:
# schedo -o vpm_xvcpus=1
If the number of virtual processors needed is less than the current number of enabled virtual processors,
a virtual processor is disabled. If the number of virtual processors needed is greater than the current
number of enabled virtual processors, a disabled virtual processor is enabled. Threads that are attached
to a disabled virtual processor are still allowed to run on it.
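The number of virtual processors needed over an interval is computed with the following equation,
where vpm_xvcpus is the number of additional virtual processors to enable (a reconstruction consistent
with the example below):
Number of virtual processors needed = Physical processor utilization + vpm_xvcpus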
Note: You should always round up the value that is calculated from the above equation to the next integer.
The following example describes how to calculate the number of virtual processors to use:
Over the last interval, partition A is utilizing two and a half processors. The vpm_xvcpus tunable is set to
1. Using the above equation:
Number of virtual processors needed = 2.5 + 1 = 3.5
Rounding up the value that was calculated to the next integer equals 4. Therefore, the number of
virtual processors needed on the system is 4. So, if partition A was running with 8 virtual processors,
4 virtual processors are disabled and 4 virtual processors remain enabled. If SMT is enabled, each virtual
processor yields 2 logical processors. So, 8 logical processors are disabled and 8 logical processors are
enabled.
In the following example, a modest workload that is running without the folding feature enabled
consumes a minimal amount of each virtual processor that is allocated to the partition. The following
output from the mpstat -s tool on a system with 4 virtual CPUs, indicates the utilization for the virtual
processor and the two logical processors that are associated with it:
When the folding feature is enabled, the system calculates the number of virtual processors needed with
the equation above. The calculated value is then used to decrease the number of virtual processors to
what is needed to run the modest workload without degrading performance. The following output from
the mpstat -s tool on a system with 4 virtual CPUs, indicates the utilization for the virtual processor and
the two logical processors that are associated with it:
As you can see from the data above, the workload benefits from a decrease in utilization and maintenance
of ancillary processors, and increased affinity when the work is concentrated on one virtual processor.
When the workload is heavy however, the folding feature does not interfere with the ability to use all the
virtual CPUs, if needed.
Application considerations
You should be aware of several things concerning the applications associated with an LPAR.
Generally, an application is not aware that it is running in an LPAR. There are some slight differences of
which the system administrator is aware, but these are masked from the application. Apart from these
considerations, AIX runs inside a partition the same way it runs on a standalone server. No differences
are observed either from the application's or the administrator's point of view. The following example
shows the uname command output in full system partition mode:
> uname -L
-1 NULL
The "-1" indicates that the system is not running with any logical partitions, but is running in full system
partition mode.
The following example demonstrates how the uname command provides the partition number and the
partition name as managed by the HMC:
> uname -L
3 Web Server
Knowing that the application is running in an LPAR can be helpful when you are assessing slight
performance differences.
Virtual console
There is no physical console on each partition.
While the physical serial ports can be assigned to the partitions, they can only be in one partition at a
time. For diagnostic purposes, and to provide an output for console messages, the firmware implements a
virtual tty that is seen by AIX as a standard tty device. The virtual tty output is streamed to the HMC. The
AIX diagnostics subsystem uses the virtual tty as the system console. From a performance perspective, if
a lot of data is being written to the system console, which is being monitored on the HMC's console, the
connection to the HMC is limited by the serial cable connection.
Time-of-Day clock
Each partition has its own Time-of-Day clock values so that partitions can work with different time zones.
The only way partitions can communicate with each other is through standard network connectivity. When
looking at traces or time-stamped information from each of the partitions on a system, each time stamp
will be different according to how the partition was configured.
Memory considerations
There are several issues to consider when dealing with memory.
Partitions are defined with a "must have", a "desired", and a "minimum" amount of memory. When you are
assessing changing performance conditions across system reboots, it is important to know that memory
and CPU allocations might change based on the availability of the underlying resources. Also, remember
that the amount of memory allocated to the partition from the HMC is the total amount allocated. Within
the partition itself, some of that physical memory is used for hypervisor-page-table-translation support.
Memory is allocated by the system across the system. Applications in partitions cannot determine where
memory has been physically allocated.
Micro-Partitioning
Logical partitions allow you to run multiple operating systems on the same system without interference.
In earlier versions of AIX, you were not able to share processors among different partitions. Later
releases support shared processor partitions, or SPLPAR, also known as Micro-Partitioning.
Micro-Partitioning facts
Micro-Partitioning maps virtual processors to physical processors and the virtual processors are assigned
to the partitions instead of the physical processors.
You can use the Hypervisor to specify what percentage of processor usage to grant to the shared
partitions, which is defined as an entitlement. The minimum processor entitlement is ten percent.
You can realize the following advantages with Micro-Partitioning:
• Optimal resource utilization
• Rapid deployment of new servers
• Application isolation
Micro-Partitioning is available on IBM Power Systems servers. It is possible to run a variety of partitions
with varying levels of operating systems, but you can only use Micro-Partitioning on partitions running AIX
Version 6.1 or later.
With IBM Power Systems servers, you can choose the following types of partitions from the Hardware
Management Console, or HMC:
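• Dedicated processor partition
• Shared processor partition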
Implementation of Micro-Partitioning
As with LPAR, you can define the partitions in Micro-Partitioning with the HMC.
The following table lists the different types of processors you can use with Micro-Partitioning:
Physical processor
A physical processor is the actual hardware resource. It represents the number of unique processor
cores, not the number of processor chips; each chip contains two processor cores. The maximum
number of physical processors is 64 on POWER5-based systems.
When you create a partition, you must choose whether you want to create a shared processor partition or
a dedicated processor partition. It is not possible to have both shared and dedicated processors in one
partition. To enable the sharing of processors, you must configure the following options:
• The processor sharing mode: Capped or Uncapped
• The processing capacity: Weight
• The number of virtual processors: Desired, Minimum, and Maximum
Note: Capped mode means that the processing capacity never exceeds the assigned capacity and
uncapped mode means that the processing capacity can be exceeded when the shared processing pool
has available resources.
Note: The processing capacity is specified in terms of processing units that are measured in fractions of
0.01 of a processor. So for example, to assign a processing capacity for a half of a processor, you must
specify 0.50 processing units on the HMC.
Overview
Active Memory Expansion (AME) relies on compression of in-memory data to increase the amount of data
that can be placed into memory and thus expand the effective memory capacity of an IBM Power Systems
processor-based server. The in-memory data compression is managed by the operating system, and
this compression is transparent to applications and users. AME is configurable on a per-logical partition
(LPAR) basis. Thus, AME can be selectively enabled for one or more LPARs on a system.
When Active Memory Expansion is enabled for an LPAR, the operating system compresses a portion of the
LPAR's memory and leaves the remaining portion of memory uncompressed. This results in the memory
being effectively broken up into two pools. They are:
• Compressed pool
• Uncompressed pool
Note: When AME is enabled, by default the AIX operating system selects the optimal page size mode
based on the system configuration. You can use the vmo command with the ame_mpsize_support
parameter to manually enable or disable the page size of 64 KB in an AME environment. The minimum
requirements to use the page size of 64 KB in an AME environment are a firmware level of FW860 and
AME accelerator support. AME supports enablement of 64 KB pages by default starting with Power10
processor-based servers.
For example, using a memory expansion factor of 2.0 for an LPAR indicates that memory compression
must be used to double the LPAR's memory capacity. If an LPAR is configured with a memory expansion
factor of 2.0 and a memory size of 20 GB, then the expanded memory size for the LPAR is 40 GB.
40 GB = 20 GB * 2.0
The operating system compresses enough in-memory data to fit 40 GB of data into 20 GB of memory. The
memory expansion factor and the expanded memory size can be dynamically changed at runtime by using
the Hardware Management Console (HMC) through dynamic LPAR operations. The expanded memory size
is always rounded down to the nearest logical memory block (LMB) multiple.
Memory Deficit
When configuring the memory expansion factor for an LPAR, it is possible that a memory expansion factor
might be chosen that is too large and cannot be achieved based on the compressibility of the workload.
When the memory expansion factor for an LPAR is too large, then a memory expansion deficit forms,
indicating that the LPAR cannot achieve its memory expansion factor target. For example, an LPAR might
be configured with a memory size of 20 GB and a memory expansion factor of 1.5, resulting in a total
target expanded memory size of 30 GB. However, the workload running in the LPAR does not compress
well, and the workload's data only compresses by a ratio of 1.4 to 1. In this case, it is impossible for
the workload to achieve the targeted memory expansion factor of 1.5. The operating system limits the
amount of physical memory that can be used in a compressed pool up to a maximum of 95%. This
value can be adjusted by using the vmo command with the ame_min_ucpool_size parameter. In the
preceding example with the LPAR memory size as 20 GB, if the ame_min_ucpool_size parameter value
is set to 90, 18 GB will be reserved for compressed pool. The maximum achievable expanded memory
size would be 27.2 GB (2 GB + 1.4 x 18 GB) . The result is a 2.8 GB shortfall. This shortfall is referred to as
the memory deficit.
The effect of a memory deficit is the same as the effect of configuring an LPAR with too little memory.
When a memory deficit occurs, the operating system cannot achieve the expanded memory target
configured for the LPAR, and the operating system might have to resort to paging out virtual memory
pages to paging space. Thus, in the above mentioned example, if the workload used more than 27.2 GB
of memory, the operating system would start paging out virtual memory pages to paging space. To get
an indication of whether a workload can achieve its expanded memory size, the operating system reports
a memory deficit metric. This is a “hole” in the expanded memory size that cannot be achieved. If this
deficit is zero, the target memory expansion factor can be achieved, and the LPAR's memory expansion
factor is configured correctly. If the expanded memory deficit metric is non-zero, then the workload falls
short of achieving its expanded memory size by the size of the deficit. To eliminate a memory deficit,
the LPAR's memory expansion factor should be reduced. However, reducing the memory expansion factor
reduces the LPAR's expanded memory size. Thus to keep the LPAR's expanded memory size the same,
the memory expansion factor must be reduced and more memory must be added to the LPAR. Both the
LPAR's memory size and memory expansion factor can be changed dynamically.
(Figure: memory deficit example. The 20 GB of true memory is split into a 2 GB uncompressed pool and
an 18 GB compressed pool. At a 1.4X compression ratio, the compressed pool holds 25.2 GB of
compressed memory, so the achievable expanded memory size is 27.2 GB against the 30 GB target,
leaving a memory deficit of 2.8 GB.)
Planning Considerations
Before deploying a workload in the Active Memory Expansion (AME) environment, some initial planning
is required to ensure that a workload gets the maximum benefit from AME. The benefit of AME to
a workload varies based on the workload's characteristics. Some workloads can get a higher level of
memory expansion than other workloads. The Active Memory Expansion Planning and Advisory Tool
amepat assists in planning the deployment of a workload in the Active Memory Expansion environment
and provides guidance on the level of memory expansion a workload can achieve.
# amepat 2 5
System Configuration:
---------------------
Partition Name : aixfvt19
Processor Implementation Mode : POWER5
Number Of Logical CPUs : 8
Processor Entitled Capacity : 4.00
Processor Max. Capacity : 4.00
True Memory : 4.25 GB
SMT Threads : 2
Shared Processor Mode : Disabled
Active Memory Sharing : Disabled
Active Memory Expansion : Disabled
The modeled Active Memory Expansion CPU usage reported by amepat is just an
estimate. The actual CPU usage used for Active Memory Expansion may be lower
or higher depending on the workload.
Command Information
This section provides details about the arguments passed to amepat, the time of invocation, the total
time the system is monitored, and the number of samples collected. In this report, amepat
is invoked for 10 minutes with an interval of 2 minutes and 5 samples.
Note: It can be observed that the Total Monitored time displayed is 10 minutes and 58 seconds. The extra
58 seconds is used for gathering system statistics required for the Active Memory Expansion Modeling.
System Configuration
This section provides configuration information such as host name, processor architecture, CPUs,
entitlement, true memory size, SMT state, processor, and memory mode. In the report above, the disabled
status of Active Memory Expansion indicates that amepat is invoked in an AME-disabled partition.
Note: The amepat tool can also be invoked in an AME-enabled partition to monitor and fine-tune an
active AME configuration. In an AME-enabled partition, the System Configuration Section also displays the
target expanded memory size and the target memory expansion factor.
This section of the report shows the AME CPU usage, compressed memory, compression ratio, and
deficit memory. The deficit memory is reported only if there is a memory deficit in achieving the
expanded memory size; otherwise, the tool does not report this information. For example, in the above
report, it can be observed that there is an average memory deficit of 562 MB, which is 55% of the target
expanded memory size.
Here the original true memory size of 4.25 GB can be achieved with 2.75 GB physical memory size and an
expansion factor of 1.55. This configuration may result in the CPU usage increasing by
0.39 physical processors (10% of maximum capacity).
For more information on these and other uses of the AME Planning tool, please refer to the amepat man
page.
vmstat command
The vmstat command can be used with its -c option to display AME statistics.
# vmstat -c 2 1
cpu
us sy id wa pc ec
2 3 89 7 0.02 5.3
In the output above, the following memory compression statistics are provided:
• Expanded memory size mem of the LPAR is 1024 MB.
• True memory size tmem of the LPAR is 512 MB.
• The memory mode mmode of the LPAR is Active Memory Sharing disabled and Active Memory
Expansion enabled.
• Compressed Pool size csz is 43332 4K- pages.
• Amount of free memory cfr in the compressed pool is 943 4K- pages.
• Size of expanded memory deficit dxm is 26267 4K- pages.
• Number of compression operations or page-outs to the compressed pool per second co is 386.
• Number of decompression operations or page-ins from the compressed pool per second ci is 174.
# lparstat -c 2 5
%user %sys %wait %idle physc %entc lbusy app vcsw phint %xcpu dxm
----- ----- ------ ------ ----- ----- ------ --- ----- ----- ------ ------
45.6 51.3 0.2 2.8 0.95 236.5 62.6 11.82 7024 2 5.8 165
46.1 50.9 0.1 2.8 0.98 243.8 64.5 11.80 7088 7 6.0 162
46.8 50.9 0.3 2.1 0.96 241.1 69.6 11.30 5413 6 19.4 163
49.1 50.7 0.0 0.3 0.99 247.3 60.8 10.82 636 4 8.6 152
49.3 50.5 0.0 0.3 1.00 248.9 56.7 11.47 659 1 0.3 153
topas command
The topas main panel in an LPAR with Active Memory Expansion enabled
displays memory compression statistics automatically under the sub-section AME.
In the above output, the following memory compression statistics are provided.
• True memory size TMEM,MB of the LPAR is 512 MB.
• Compressed pool size CMEM,MB of the LPAR is 114 MB.
• EF[T/A] – Target Expansion Factor is 2.0 and Achieved Expansion Factor is 1.5.
• Rate of compressions co and decompressions ci per second are 0.0 and 5.5 pages respectively.
# svmon -G -O summary=ame,pgsz=on,unit=MB
Unit: MB
-------------------------------------------------------------------------------
size inuse free pin virtual available mmode
memory 1024.00 607.54 144.11 306.29 559.75 136.61 Ded-E
ucomprsd - 387.55 -
comprsd - 219.98 -
pg space 512.00 5.08
In the output above, the following memory compression statistics are provided:
• Memory mode mmode of the LPAR is Active Memory Sharing disabled and AME enabled.
• Out of a total of 607.54 MB in use memory_inuse, uncompressed pages ucomprsd_inuse constitute
387.55 MB and compressed pages comprsd_inuse constitute the remaining 219.98 MB
• Out of a total of 534.12 MB working pages in use inuse_work, uncompressed pages ucomprsd_work
constitute 314.13 MB and compressed pages comprsd_work constitute 219.98 MB.
• Out of a total of 543.54 MB of in use pages 4KB_inuse in 4K-PageSize Pool, uncompressed pages
4KB_ucomprsd constitute 323.55 MB.
• Expanded memory size memory_size of the LPAR is 1024 MB.
• True memory size True Memory of the LPAR is 512 MB.
• Current size of the uncompressed pool ucomprsd_CurSz is 405.93 MB (79.28% of the total true
memory size of the LPAR).
• Current size of the compressed pool comprsd_CurSz is 106.07 MB (20.72% of the total true memory
size of the LPAR).
• The target size of the compressed pool comprsd_TgtSz needed to achieve the target memory expansion
factor txf of 2.00 is 343.62 MB (67.11% of the total true memory size of the LPAR).
• The size of the uncompressed pool ucomprsd_TgtSz in that case becomes 168.38 MB (32.89% of the
total true memory size of the LPAR).
• The maximum size of the compressed pool comprsd_MaxSz is 159.59 MB (31.17% of the total true
memory size of the LPAR).
• The current compression ratio CRatio is 2.51 and the current expansion factor cxf achieved is 1.46.
• The amount of expanded memory deficit dxm is 274.21 MB and the deficit expansion factor dxf is 0.54.
The -O summary=longame option provides a summary of memory compression details as follows:
# svmon -G -O summary=longame,unit=MB
Unit: MB
Active Memory Expansion
--------------------------------------------------------------------
Size Inuse Free DXMSz UCMInuse CMInuse TMSz TMFr
1024.00 607.91 142.82 274.96 388.56 219.35 512.00 17.4
Application Tuning
Before spending a lot of effort to improve the performance of a program, use the techniques in this
section to help determine how much its performance can be improved and to find the areas of the
program where optimization and tuning will have the most benefit.
In general, the optimization process involves several steps:
• Some tuning involves changing the source code, for example, by reordering statements and
expressions. This technique is known as hand tuning.
• For FORTRAN and C programs, optimizing preprocessors are available to tune and otherwise transform
source code before it is compiled. The output of these preprocessors is FORTRAN or C source code that
has been optimized.
• The FORTRAN or C++ compiler translates the source code into an intermediate language.
• A code generator translates the intermediate code into machine language. The code generator can
optimize the final executable code to speed it up, depending on the selected compiler options. You can
increase the amount of optimization performed in this step by hand-tuning or preprocessing first.
The speed increase is affected by two factors:
• The amount of optimization applied to individual parts of the program
• The frequency of use for those parts of the program at run time
Speeding up a single routine might speed up the program significantly if that routine performs the
majority of the work. On the other hand, it might not improve overall performance much if the routine is
rarely called and does not take long anyway. Keep this point in mind when evaluating the performance
techniques and data, so that you focus on the techniques that are most valuable in your work.
For an extensive discussion of these techniques, see Optimization and Tuning Guide for XL Fortran, XL C
and XL C++. Also see “Efficient Program Design and Implementation” on page 79 for additional hints and
tips.
Recommendations
Follow these guidelines for optimization:
• Use -O2 or -O3 -qstrict for any production-level FORTRAN, C, or C++ program you compile. For High
Performance FORTRAN (HPF) programs, do not use the -qstrict option.
• Use the -qhot option for programs where the hot spots are loops or array language. Always use the
-qhot option for HPF programs.
• Use the -qipa option near the end of the development cycle if compilation time is not a major
consideration.
The -qipa option activates or customizes a class of optimizations known as interprocedural analysis. The
-qipa option has several suboptions that are detailed in the compiler manual. It can be used in two ways:
• The first method is to compile with the -qipa option during both the compile and link steps. During
compilation, the compiler stores interprocedural analysis information in the .o file. During linking, the
-qipa option causes a complete recompilation of the entire application.
• The second method is to compile the program for profiling with the -p/-pg option (with or without
-qipa), and run it on a typical set of data. The resulting data can then be fed into subsequent
compilations with -qipa so that the compiler concentrates optimization in the sections of the program
that are most frequently used.
Using -O4 is equivalent to using -O3 -qipa with automatic generation of architecture and tuning options
ideal for that platform. Using the -O5 flag is similar to -O4, except that -qipa=level=2.
You gain the following benefits when you use compiler optimization:
Branch optimization
Rearranges the program code to minimize branching logic and to combine physically separate blocks
of code.
Code motion
If variables used in a computation within a loop are not altered within the loop, the calculation can be
performed outside of the loop and the results used within the loop.
Common subexpression elimination
In common expressions, the same value is recalculated in a subsequent expression. The duplicate
expression can be eliminated by using the previous value.
Constant propagation
Constants used in an expression are combined, and new ones are generated. Some implicit
conversions between integers and floating-point types are done.
Dead code elimination
Eliminates code that cannot be reached or where the results are not subsequently used.
Dead store elimination
Eliminates stores when the value stored is never referenced again. For example, if two stores to the
same location have no intervening load, the first store is unnecessary and is removed.
Recommendations
Follow these guidelines for compiling for specific hardware platforms:
• If your program will be run only on a single system, or on a group of systems with the same processor
type, use the -qarch option to specify the processor type.
• If your program will be run on systems with different processor types, and you can identify one
processor type as the most important, use the appropriate -qarch and -qtune settings. XL FORTRAN
and XL HPF users can use the xxlf and xxlhpf commands to select these settings interactively.
Recommendations
Follow these guidelines:
• For single-precision programs on POWER family and POWER2 platforms, you can enhance performance
while preserving accuracy by using these floating-point options:
-qfloat=fltint:rsqrt:hssngl
If your single-precision program is not memory-intensive (for example, if it does not access more data
than the available cache space), you can obtain equal or better performance, and greater precision, by
using:
-qfloat=fltint:rsqrt -qautodbl=dblpad4
For programs that do not contain single-precision variables, use -qfloat=rsqrt:fltint only. Note that -O3
without -qstrict automatically sets -qfloat=rsqrt:fltint.
• Single-precision programs are generally more efficient than double-precision programs, so promoting
default REAL values to REAL(8) can reduce performance. Use the following -qfloat suboptions:
This option forces the linker to place the library procedures your program references into the program's
object file. The /lib/syscalls.exp file contains the names of system routines that must be imported
to your program from the system. This file must be specified for static linking. The routines that it names
are imported automatically by libc.a for dynamic linking, so you do not need to specify this file during
dynamic linking. For further details on these options, see “Efficient use of the ld command” on page 375
and the ld command.
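As a sketch of the static-linking form this paragraph describes (the program name xyz is a placeholder):
# cc xyz.c -o xyz -bnso -bI:/lib/syscalls.exp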
In the following example, matrix multiplication loops of the form:
do i=1,control
  do j=1,control
    xmult=0.d0
    do k=1,control
      xmult=xmult+a(i,k)*a(k,j)
    end do
    b(i,j)=xmult
  end do
end do
were replaced by the following line of FORTRAN that calls a BLAS routine:
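A call of the following form, using the standard Level 3 BLAS DGEMM interface, performs the same
computation (a sketch, assuming square arrays with leading dimension control):
call dgemm('n','n',control,control,control,1.d0,a,control,a,control,0.d0,b,control)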
This example demonstrates how a program using matrix multiplication operations could better use a
level 3 BLAS routine for enhanced performance. Note that the improvement increases as the array size
increases.
Code-optimization techniques
The degradation from inefficient use of memory is much greater than that from inefficient use of the
caches, because the difference in speed between memory and disk is much higher than between cache
and memory.
Code-optimization techniques include the following:
Mapped files
The use of mapped files is a code-optimization technique.
Applications can use the shmat() or mmap() system calls to access files by address, instead of using
multiple read and write system calls. Because there is always overhead associated with system calls,
the fewer calls used, the better. The shmat() or mmap() calls can enhance performance up to 50 times
compared with traditional read() or write() system calls. To use the shmat() subroutine, a file is opened
and a file descriptor (fd) returned, just as if read or write system calls are being used. A shmat() call then
returns the address of the mapped file. Setting elements equal to subsequent addresses in a file, instead
of using multiple read system calls, does read from a file to a matrix.
The mmap() call allows mapping of memory to cross segment boundaries. A user can have more than
10 areas mapped into memory. The mmap() functions provide page-level protection for areas of memory.
Individual pages can have their own read or write, or they can have no-access permissions set. The
mmap() call allows the mapping of only one page of a file.
The shmat() call also allows mapping of more than one segment, when a file being mapped is greater
than a segment.
The following example program reads from a file using read statements:
fd = open("myfile", O_RDONLY);
for (i=0;i<cols;i++) {
    for (j=0;j<rows;j++) {
        read(fd,&n,sizeof(char));  /* one read() system call for every byte read */
        *p++ = n;
    }
}
Using the shmat() subroutine, the same result is accomplished without read statements:
fd = open("myfile", O_RDONLY);
nptr = (signed char *) shmat(fd,0,SHM_MAP | SHM_RDONLY);  /* map the file; no further system calls needed */
for (i=0;i<cols;i++) {
    for (j=0;j<rows;j++) {
        *p++ = *nptr++;  /* copy a byte directly from the mapped file */
    }
}
The only drawback to using explicitly mapped files concerns writes. The system write-behind feature,
which periodically writes modified pages to a file in an orderly fashion using sequential blocks, does not
apply when an application uses the shmat() or mmap() subroutine. Modified pages can collect in memory
and are written to disk, in random order, only when the Virtual Memory Manager (VMM) needs the space.
This situation often results in many small writes to the disk, causing inefficiencies in CPU and disk usage.
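If orderly write-back matters, an application can schedule it explicitly. The following is a minimal sketch,
not taken from this manual, that assumes addr and len describe a region already mapped with the
mmap() subroutine and flushes it with the standard msync() subroutine:

#include <sys/types.h>
#include <sys/mman.h>

/* Sketch: push modified pages of a mapped region out to the file in one
 * orderly operation instead of waiting for the VMM to need the frames. */
int flush_mapping(void *addr, size_t len)
{
        return msync(addr, len, MS_SYNC);    /* synchronous write-back */
}

Calling such a routine periodically approximates the sequential, batched behavior that the write-behind
feature would otherwise provide.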
Advantages of Java
Java has significant advantages over other languages and environments that make it suitable for just
about any programming task.
The advantages of Java are as follows:
• Java is easy to learn.
Java was designed to be easy to use and is therefore easier to write, compile, debug, and learn than other
programming languages.
• Java is object-oriented.
This allows you to create modular programs and reusable code.
• Java is platform-independent.
One of the most significant advantages of Java is its ability to move easily from one computer system
to another. The ability to run the same program on many different systems is crucial to World Wide Web
software, and Java succeeds at this by being platform-independent at both the source and binary levels.
Because of Java's robustness, ease of use, cross-platform capabilities and security features, it has
become a language of choice for providing worldwide Internet solutions.
The trace command generates statistics on user processes and kernel subsystems. The binary
information is written to two alternate buffers in memory. The trace process then transfers the
information to the trace log file on disk. This file grows very rapidly. The trace program runs as a
process which may be monitored by the ps command. The trace command acts as a daemon, similar to
accounting.
The following figure illustrates the implementation of the trace facility.
Monitoring facilities use system resources. Ideally, the overhead should be low enough that it does not
significantly affect system execution. When the trace program is active, the CPU overhead is less than 2
percent. When the trace data fills the buffers and must be written to the log, additional CPU is required for
file I/O. Usually this is less than 5 percent. Because the trace program claims and pins buffer space, if
the environment is memory-constrained, this might be significant. Be aware that the trace log and report
files can become very large.
captures the execution of the cp command. We have used two features of the trace command. The -k
"20e,20f" option suppresses the collection of events from the lockl() and unlockl() functions. These calls
are numerous on uniprocessor systems, but not on SMP systems, and add volume to the report without
giving us additional information. The -o trc_raw option causes the raw trace output file to be written in
our local directory.
This reports both the fully qualified name of the file that is run and the process ID that is assigned to it.
The report file shows us that there are numerous VMM page assign and delete events in the trace, like the
following sequence:
We are not interested in this level of VMM activity detail at the moment, so we reformat the trace as
follows:
The -k "1b0,1b1" option suppresses the unwanted VMM events in the formatted output. It saves us
from having to retrace the workload to suppress unwanted events. We could have used the -k function
of the trcrpt command instead of that of the trace command to suppress the lockl() and unlockl()
events, if we had believed that we might need to look at the lock activity at some point. If we had
been interested in only a small set of events, we could have specified -d "hookid1,hookid2" to produce
a report with only those events. Because the hook ID is the leftmost column of the report, you can
quickly compile a list of hooks to include or exclude. A comprehensive list of trace hook IDs is defined in
the /usr/include/sys/trchkid.h file.
The body of the report, if displayed in a small enough font, looks similar to the following:
ID PROCESS NAME PID ELAPSED_SEC DELTA_MSEC APPL SYSCALL KERNEL INTERRUPT
101 ksh 8526 0.005833472 0.107008 kfork LR = D0040AF8
101 ksh 7214 0.012820224 0.031744 execve LR = 10015390
134 cp 7214 0.014451456 0.030464 exec: cmd=cp ../bin/track /tmp/junk pid=7214
tid=24713
You should be able to see that event ID 15b is the OPEN SYSTEM CALL event. Now, process the data from
the copy example as follows:
The report is written to standard output, and you can determine the number of open() subroutines that
occurred. If you want to see only the open() subroutines that were performed by the cp process, run the
report command again using the following:
By specifying the -d (defer tracing until the trcon subcommand is entered) option, you can limit how
much tracing is done on the trace command itself. If the -d option is not specified, then tracing begins
immediately and can log events for the trace command initializing its own memory buffers. Typically, we
want to trace everything but the trace command itself.
By default, the kernel buffer size (-T option) can be at most one half of the log buffer size (-L option). If
you use the -f flag, the buffer sizes can be the same.
The -n option is useful if there are kernel extension system calls that need to be traced.
where args is the options list that you would have entered for the trace command. By default, the
system trace (channel 0) is started. If you want to start a generic trace, include a -g option in the args
string. On successful completion, the trcstart() subroutine returns the channel ID. For generic tracing, this
channel ID can be used to record to the private generic channel.
When compiling a program using this subroutine, the link to the librts.a library must be specifically
requested (use -l rts as a compile option).
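As an illustration of this interface, the following minimal sketch (not from this manual; do_work() is a
hypothetical stand-in for the application code) starts a generic trace with the trcstart() subroutine and
stops it with the trcstop() subroutine, both from librts.a:

#include <stdio.h>
#include <stdlib.h>

extern int trcstart(char *args);   /* in librts.a; compile with -l rts */
extern int trcstop(int channel);   /* assumed counterpart in librts.a */
extern void do_work(void);         /* hypothetical application code */

int main(void)
{
        int chan;

        chan = trcstart("-g");     /* -g requests a generic channel */
        if (chan < 0) {
                perror("trcstart");
                exit(1);
        }
        do_work();                 /* activity recorded on channel 'chan' */
        trcstop(chan);
        return 0;
}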
# smitty trcrpt
# trcrpt -o newfile
If you want to use the trace file for other performance tools such as tprof, pprof, netpmon, and
filemon, run the gennames Gennames_File command.
That file is then used with the -n flag of the trcrpt command, as follows:
If -n is not specified, then the trcrpt command generates a symbol table from the system on which the
trcrpt command is run.
Additionally, a copy of the /etc/trcfmt file from the system being traced might be beneficial because
that system may have different or more trace format stanzas than the system where the trcrpt
command is being run. The trcrpt command can use the -t flag to specify the trace format file (by
default it uses the /etc/trcfmt file from the system where the trcrpt command is being run). For
example:
This trace.tr file can then be used as input for other commands (it will include the trace data from each
CPU). The reason for the -C flag on trace is so that the trace can keep up with each CPU's activities on
those systems which have many CPUs (more than 12, for example). Another reason is that the buffer size
for the trace buffers is per CPU when you use the -C all flag.
An event record should be as short as possible. Many system events use only the hook word and time
stamp. A long format allows the user to record a variable length of data. In this long form, the 16-bit data
field of the hook word is converted to a length field that describes the length of the event record.
Trace channels
The trace facility can accommodate up to eight simultaneous channels of trace-hook activity, which are
numbered 0-7.
Channel 0 is always used for system events, but application events can also use it. The other seven
channels, called generic channels, can be used for tracing application-program activity.
When trace is started, channel 0 is used by default. A trace -n channel_number command starts trace to
a generic channel. Use of the generic channels has some limitations:
• The interface to the generic channels costs more CPU time than the interface to channel 0 because of
the need to distinguish between channels and because generic channels record variable-length records.
• Events recorded on channel 0 and on the generic channels can be correlated only by time stamp, not by
sequence, so there may be situations in which it is not possible to determine which event occurred first.
TRCHKL0T(hw)
TRCHKL1T(hw,D1)
TRCHKL2T(hw,D1,D2)
TRCHKL3T(hw,D1,D2,D3)
TRCHKL4T(hw,D1,D2,D3,D4)
TRCHKL5T(hw,D1,D2,D3,D4,D5)
TRCHKL0(hw)
TRCHKL1(hw,D1)
TRCHKL2(hw,D1,D2)
TRCHKL3(hw,D1,D2,D3)
TRCHKL4(hw,D1,D2,D3,D4)
TRCHKL5(hw,D1,D2,D3,D4,D5)
All trace events are time stamped regardless of the macros used.
The type field of the trace event record is set to the value that corresponds to the macro used, regardless
of the value of those four bits in the hw parameter.
Only two macros record events to one of the generic channels (1-7). These are as follows:
TRCGEN(ch,hw,D1,len,buf)
TRCGENT(ch,hw,D1,len,buf)
These macros record in the event stream specified by the channel parameter (ch), a hook word (hw), a
data word (D1) and len bytes from the user's data segment beginning at the location specified by buf.
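For example, a minimal sketch (not from this manual) that wraps the time-stamped variant, assuming
chan holds a generic channel ID returned by the trcstart() subroutine:

#include <sys/trcmacros.h>
#include <sys/trchkid.h>

/* Sketch: record 'len' bytes of user data on generic channel 'chan',
 * time stamped, under the user hook word HKWD_USER1. */
void log_event(int chan, char *buf, int len)
{
        TRCGENT(chan, HKWD_USER1, 0, len, buf);
}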
#include <sys/trcctl.h>
#include <sys/trcmacros.h>
#include <sys/trchkid.h>
char *ctl_file = "/dev/systrctl";
int ctlfd;
int i;
main()
{
        printf("configuring trace collection \n");
        if (trcstart("-a")) {          /* trace to the default log file */
                perror("trcstart");
                exit(1);
        }
        for(i=1;i<11;i++){
                TRCHKL1T(HKWD_USER1,i);
        }
        exit(0);
}
When you compile the sample program, you must link to the librts.a library as follows:
HKWD_USER1 is event ID 010 hexadecimal (you can verify this by examining the /usr/include/sys/
trchkid.h file). The report facility does not format the HKWD_USER1 event, unless rules are provided in
the trace format file. The following example of a stanza for HKWD_USER1 could be used:
When you enter the example stanza, do not modify the master format file /etc/trcfmt, but instead
make a copy and keep it in your own directory (assume you name it mytrcfmt). When you run the sample
program, the raw event data is captured in the default log file because no other log file was specified
to the trcstart() subroutine. You can filter the output report to get only your events. To do this, run the
trcrpt command as follows:
Command
Function
bindprocessor
Binds or unbinds the kernel threads of a process to a processor
caccelstat
Reports statistics that are related to coherent accelerators for the entire system, or for each
accelerator and process.
chdev
Changes the characteristics of a device
chlv
Changes only the characteristics of a logical volume
chps
Changes attributes of a paging space
fdpr
A performance tuning utility for improving execution time and real memory utilization of user-level
application programs
ifconfig
Configures or displays network interface parameters for a network using TCP/IP
ioo
Sets I/O related tuning parameters (along with vmo, replaces the vmtune command)
migratepv
Moves allocated physical partitions from one physical volume to one or more other physical volumes
mkps
Adds an additional paging space to the system
nfso
Configures Network File System (NFS) network variables
nice
Runs a command at a lower or higher priority
no
Configures network attributes
Performance-related subroutines
AIX supports several subroutines that can be used in monitoring and tuning performance.
bindprocessor()
Binds kernel threads to a processor
getpri()
Determines the scheduling priority of a running process
getpriority()
Determines the nice value of a running process
getrusage()
Retrieves information about the use of system resources
nice()
Increments the nice value of the current process
psdanger()
Retrieves information about paging space use
setpri()
Changes the priority of a running process to a fixed priority
setpriority()
Sets the nice value of a running process
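As a simple illustration (a sketch, not from this manual), the getrusage() subroutine can be called as
follows to sample the CPU time the calling process has consumed:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rusage ru;

        /* RUSAGE_SELF reports resource use of the calling process */
        if (getrusage(RUSAGE_SELF, &ru) == 0)
                printf("user %ld.%06ld s, system %ld.%06ld s\n",
                       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                       (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        return 0;
}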
# ld -r libfoo.a -o libfooa.o
Notice that the prebound library is treated as another ordinary input file, not with the usual library
identification syntax (-lfoo).
3. To recompile the module and rebind the executable program after fixing a bug, use the following:
4. However, if the bug fix had resulted in a call to a different subroutine in the library, the bind would fail.
The following Korn shell script tests for a failure return code and recovers:
#!/usr/bin/ksh
# Shell script for source file replacement bind
#
xlf something.f a.out
rc=$?
if [ "$rc" != 0 ]
then
        echo "New function added ... using libfooa.o"
        xlf something.o libfooa.o
fi
#include <stdio.h>
#include <sys/time.h>
int main(void) {
timebasestruct_t start, finish;
int val = 3;
int w1, w2;
double time;
To minimize the overhead of calling and returning from the timer routines, you can experiment with
binding the benchmark nonshared (see “When to use dynamic linking and static linking ” on page 348).
If this were a real performance benchmark, the code would be measured repeatedly. If a number of
consecutive repetitions were timed collectively and an average time for the operation calculated, the
result might include interrupt handling or other extraneous activity. If a number of repetitions were
timed individually, the individual times could be inspected for reasonableness, but the overhead of
the timing routines would be included in each measurement. It may be desirable to use both techniques
and compare the results. In any case, you would want to consider the purpose of the measurements in
choosing the method.
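The following is a minimal sketch of the collective-timing technique (not from this manual; it assumes the
standard AIX read_real_time() and time_base_to_time() services used with the timebasestruct_t fragment
above, and do_operation() is a hypothetical routine under test):

#include <stdio.h>
#include <sys/time.h>

#define NREPS 1000                      /* repetitions timed collectively */

extern void do_operation(void);         /* hypothetical code under test */

int main(void)
{
        timebasestruct_t start, finish;
        int i, secs, nsecs;

        read_real_time(&start, TIMEBASE_SZ);
        for (i = 0; i < NREPS; i++)
                do_operation();
        read_real_time(&finish, TIMEBASE_SZ);

        /* convert the time-base values to seconds and nanoseconds */
        time_base_to_time(&start, TIMEBASE_SZ);
        time_base_to_time(&finish, TIMEBASE_SZ);
        secs  = finish.tb_high - start.tb_high;
        nsecs = finish.tb_low  - start.tb_low;
        if (nsecs < 0) {                /* borrow from the seconds part */
                secs--;
                nsecs += 1000000000;
        }
        printf("average: %.1f ns per operation\n",
               (secs * 1e9 + nsecs) / NREPS);
        return 0;
}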
.globl .rtc_upper
.rtc_upper: mfspr 3,4 # copy RTCU to return register
br
.globl .rtc_lower
.rtc_lower: mfspr 3,5 # copy RTCL to return register
br
double second()
{
int ts, tl, tu;
ts = rtc_upper(); /* seconds */
tl = rtc_lower(); /* nanoseconds */
tu = rtc_upper(); /* Check for a carry from */
if (ts != tu) /* the lower reg to the upper. */
tl = rtc_lower(); /* Recover from the race condition. */
return ( tu + (double)tl/1000000000 );
}
The subroutine second() can be called from either a C routine or a FORTRAN routine.
Note: Depending on the length of time since the last system reset, the second.c module might yield a
varying amount of precision. The longer the time since reset, the larger the number of bits of precision
consumed by the whole-seconds part of the number. The technique shown in the first part of this
appendix avoids this problem by performing the subtraction required to obtain an elapsed time before
converting to floating point.
#include <stdio.h>
double second();

main()
{
        double t1,t2;

        t1 = second();
        my_favorite_function();
        t2 = second();
        printf("%f\n", t2 - t1);        /* elapsed time in seconds */
}
A FORTRAN routine can call second() in the same way:
      double precision t1
      double precision t2

      t1 = second()
      call my_favorite_subroutine()
      t2 = second()
      write(6,11) (t2 - t1)
11    format(f20.12)
      end
When using earlier releases, use the uname command. Running the uname -m command produces output
of the following form:
xxyyyyyymmss
where:
xx
00
yyyyyy
Unique CPU ID
mm
Model ID (the numbers to use to determine microprocessor speed)
ss
00 (Submodel)
By cross-referencing the mm values from the uname -m output with the table below, you can determine
the processor speed.
Note:
1. For systems where the uname -m command outputs a model ID of 4C, in general the only way
to determine the processor speed of a machine with a model ID of 4C is to reboot into System
Management Services and choose the system configuration options. However, in some cases, the
information gained from the uname -M command can be helpful, as shown in the following table.
2. For J-Series, R-Series, and G-Series systems, you can determine the processor speed in an MCA SMP
system from the FRU number of the microprocessor card by using the following command:
3. For the E-series and F-30 systems use the following command to determine microprocessor speed:
# lscfg -vp | pg
Part Number.................093H5280
EC Level....................00E76527
Serial Number...............17700008
FRU Number..................093H2431
Displayable Message.........CPU Card
Device Specific.(PL)........
Device Specific.(ZA)........PS=166,PB=066,PCI=033,NP=001,CL=02,PBHZ=64467000,PM=2.5,L2=1024
Device Specific.(RM)........10031997 140951 VIC97276
ROS Level and ID............03071997 135048
In the section Device Specific.(ZA), the section PS= is the processor speed expressed in MHz.
4. For F-50 and H-50 systems and the SP Silver Node, the following command output can be used to
determine the processor speed of an F-50 system:
Orca M5 CPU:
Part Number.................08L1010
EC Level....................E78405
Serial Number...............L209034579
FRU Number..................93H8945
Manufacture ID..............IBM980
Version.....................RS6K
Displayable Message.........OrcaM5 CPU DD1.3
Product Specific.(ZC).......PS=0013c9eb00,PB=0009e4f580,SB=0004f27ac0,NP=02,PF=461,PV=05,KV=01,CL=1
In the line containing Product Specific.(ZC), the entry PS= is the processor speed in
hexadecimal notation. To convert this to an actual speed, use the following conversions:
Programming considerations
There are several programming issues concerning National Language Support.
Historically, the C language has displayed a certain amount of provinciality in its interchangeable use of
the words byte and character. Thus, an array declared char foo[10] is an array of 10 bytes. But not all
of the languages in the world are written with characters that can be expressed in a single byte. Japanese
and Chinese, for example, require two or more bytes to identify a particular graphic to be displayed.
Therefore, we distinguish between a byte, which is 8 bits of data, and a character, which is the amount of
information needed to represent a single graphic.
Two characteristics of each locale are the maximum number of bytes required to express a character in
that locale and the maximum number of output display positions a single character can occupy. These
values can be obtained with the MB_CUR_MAX and MAX_DISP_WIDTH macros. If both values are 1, the
locale is one in which the equivalence of byte and character still holds. If either value is greater than 1,
programs that do character-by-character processing, or that keep track of the number of display positions
used, must use internationalization functions to do so.
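For instance, a minimal sketch (not from this manual) of such a run-time check using the standard
MB_CUR_MAX macro (MAX_DISP_WIDTH is AIX-specific and omitted here):

#include <stdlib.h>

/* Sketch: in locales where MB_CUR_MAX is 1, a byte and a character are
 * still equivalent, so byte-oriented processing remains safe. */
int byte_equals_character(void)
{
        return MB_CUR_MAX == 1;
}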
Because the multibyte encodings consist of variable numbers of bytes per character, they cannot be
processed as arrays of characters. To allow efficient coding in situations where each character has to
receive extensive processing, a fixed-byte-width data type, wchar_t, has been defined. A wchar_t is wide
enough to contain a translated form of any supported character encoding. Programmers can therefore
declare arrays of wchar_t and process them with (roughly) the same logic they would have used on an
array of char, using the wide-character analogs of the traditional libc.a functions.
Unfortunately, the translation from the multibyte form in which text is entered, stored on disk, or written
to the display, to the wchar_t form, is computationally quite expensive. It should only be performed in
situations in which the processing efficiency of the wchar_t form will more than compensate for the cost
of translation to and from the wchar_t form.
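As an illustration of paying the translation cost once (a sketch, not from this manual), multibyte text can
be converted to wchar_t form with the standard mbstowcs() subroutine before extensive processing
begins:

#include <stdlib.h>

/* Sketch: translate a multibyte string into a wchar_t array once, so that
 * later logic can index characters directly. Returns the number of
 * characters converted, or (size_t)-1 if the input is invalid. */
size_t to_wide(const char *mbs, wchar_t *wcs, size_t maxchars)
{
        return mbstowcs(wcs, mbs, maxchars);
}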
if (strcmp(foostr,"a rose") == 0)
you are not looking for "a rose" by any other name; you are looking for that set of bits only. If foostr
contains "a rosE" no match is found.
• Unequal comparisons occur when you are attempting to arrange strings in the locale-defined collation
sequence. In that case, you would use the strcoll() subroutine
and pay the performance cost of obtaining the collation information about each character.
• When a program is executed, it always starts in the C locale. If it will use one or more
internationalization functions, including accessing message catalogs, it must execute:
setlocale(LC_ALL, "");
to switch to the locale of its parent process before calling any internationalization function.
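A minimal sketch (not from this manual) combining these points:

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Switch from the default C locale to the locale of the parent
         * process before calling any internationalization function. */
        setlocale(LC_ALL, "");
        printf("%d\n", strcoll("a rose", "a rosE"));  /* locale-defined order */
        return 0;
}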
LANG=C
export LANG
sets the default locale to C (that is, C is used unless a given variable, such as LC_COLLATE, is explicitly set
to something else).
The following sequence:
LC_ALL=C
export LC_ALL
forcibly sets the C locale for all categories, overriding LANG and any individual LC_* variables.
Tunable parameters
There are many operating system parameters that can affect performance.
The parameters are described in alphabetical order within each section.
Environment variables
There are two types of environment variables: thread support tunable parameters and miscellaneous
tunable parameters.
Item Descriptor
Purpose: Maintains a list of condition variables for use by the debugger.
Values: Default: OFF. Range: ON, OFF.
Display: echo $AIXTHREAD_COND_DEBUG
This value is turned on internally, so the initial default value is not seen with the
echo command.
Diagnosis: Leaving this variable set to ON makes debugging threaded applications easier, but
might impose some overhead.
Tuning: If the program contains a large number of active condition variables and frequently
creates and destroys condition variables, this might create higher overhead for
maintaining the list of condition variables. Setting the variable to OFF disables the
list.
Item Descriptor
Purpose: Controls whether the stack guardpages are disclaimed.
Values: Default: OFF. Range: ON, OFF.
Display: echo $AIXTHREAD_DISCLAIM_GUARDPAGES
This value is turned on internally, so the initial default value is not seen with the
echo command.
Change: AIXTHREAD_DISCLAIM_GUARDPAGES={ON|OFF};export
AIXTHREAD_DISCLAIM_GUARDPAGES
Change takes effect immediately in this shell. Change is effective until logging out
of this shell. Permanent change is made by adding the
AIXTHREAD_DISCLAIM_GUARDPAGES={ON|OFF} command to the /etc/environment file.
Diagnosis: N/A
Diagnosis: Setting this parameter to ON allows for resource collection of all pthreads in a
process, but will impose some overhead.
Tuning: N/A
Item Descriptor
Purpose: Controls the number of guardpages to add to the end of the pthread stack.
Values: Default: 1 (where 1 is a decimal value for the number of pages; a page can be 4 KB, 64 KB,
and so on). Range: A positive integer n (the number of guard pages).
Display: echo $AIXTHREAD_GUARDPAGES
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: N/A
Tuning: N/A
Item Descriptor
Purpose: Controls the minimum number of kernel threads that should be used.
Values: Default: 8. Range: A positive integer value.
Display: echo $AIXTHREAD_MINKTHREADS
This is turned on internally, so the initial default value will not be seen with the echo
command.
Item Descriptor
Purpose: Controls the scaling factor of the library. This ratio is used when creating and
terminating pthreads.
Values: Default: 8:1. Range: Two positive values (p:k), where k is the number of kernel
threads that should be employed to handle the number of executable pthreads
defined in the p variable.
Display: echo $AIXTHREAD_MNRATIO
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: N/A
Tuning: Might be useful for applications with a very large number of threads. However,
always test a ratio of 1:1 because it might provide better performance.
Item Descriptor
Purpose: Maintains a list of active mutexes for use by the debugger.
Values: Default: OFF. Possible values: ON, OFF.
Display: echo $AIXTHREAD_MUTEX_DEBUG
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: Setting the variable to ON makes debugging threaded applications easier, but might
impose some overhead.
Tuning: If the program contains a large number of active mutexes and frequently creates
and destroys mutexes, this might create higher overhead for maintaining the list of
mutexes. Leaving the variable set to OFF disables the list.
Diagnosis: Setting the variable to ON forces threaded applications to use an optimized mutex
locking mechanism, resulting in increased performance.
Tuning: If the program experiences performance degradation due to heavy mutex
contention, then setting this variable to ON will force the pthread library to use an
optimized mutex locking mechanism that works only on process private mutexes.
These process private mutexes must be initialized using the pthread_mutex_init
routine and must be destroyed using the pthread_mutex_destroy routine.
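A minimal sketch (not from this manual) of a process-private mutex that meets this requirement:

#include <pthread.h>

/* Sketch: a mutex explicitly initialized with pthread_mutex_init and
 * destroyed with pthread_mutex_destroy, as the optimized locking
 * mechanism described above requires. */
static pthread_mutex_t lock;

void setup(void)
{
        pthread_mutex_init(&lock, NULL);  /* default attributes: process private */
}

void teardown(void)
{
        pthread_mutex_destroy(&lock);
}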
Item Descriptor
Purpose: Controls the read access to guardpages that are added to the end of the pthread
stack.
Values: Default: OFF. Range: ON, OFF.
Display: echo $AIXTHREAD_READ_GUARDPAGES
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: N/A
Tuning: N/A
Diagnosis: Setting this parameter to ON makes debugging threaded applications easier, but
might impose some overhead.
Tuning: If the program contains a large number of active read-write locks and frequently
creates and destroys read-write locks, this might create higher overhead for
maintaining the list of read-write locks. Setting the variable to OFF will disable the
list.
Item Descriptor
Purpose: Prevents deadlock in applications that use the following routines with the
pthread_suspend_np or pthread_suspend_others_np routines:
• pthread_getrusage_np
• pthread_cancel
• pthread_detach
• pthread_join
• pthread_getunique_np
• pthread_join_np
• pthread_setschedparam
• pthread_getschedparam
• pthread_kill
Diagnosis: If fewer threads are being dispatched than expected, try system scope.
Tuning: Tests show that certain applications can perform better with system-based
contention scope (S). The use of this environment variable impacts only those
threads created with the default attribute. The default attribute is employed when
the attr parameter to pthread_create is NULL.
Item Descriptor
Purpose: Controls the number of kernel threads that should be held in reserve for sleeping
threads.
Values: Default: 1:12. Range: Two positive values (k:p), where k is the number of kernel
threads that should be held in reserve for p sleeping pthreads.
Display: echo $AIXTHREAD_SLPRATIO
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: N/A
Tuning: In general, fewer kernel threads are required to support sleeping pthreads, because
they are generally woken one at a time. This conserves kernel resources.
Diagnosis: If analysis of a failing program indicates stack overflow, the default stack size can be
increased.
Tuning: If trying to reach the 32 000 thread limit on a 32-bit application, it might be
necessary to decrease the default stack size.
16. AIXTHREAD_AFFINITY
Item Descriptor
Purpose: Controls the placement of pthread structures, stacks, and thread-local storage on an
enhanced affinity enabled system.
Values: Default: existing. Range: existing, always, attempt.
Display: echo $AIXTHREAD_AFFINITY
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: Setting the variable to strict will improve the performance of threads, however, at
the cost of additional start-up time.
Setting the variable to default maintains the previous balanced implementation.
Setting the variable to first-touch will balance the start-up performance costs along
with the run-time benefits.
Tuning: If threads are expected to be long-running, then setting the variable to strict will
improve performance. However, if the application creates a large number of short-running
threads, set the variable to either default or first touch.
Item Descriptor
Purpose: Enables buckets-based extension in the default memory allocator that might
enhance performance of applications that issue large numbers of small allocation
requests.
Item Descriptor
Purpose: Changes the default number of run queues.
Values: Default: Number of active processors found at run time. Range: A positive integer.
Display: echo $NUM_RUNQ
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: N/A
Tuning: N/A
Diagnosis: N/A
Tuning: N/A
Item Descriptor
Purpose: Controls the number of times to retry a busy lock before yielding to another
processor (only for libpthreads).
Values: Default: 1 on uniprocessors, 40 on multiprocessors. Range: A positive integer.
Display: echo $SPINLOOPTIME
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: If threads are going to sleep often (a lot of idle time), then the SPINLOOPTIME might
not be high enough.
Tuning: Increasing the value from the default of 40 on multiprocessor systems might be of
benefit if there is pthread mutex contention.
Item Descriptor
Purpose: Tunes the number of times it takes to create a VP during activation timeouts.
Values: Default: DEF_STEPTIME. Range: A positive integer.
Display: echo $STEP_TIME
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: N/A
Tuning: N/A
Diagnosis: N/A
Tuning: N/A
Item Descriptor
Purpose: Controls the number of times to yield the processor before blocking on a busy lock
(only for libpthreads). The processor is yielded to another kernel thread, assuming
there is another runnable kernel thread with sufficient priority.
Values: Default: 0. Range: A positive value.
Display: echo $YIELDLOOPTIME
This is turned on internally, so the initial default value will not be seen with the echo
command.
Diagnosis: If threads are going to sleep often (a lot of idle time), then the YIELDLOOPTIME might
not be high enough.
Tuning: Increasing the value from the default value of 0 may benefit you if you do not want
the threads to go to sleep while they are waiting for locks.
Purpose: Stores a fixed copy of the TZ variable for the length of a process.
Display: $AIX_TZCACHE
Diagnosis: This parameter is not recommended for universal system configuration in the /etc/environment file. Use this
parameter for applications that do not alter the TZ variable, but make frequent time zone requests.
2. EXTSHM
Item Descriptor
Tuning: Setting the value to ON, 1SEG, or MSEG allows a process to allocate shared memory segments as small as 1 byte, rounded to
the nearest page. This effectively removes the limitation of 11 user shared memory segments. For 32-bit processes, the
maximum size of all memory segments is 2.75 GB.
Setting EXTSHM to ON has the same effect as setting the variable to 1SEG. With either setting, any shared memory less
than 256 MB is created internally as a mmap segment, and thus has the same performance implications of mmap. Any
shared memory greater or equal to 256 MB is created internally as a working segment.
If EXTSHM is set to MSEG, all shared memory is created internally as a mmap segment, allowing for better memory
utilization.
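For illustration, a sketch (not from this manual) of the kind of small allocation this makes practical,
assuming EXTSHM=ON was exported before the process started:

#include <sys/ipc.h>
#include <sys/shm.h>

/* Sketch: a 1-byte request is rounded up to a page and backed by an
 * mmap segment instead of consuming one of the 11 user segments. */
void *tiny_shared(void)
{
        int id = shmget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        return (id == -1) ? (void *)-1 : shmat(id, 0, 0);
}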
Change: LDR_CNTRL={PREREAD_SHLIB | LOADPUBLIC | ...}; export LDR_CNTRL. Change takes effect immediately in this shell.
Change is effective until logging out of this shell. Permanent change is made by adding the following line to the /etc/
environment file: LDR_CNTRL={PREREAD_SHLIB | LOADPUBLIC | ...}
Tuning: The LDR_CNTRL environment variable can be used to control one or more aspects of the system loader behavior.
You can specify multiple options with the LDR_CNTRL variable. When specifying the option, separate the options with
the '@' sign. An example of specifying multiple options is: LDR_CNTRL=PREREAD_SHLIB@LOADPUBLIC. Specifying the
PREREAD_SHLIB option causes entire libraries to be read as soon as they are accessed. When VMM read-ahead is tuned, a
library can be read from the disk and be cached in memory by the time the program starts to access its pages. While this
method might use more memory, it might also enhance the performance of programs that use many shared library pages if
the access pattern is non-sequential (for example, Catia).
Specifying the LOADPUBLIC option directs the system loader to load all modules requested by an application into the global
shared library segment. If a module cannot be loaded publicly into the global shared library segment then it is loaded
privately for the application.
Specifying the IGNOREUNLOAD option prevents the application from unloading libraries. This specification might prevent
memory fragmentation and eliminate the overhead incurred when libraries are repeatedly loaded and unloaded. If you do
not specify the IGNOREUNLOAD option, you might end up with two data instances of a module if the module was loaded at
application load time and the module was then requested to be dynamically loaded and unloaded multiple times.
Specifying the USERREGS option tells the system to save all general-purpose user registers across system calls made by an
application. This can be helpful in applications doing garbage collection.
Specifying the MAXDATA option sets the maximum heap size for a process, which includes overriding any maxdata value
that is specified in the executable. The maxdata value is used to set the initial soft data resource limit of the process. For
32-bit programs, a non-zero maxdata value enables the large address-space mode; see Large Program Support. To disable
the large address-space model, specify a maxdata value of zero by setting LDR_CNTRL=MAXDATA=0. For 64-bit programs,
the maxdata value provides a guaranteed maximum size for the data heap of the program. The portion of the address
space reserved for the heap cannot be used by the shmat() or mmap() subroutines, even if an explicit address is provided.
Any value can be specified, but the data area cannot extend past 0x06FFFFFFFFFFFFFF regardless of the maxdata value
specified.
The two additional maxdata options exist to allow finer control based on whether the process is 32-bit or 64-bit. These
additional maxdata options override the MAXDATA option for the corresponding object mode. Specifying the MAXDATA32
option results in identical behavior to MAXDATA except that the value is ignored for 64-bit processes. Specifying the
MAXDATA64 option results in identical behavior to MAXDATA except that the value is ignored for 32-bit processes.
Specifying the PRIVSEG_LOADS option directs the system loader to put dynamically loaded private modules into the
process private segment. This specification might improve the availability of memory in large memory model applications
that perform private dynamic loads and tend to run out of memory in the process heap. If the process private segment lacks
sufficient space, the PRIVSEG_LOADS option has no effect. The PRIVSEG_LOADS option is only valid for 32-bit applications
with a non-zero MAXDATA value.
Specifying the DATA_START_STAGGER=Y option starts the data section of the process at a per-MCM offset that is controlled
by the data_stagger_interval option of the vmo command. The nth large-page data process executed on a given MCM has
its data section start at offset (n * data_stagger_interval * PAGESIZE) % 16 MB. The DATA_START_STAGGER=Y option is
only valid for 64-bit processes on a 64-bit kernel.
Specifying the LARGE_PAGE_TEXT=Y option indicates that the loader might attempt to use large pages for the text segment
of the process. The LARGE_PAGE_TEXT=Y option is only valid for 64-bit processes on a 64-bit kernel.
Specifying the LARGE_PAGE_DATA=M option allocates only enough large pages for the data segment up to the brk value,
rather than the entire segment, which is the behavior when you do not specify the LARGE_PAGE_DATA=M option. Changes
to the brk value might fail if there are not enough large pages to support the change to the brk value.
Specifying the RESOLVEALL option forces the loader to resolve all undefined symbols that are imported at program load
time or when the program loads the dynamic modules. Symbol resolution is performed in the standard AIX depth-first
order. If you specify LDR_CNTRL=RESOLVEALL and the imported symbols cannot be resolved, the program or the dynamic
modules fail to load.
Specifying the HUGE_EXEC option provides user control over the process address space location of the read-only segments
for certain 32-bit executables. For more information see 32-bit Huge Executable.
Specifying the NAMEDSHLIB=name,[attr1],[attr2]...[attrN] option enables a process to access or create a shared
library area that is identified by the name that is specified. You can create a named shared library area with the following
methods:
• With no attributes
• With the doubletext32 attribute, which creates the named shared library area with two segments dedicated to shared
library text
If a process requests the use of a named shared library area that does not exist, the shared library area is automatically
created with the name that is specified. If an invalid name is specified, the NAMEDSHLIB=name,[attr1],[attr2]...
[attrN] option is ignored. Valid names are of positive length and contain only alphanumeric, underscore, and period
characters.
Specifying the SHARED_SYMTAB=Y option causes the system to create a shared symbol table for a 64-bit program, if the
program exports any symbols. If multiple instances of the program run concurrently, using a shared symbol table can
reduce the amount of system memory that is required by the program.
Specifying the SHARED_SYMTAB=N option prevents the system from creating a shared symbol table for a 64-bit program.
This option overrides the AOUT_SHR_SYMTAB flag in the XCOFF auxiliary header.
Specifying the SED option sets the stack execution disable (SED) mode for the process, by ignoring any other SED mode that
is specified by the executable. This option must be set to one of the following values:
SED=system
SED=request
SED=exempt
Purpose: Requests preloading of shared libraries. The LDR_PRELOAD option is for 32-bit processes, and the LDR_PRELOAD64
option is for 64-bit processes. During symbol resolution, the preloaded libraries listed in this variable are searched first for
every imported symbol, and only when a symbol is not found in those libraries is the normal search used. Preempting of
symbols from preloaded libraries works for both AIX default linking and run-time linking. Deferred symbol resolution is
unchanged.
Change: LDR_PRELOAD="libx.so:liby.a(shr.o)"
Resolves any symbols needed first from the libx.so shared object, then from the shr.o member of liby.a, and
finally within the process' dependencies. All dynamically loaded modules (modules loaded with dlopen() or load()) will
also be resolved first from the preloaded libraries listed by the variable.
5. NODISCLAIM
Item Descriptor
Purpose: Controls how calls to free() are handled. When PSALLOC is set to early, all free() calls result in a disclaim() system
call. When NODISCLAIM is set to true, this does not occur.
Diagnosis: If the number of disclaim() system calls is very high, you might want to set this variable.
Tuning: Setting this variable eliminates the disclaim() calls made by free() if PSALLOC is set to early.
Purpose: Sets the PSALLOC environment variable to determine the paging-space allocation policy.
Tuning: To ensure that a process is not killed due to low paging conditions, this process can preallocate paging space by using
the Early Page Space Allocation policy. However, this might result in wasted paging space. You might also want to set the
NODISCLAIM environment variable.
Refer to: “Allocation and reclamation of paging space slots ” on page 47 and “Early page space
allocation ” on page 141
8. RT_GRQ
Item Descriptor
Purpose: Causes the thread to be put on a global run queue rather than on a per-CPU run queue.
Tuning: May be tuned on multiprocessor systems. Setting this variable to ON will cause the thread to be put in a global run queue.
In that case, the global run queue is searched to see which thread has the best priority. This might allow the system
to get the thread dispatched sooner and can improve performance for threads that are running SCHED_OTHER and are
interrupt driven.
Purpose: When you are running the kernel in real-time mode (see bosdebug command), an MPC can be sent to a different CPU to
interrupt it if a better priority thread is runnable so that this thread can be dispatched immediately.
10. TZ
Item Descriptor
Tuning: POSIX may be used by applications that are performance sensitive and do not rely on accurate changes to time zone
rules and Daylight Saving Time.
11. VMM_CNTRL
Item Descriptor
MMAP_ANON_PSIZE
Tuning: The VMM_CNTRL environment variable can be used to control the virtual memory manager. You can specify multiple
options with the VMM_CNTRL environment variable by separating the options with the '@' sign. An example that
specifies multiple options follows:
VMM_CNTRL=vmm_fork_policy=COW@SHM_1TB_SHARED=5
When you specify the vmm_fork_policy=COW option, the vmm uses the copy-on-write fork-tree policy whenever
a process is forked. This is the default behavior. To prevent the vmm from using the copy-on-write policy, use the
vmm_fork_policy=COR option. If the vmm_fork_policy option is specified, the global vmm_fork_policy tunable is
ignored.
If ESID_ALLOCATOR option is specified, it controls the allocator from undirected shmat and mmap allocations. See “1 TB
Segment Aliasing” on page 146 for detailed information.
If SHM_1TB_SHARED or SHM_1TB_UNSHARED is specified, it controls the use of 1 TB shared memory regions. See “1 TB
Segment Aliasing” on page 146 for detailed information.
If SHM_AUTO_1TB is specified, it controls the autonomic promotion of 1TB segment size for shared memory regions. See
Creating shared memory objects with 1 TB segment size for detailed information.
If the VMM_CNTRL environment variable is set to MMAP_ANON_PSIZE=64K, the anonymous memory regions are
supported by 64 KB page size. This setting affects all the anonymous memory regions that are created for the process for
the duration that the environment variable is set. By default, the anonymous memory regions are supported by 4 KB page
size.
12. AIX_STDBUFSZ
Item Descriptor
Purpose: Configures the I/O buffer size for the read and write system calls generated by the cp, mv, cat, and cpio commands.
This is also applicable for stream buffering.
13. AIX_LDSYM
Item Descriptor
Purpose: The source line information in a Lightweight_core file is not displayed by default when the text page size is 64 KB.
Set the environment variable AIX_LDSYM=ON to include the source line information in the
Lightweight_core file.
Tuning: Use this parameter for applications that have a 64 KB text page size and need source line information in the
Lightweight_core file.
14. AIX_CWD_CACHE
Item Descriptor
Purpose: Disables the caching algorithm that is used by the getcwd and getwd subroutines to retrieve the path name of the
current working directory.
LDR_CNTRL=[...@]HUGE_EXEC={<segno>|0}[,<attribute>][@...]
where segno is either the requested starting segment number in the 32-bit process address space or zero.
If you specify a non-zero segno value, the system loader attempts to insert the huge executable read-only
segments into the process address space at the location corresponding to the requested starting segment
number.
If you specify a zero segno value (or in the absence of the HUGE_EXEC option to LDR_CNTRL), the
system loader selects the starting segment number based on the address space model. The algorithm
used to perform this selection is similar to the MAP_VARIABLE flag of the mmap subroutine:
• If neither Dynamic Segment Allocation (DSA) nor large page data is requested by the process, the system
chooses the set of consecutive segments immediately following the process heap.
• Otherwise, the system chooses the set of consecutive segments immediately below the lowest shared
library area segment, if any.
The starting segment number must not conflict with any segments already reserved by the requested
process address space model. Determining whether such a conflict exists requires that process heap
and shared library area segments, if any, are taken into account. In cases where the process heap
An optional attribute to the HUGE_EXEC loader control option allows you to request that the shared
library text segment be placed into segment 0x1 rather than 0xD:
HUGE_EXEC={<segno>|0},shtext_in_one
Since the shared library area's pre-relocated data is useful only when the shared text segment resides
in segment 0xD, processes that request this option do not have the benefit of pre-relocated library data.
Consequently, any shared library data resides in the process heap. This has the benefit of freeing up all of
segments 0x3–0xF for divided use by the process heap (mmap/shmat), and the executable.
Note: The shtext_in_one attribute used in conjunction with maxdata and DSA settings
that would normally preclude a process from utilizing a shared library area (for example,
maxdata>0xA0000000/dsa or maxdata=0/dsa), allows the process to take advantage of the
performance benefits that shared library text provides.
If the process's shared library area is a named area created with the doubletext32 attribute, then there
is no pre-relocated data segment and both shared library text segments must be used. In this case,
the primary segment (normally located in segment 0xD) is moved to segment 0x1 and the secondary
shared library text segment remains in segment 0xF. This maximizes the number of consecutive segments
(0x3–0xE) that can be divided for use by the process heap (mmap/shmat), and the executable.
While non-huge executables with maxdata values greater than 0xA0000000 and DSA enabled are
prevented from using shared library areas in all cases, a huge executable that (1) uses a named shared
library area created with the doubletext32 attribute; and (2) specifies the shtext_in_one attribute, can
request a maxdata value of up to 0xC0000000 before forfeiting accessibility to the area.
You can see from this that segments 0x4–0xA are available for the executable.
Assuming that the executable size is greater than 256 MB and less than 512 MB, ideal HUGE_EXEC
settings for this situation are as follows:
1. HUGE_EXEC=0x8
2. HUGE_EXEC=0x9
Option 1 would insert the executable into segments 0x8–0x9, while option 2 would insert the executable
into segments 0x9–0xA.
Note: A HUGE_EXEC=0 setting would not be appropriate for this customer since the system would choose
segments 0xB–0xC for the executable (because of DSA). This would prevent those segments from being
available for shmat/mmap after the exec. Setting HUGE_EXEC to any of 0x4, 0x5, 0x6, or 0x7 segments,
while allowing the insertion to occur as requested, would result in limiting process heap growth to the
segment just below the requested starting segment.
Very large program address space model without access to shared library area
example
If your preferred address space model is as follows:
• MAXDATA=0xB0000000 DSA
• No shmat/mmap regions
• No shared library area accessibility
You can see from this that segments 0x4–0xF are available for the executable.
Assuming that the executable size is greater than 256 MB and less than 512 MB, ideal HUGE_EXEC
settings for this situation are as follows:
1. HUGE_EXEC=0
2. HUGE_EXEC=0xE
Both options would insert the executable into segments 0xE–0xF.
Note: Setting a HUGE_EXEC to any of the 0x4-0xD segments, while allowing the insertion to occur as
requested, would result in limiting process heap growth to the segment just below the requested starting
segment.
You can see from this that segments 0x3–0xC are available for the executable.
Assuming that the executable size is greater than 256 MB and less than 512 MB, ideal HUGE_EXEC
settings for this situation are as follows:
1. HUGE_EXEC=0
2. HUGE_EXEC=0x3
...
10. HUGE_EXEC=0xB
Options 1 and 2 have identical results – inserting the executable into segments 0x3–0x4.
You can see from this that segments 0xA–0xB are available for the executable.
Assuming that the executable size is greater than 256 MB and less than 512 MB, ideal HUGE_EXEC
settings for this situation are as follows:
1. HUGE_EXEC=0,shtext_in_one
2. HUGE_EXEC=0xA,shtext_in_one
Both options would insert the executable into segments 0xA–0xB and shared library text into segment
0x1.
Note: Setting a HUGE_EXEC to any of 0xB–0xE, while allowing the insertion to occur as requested, would
prevent some of segments 0xC–0xF from being available for shmat/mmap after the executable.
You can see from this that segments 0xC–0xE are available for the executable.
Item Descriptor
Purpose: Used to explicitly include or exclude a process from ASO optimization.
Values: Default: ASO optimizes a process if it satisfies the ASO optimization criteria.
Possible values: ALWAYS or NEVER
ALWAYS - ASO prioritizes this process.
NEVER - ASO does not optimize this process.
Diagnosis: N/A
Tuning: N/A
Modifications
AIX provides a flexible and centralized mode for setting most of the AIX kernel tuning parameters.
It is now possible to make permanent changes without editing any rc files. This is achieved by placing
the reboot values for all tunable parameters in a new /etc/tunables/nextboot stanza file. When the
machine is rebooted, the values in that file are automatically applied.
The /etc/tunables/lastboot stanza file is automatically generated with all the values that were set
immediately after the reboot. This provides the ability to return to those values at any time.
The /etc/tunables/lastboot.log log file records any changes made or that could not be made during reboot.
The following commands are available to modify the tunables files:
All of the above commands work on both current and reboot tunable parameter values. For more
information, see their respective man pages.
For more information about any of these kernel tuning parameter modifications, see the Kernel Tuning
section in Performance Tools Guide and Reference.
The previous vmtune options, their usage, and the new commands are as follows:

vmtune option   Usage                                                    New command
-C 0|1          page coloring                                            vmo -r -o pagecoloring=0|1
-g n1 -L n2     large page size, number of large pages to reserve        vmo -r -o lgpg_size=n1 -o lgpg_regions=n2
-v n            number of frames per memory pool                         vmo -r -o framesets=n
-i n            interval for special data segment identifiers            vmo -r -o spec_dataseg_int=n
-V n            number of special data segment identifiers to reserve    vmo -r -o num_spec_dataseg=n
-y 0|1          p690 memory affinity                                     vmo -r -o memory_affinity=0|1
The vmtune and schedtune compatibility scripts do not ship with AIX. You can refer to the following
tables to migrate your settings to the new commands:
vmtune option   vmo equivalent            ioo equivalent               Function
-b number                                 -o numfsbuf=number           Sets the number of file system bufstructs.
-B number                                 -o hd_pbuf_cnt=number        This parameter has been replaced with the pv_min_pbuf parameter.
-c number                                 -o numclust=number           Sets the number of 16 KB clusters processed by write behind.
-C 0|1          -r -o pagecoloring=0|1                                  Disables or enables page coloring for specific hardware platforms.
-d 0|1          -o deffps=0|1                                           Turns deferred paging space allocation on and off.
-e 0|1                                    -o jfs_clread_enabled=0|1    Controls whether JFS uses clustered reads on all files.
# lsattr -E -l sys0
When the compatibility mode is disabled, the following no command parameters, which are all of type
Reboot (that is, they can be changed only during reboot), cannot be changed without using the
-r flag:
• arptab_bsiz
• arptab_nb
• extendednetstats
• ifsize
• inet_stack_size
• ipqmaxlen
• nstrpush
• pseintrstack
# tunrestore -r -f lastboot
This copies the content of the lastboot file to the nextboot file. For details about the tuning mode, see
the Kernel tuning section in the Performance Tools Guide and Reference.
Item Descriptor
Purpose: Specifies the maximum number of processes per user ID.
Values: Default: 40; Range: 1 to 131072
Display: lsattr -E -l sys0 -a maxuproc
Change: chdev -l sys0 -a maxuproc=NewValue. Change takes effect immediately and is
preserved over boot. If the value is reduced, it goes into effect only after a system
boot.
Diagnosis: Users cannot fork any additional processes.
Tuning: This is a safeguard to prevent users from creating too many processes.
2. Tuning the ncargs parameter:
Item Descriptor
Purpose: Specifies the maximum allowable size of the ARG/ENV list (in 4 KB blocks) when
running exec() subroutines.
Values: Default: 256; Range: 256 to 1024
Display: lsattr -E -l sys0 -a ncargs
Change: chdev -l sys0 -a ncargs=NewValue. Change takes effect immediately and is
preserved over boot.
Item Descriptor
Purpose: Specifies the maximum number of pending I/Os to a file.
Values: Default: 8193; Range: 0 to n (n should be a multiple of 4, plus 1)
Display: lsattr -E -l sys0 -a maxpout
Change: chdev -l sys0 -a maxpout=NewValue. Change is effective immediately and is
permanent. If the -T flag is used, the change is immediate and lasts until the
next boot. If the -P flag is used, the change is deferred until the next boot and is
permanent.
Item Descriptor
Purpose: Specifies the point at which programs that have reached maxpout can resume writing
to the file.
Values: Default: 4096; Range: 0 to n (n should be a multiple of 4 and should be at least 4 less
than maxpout)
Display: lsattr -E -l sys0 -a minpout
Change: chdev -l sys0 -a minpout=NewValue. Change is effective immediately and is
permanent. If the -T flag is used, the change is immediate and lasts until the
next boot. If the -P flag is used, the change is deferred until the next boot and is
permanent.
Diagnosis: If the foreground response time sometimes deteriorates when programs with large
amounts of sequential disk output are running, disk I/O might need to be paced more
aggressively. If sequential performance deteriorates unacceptably, I/O pacing might
need to be decreased or disabled.
Tuning: If the foreground performance is unacceptable, decrease the values of both maxpout
and minpout. If sequential performance deteriorates unacceptably, increase one or
both, or set them both to 0 to disable I/O pacing.
4. mount -o nointegrity
Item Descriptor
Purpose: A new mount option (nointegrity) might enhance local file system performance for
certain write-intensive applications. This optimization basically eliminates writes to
the JFS log. Note that the enhanced performance is achieved at the expense of
metadata integrity. Therefore, use this option with extreme caution because a system
crash can make a file system mounted with this option unrecoverable. Nevertheless,
certain classes of applications do not require file data to remain consistent after a
system crash, and these may benefit from using the nointegrity option. Two examples
in which a nointegrity file system may be beneficial is for compiler temporary files,
and for doing a nonmigration or mksysb installation.
5. Paging Space Size
Item Descriptor
Purpose: The amount of disk space required to hold pages of working storage.
Values: Default: configuration-dependent; Range: 32 MB to n MB for hd6, 16 MB to n MB for
non-hd6
Display: lsps -a
Change: mkps, chps, or smitty pgsp. Change is effective immediately and is permanent. Paging
space is not necessarily put into use immediately, however.
Item Descriptor
Purpose: The time between sync() calls by syncd.
Values: Default: 60; Range: 1 to any positive integer
Display: grep syncd /sbin/rc.boot
Change: vi /sbin/rc.boot. Change is effective at next boot and is permanent. An alternate
method is to use the kill command to terminate the syncd daemon and restart it from the
command line with the command /usr/sbin/syncd interval.
Diagnosis: I/O to a file is blocked when syncd is running.
Tuning: At its default level, this parameter has little performance cost. No change is
recommended. Significant reductions in the syncd interval in the interests of data
integrity (as for HACMP) could have adverse performance consequences.
Item Description
minservers Indicates the minimum number of kernel processes per processor dedicated to AIO
processing. Because each kernel process uses memory, the minservers tunable
value, when multiplied by the number of processors, must not be large when the
amount of AIO expected is small. The default value for the minservers tunable is 3.
maxservers Indicates the maximum number of kernel processes per processor that are
dedicated to AIO processing. This tunable value, when multiplied by the number
of processors, indicates the limit on slow path I/O requests in progress at one time
and represents the limit for possible I/O concurrency. The default value for the
maxservers tunable is 30.
maxreqs Indicates the maximum number of AIO requests that can be outstanding at one time.
The requests include those that are in progress, as well as those that are waiting
to be started. The maximum number of AIO requests cannot be less than the value
of AIO_MAX, as defined in the /usr/include/sys/limits.h file, but it can be
greater. It is appropriate for a system with a high volume of AIO to have a maximum
number of AIO requests larger than AIO_MAX. The default value for the maxreqs
tunable is 16384.
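For context, the following sketch (not from this manual) shows the kind of asynchronous request these
tunables govern, assuming the POSIX AIO API:

#include <aio.h>
#include <string.h>

/* Sketch: issue one asynchronous read; each such outstanding request
 * counts against maxreqs and may be serviced by an AIO kernel process. */
int start_read(int fd, char *buf, size_t len, struct aiocb *cb)
{
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = 0;
        return aio_read(cb);              /* returns 0 if queued */
}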
Item Description
Purpose: Maximum number of requests that can be outstanding on a SCSI bus. (Applies only to
the SCSI-2 Fast or Wide Adapter.)
Values: Default: 40; Range: 40 - 128
Display: lsattr -E -l scsin -a num_cmd_elems
Change: chdev -l scsin -a num_cmd_elems=NewValue
Change is effective immediately and is permanent. If the -T flag is used, the change is
immediate and lasts until the next boot. If the -P flag is used, the change is deferred
until the next boot and is permanent.
Diagnosis: Applications performing large writes to striped raw logical volumes are not obtaining
the desired throughput rate.
Tuning: Value should equal the number of physical drives (including those in disk arrays) on
the SCSI bus, times the queue depth of the individual drives.
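For example, for a bus with eight drives that each have a queue depth of 12 (the values are
illustrative):
lsattr -E -l scsi0 -a num_cmd_elems
chdev -l scsi0 -a num_cmd_elems=96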
2. Disk Drive Queue Depth
Item Description
Purpose: Maximum number of requests the disk device can hold in its queue.
Values: Default: IBM disks=3; Non-IBM disks=0; Range: specified by manufacturer
Display: lsattr -E -l hdiskn
Change: chdev -l hdiskn -a q_type=simple -a queue_depth=NewValue
Change is effective immediately and is permanent. If the -T flag is used, the change is
immediate and lasts until the next boot. If the -P flag is used, the change is deferred
until the next boot and is permanent.
Diagnosis: N/A
Tuning: If the non-IBM disk drive is capable of request-queuing, make this change to ensure
that the operating system takes advantage of the capability.
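For example, to set a queue depth of 8 on a request-queuing non-IBM drive (take the value from the
manufacturer's specification):
chdev -l hdisk3 -a q_type=simple -a queue_depth=8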
3. Fibre Channel Adapter Outstanding-Requests Limit (num_cmd_elems)
Item Description
Purpose: Maximum number of pending requests in a Fibre Channel adapter.
Values: Default: 1024; Range: 200 - 4096
To change this attribute immediately, the fcsn adapter must be in a defined state.
Otherwise, the -P flag is used to change the attribute. The -P flag defers the change
until the next boot operation, and the change is permanent.
For Fibre Channel adapters that have data rates of 16 Gbps or more, the fcsn
adapter does not need to be in a defined state; you can update the num_cmd_elems
attribute by using the chdev command.
Note: The default value and the range value vary for each Fibre Channel device.
For some Fibre Channel and Fibre Channel over Ethernet (FC/FCoE) adapters, the
maximum value of the num_cmd_elems parameter that can be set might be less than
the maximum range mentioned in the Object Data Manager (ODM). If the specified
value for the num_cmd_elems parameter of the chdev command is larger than the
value supported by the FC/FCoE adapters, an error message is logged for these
adapters.
Tuning: To get optimum performance, set the value of the num_cmd_elems parameter to the
maximum supported range.
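For example, one way to change the attribute on an adapter that must be in a defined state (the
device names and value are illustrative):
rmdev -l fcs0 -R
chdev -l fcs0 -a num_cmd_elems=2048
cfgmgr
Alternatively, chdev -l fcs0 -a num_cmd_elems=2048 -P defers the change until the next boot.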
4. fast_lnk_recov
Item Description
Purpose: Controls quick I/O failure when a Fibre Channel link fails, for Fibre Channel
adapters that have data rates of 16 Gbps or more.
Values: The following values are supported:
Yes
The Fibre Channel SCSI protocol driver quickly fails over I/O operations to the
SCSI device driver when the host or storage link is down. This failover operation
helps the SCSI device driver to switch I/O operations to alternative paths.
No
Disables the Quick I/O failure for FC link failure attribute. This is the default value.
2. Tuning the msgmnb parameter:
Item Descriptor
Purpose: Specifies maximum number of bytes on queue.
Tuning: For tuning and other information, see the ipc_msgmnb tunable parameter that is
managed by the vmo command.
3. Tuning the msgmni parameter:
Item Descriptor
Purpose: Specifies maximum number of message queue IDs.
Values: Dynamic with maximum value of 131072
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
4. Tuning the msgmnm parameter:
Item Descriptor
Purpose: Specifies maximum number of messages per queue.
Values: Dynamic with maximum value of 524288
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
5. Tuning the semaem parameter:
Item Descriptor
Purpose: Specifies maximum value for adjustment on exit.
Values: Dynamic with maximum value of 16384
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
7. Tuning the semmsl parameter:
Item Descriptor
Purpose: Specifies maximum number of semaphores per ID.
Values: Dynamic with maximum value of 65535
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
8. Tuning the semopm parameter:
Item Descriptor
Purpose: Specifies maximum number of operations per semop() call.
Values: Dynamic with maximum value of 1024
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
9. Tuning the semume parameter:
Item Descriptor
Purpose: Specifies maximum number of undo entries per process.
Values: Dynamic with maximum value of 1024
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
10. Tuning the semvmx parameter:
Item Descriptor
Purpose: Specifies maximum value of a semaphore.
Values: Dynamic with maximum value of 32767
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
11. Tuning the shmmax parameter:
Item Descriptor
Purpose: Specifies maximum shared memory segment size.
Values: Dynamic with maximum value of 256 MB for 32-bit processes and 0x80000000u for
64-bit
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
12. Tuning the shmmin parameter:
Item Descriptor
Purpose: Specifies minimum shared-memory-segment size.
Values: Dynamic with minimum value of 1
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
13. Tuning the shmmni parameter:
Item Descriptor
Purpose: Specifies maximum number of shared memory IDs.
Values: Dynamic with maximum value of 1048576
Display: N/A
Change: N/A
Diagnosis: N/A
Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Network tunable parameters
1. maxmbuf
Diagnosis: N/A
Tuning: If maxmbuf is greater than 0, the maxmbuf value is used regardless of the value
of thewall. The upper limit on mbufs is the higher value of maxmbuf or thewall.
Refer to: “netstat -m command to monitor mbuf pools” on page 267
2. MTU
Item Descriptor
Purpose: Limits the size of packets that are transmitted on the network.
Values: Default: configuration-dependent
Display: lsattr -E -l interface_name
Change: chdev -l interface_name -a mtu=NewValue
With the chdev command, the interface cannot be changed while it is in use.
Change is effective across reboots. An alternate method is as follows:
ifconfig interface_name mtu NewValue
This changes the MTU size on a running system, but the value is not preserved
across a system reboot.
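For example, to raise the MTU on a Gigabit Ethernet interface until the next reboot (the interface
name and value are illustrative):
ifconfig en0 mtu 9000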
3. rfc1323
Item Descriptor
Purpose: Enables TCP enhancements as specified by RFC 1323 (TCP Extensions for High
Performance). Value of 1 indicates that tcp_sendspace and tcp_recvspace can
exceed 64 KB.
Values: Default: 0; Range 0 to 1
Display: lsattr -El interface or ifconfig interface
Change: ifconfig interface rfc1323 NewValue OR chdev -l interface -a
rfc1323=NewValue
The ifconfig command sets values temporarily, making it useful for testing.
The chdev command alters the ODM, so customized values persist across
system reboots.
Diagnosis: N/A
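For example, to enable the extensions temporarily for testing, and then permanently (the interface
name is illustrative):
ifconfig en0 rfc1323 1
chdev -l en0 -a rfc1323=1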
4. tcp_mssdflt
Item Descriptor
Purpose: Default maximum segment size used in communicating with remote networks.
Values: Default: 512 bytes
Display: lsattr -El interface or ifconfig interface
Change: ifconfig interface tcp_mssdflt NewValue OR chdev -l interface -a
tcp_mssdflt=NewValue
The ifconfig command sets values temporarily, making it useful for testing.
The chdev command alters the ODM, so customized values persist across
system reboots.
Diagnosis: N/A
Tuning: tcp_mssdflt is used if path MTU discovery is not enabled or path MTU discovery
fails to discover a path MTU. Limiting data to (MTU - 52) bytes ensures that,
where possible, only full packets will be sent. This is a run-time attribute.
Refer to: “TCP Maximum Segment Size tuning” on page 262
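For example, if the remote path is known to support a 1500-byte MTU, MTU - 52 gives a segment size
of 1448 bytes (the interface name is illustrative):
chdev -l en0 -a tcp_mssdflt=1448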
5. tcp_nodelay
Item Descriptor
Purpose: Specifies whether sockets that use TCP over this interface bypass the Nagle
algorithm when sending data (a value of 1 bypasses it). By default, TCP
follows the Nagle algorithm.
Values: Default: 0; Range: 0 or 1
Display: lsattr -El interface or ifconfig interface
Change: ifconfig interface tcp_nodelay NewValue OR chdev -l interface -a
tcp_nodelay=NewValue
The ifconfig command sets values temporarily, making it useful for testing.
The chdev command alters the ODM, so customized values persist across
system reboots.
Diagnosis: N/A
Tuning: This is an Interface-Specific Network Option (ISNO) option.
Refer to: “Interface-Specific Network Options” on page 245
6. tcp_recvspace
Item Descriptor
Purpose: Specifies the system default socket buffer size for receiving data. This affects the
window size used by TCP.
Values: Default: 16384 bytes
Display: lsattr -El interface or ifconfig interface
Change: ifconfig interface tcp_recvspace NewValue OR chdev -l interface -a
tcp_recvspace=NewValue
Diagnosis: N/A
Tuning: Setting the socket buffer size to 16 KB (16 384) improves performance over
standard Ethernet and Token-Ring networks. The default value is 16 384. Lower
bandwidth networks, such as Serial Line Internet Protocol (SLIP), or higher
bandwidth networks, such as Serial Optical Link, should have different optimum
buffer sizes. The optimum buffer size is the product of the media bandwidth and
the average round-trip time of a packet.
The tcp_recvspace attribute must specify a socket buffer size less than or equal
to the setting of the sb_max attribute. This is a dynamic attribute, but for
daemons started by the inetd daemon, run the following commands:
• stopsrc -s inetd
• startsrc -s inetd
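For example, for a 100 Mbps (12.5 MB per second) path with a 50 ms round-trip time, the optimum
size is roughly 12.5 MB/s x 0.05 s, or about 640 KB. A sketch of setting it (the interface name is
illustrative; sb_max must be at least as large, and rfc1323 must be 1 for windows larger than 64 KB):
no -o sb_max=1048576
ifconfig en0 tcp_recvspace 655360 rfc1323 1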
7. tcp_sendspace
Item Descriptor
Purpose: Specifies the system default socket buffer size for sending data.
Values: Default: 16384 bytes
Display: lsattr -El interface or ifconfig interface
Change: ifconfig interface tcp_sendspace NewValue OR chdev -l interface -a
tcp_sendspace=NewValue
The ifconfig command sets values temporarily, making it useful for testing.
The chdev command alters the ODM, so customized values persist across
system reboots.
Diagnosis: N/A
Tuning: This affects the window size used by TCP. Setting the socket buffer size
to 16 KB (16 384) improves performance over standard Ethernet and Token-
Ring networks. The default value is 16 384. Lower bandwidth networks,
such as Serial Line Internet Protocol (SLIP), or higher bandwidth networks,
such as Serial Optical Link, should have different optimum buffer sizes.
The optimum buffer size is the product of the media bandwidth and
the average round-trip time of a packet: optimum_window=bandwidth *
average_round_trip_time
The tcp_sendspace attribute must specify a socket buffer size less than or
equal to the setting of the sb_max attribute. The tcp_sendspace parameter
is a dynamic attribute, but for daemons started by the inetd daemon, run the
following commands:
• stopsrc -s inetd
• startsrc -s inetd
8. use_sndbufpool
Item Descriptor
Purpose: Specifies whether send buffer pools should be used for sockets.
Values: Default: 1
Display: netstat -m
Change: This option can be enabled by setting the value to 1 or disabled by setting the
value to 0.
Diagnosis: N/A
Tuning: This is a load-time, boolean option.
9. xmt_que_size
Item Descriptor
Purpose: Specifies the maximum number of send buffers that can be queued up for the
interface.
Values: Default: configuration-dependent
Display: lsattr -E -l interface_name
Change: Use the following command sequence:
ifconfig interface_name detach
chdev -l interface_name -a que_size_name=NewValue
ifconfig interface_name hostname up
The value cannot be changed while the interface is in use. Change is effective
across reboots.
NFS option tunable parameters
1. combehind
Item Descriptor
Purpose: Enables commit-behind behavior on the NFS client when writing very large files
over NFS Version 3 mounts.
Values: Default: 0; Range: 0 to 1
Display: mount
Change: mount -o combehind
Diagnosis: Poor throughput when writing very large files (primarily files larger than the
amount of system memory in the NFS client) over NFS Version 3 mounts.
2. nfsd Count
Item Descriptor
Purpose: Specifies the maximum number of NFS server threads that are created to service
incoming NFS requests.
Values: Default: 3891; Range: 1 to 3891
Display: ps -efa | grep nfsd
Change: chnfs -n NewValue
Change takes effect immediately and is permanent. The -N flag causes an
immediate, temporary change. The -I flag causes a change that takes effect at
the next boot.
Diagnosis: See nfs_max_threads
Tuning: See nfs_max_threads
Refer to: “Number of necessary biod threads” on page 315
3. numclust
Item Descriptor
Purpose: Used in conjunction with the combehind option to improve write throughput
performance when writing large files over NFS Version 3 mounts.
Values: Default: 128; Range: 8 to 1024
Display: mount
Diagnosis: Poor throughput when writing very large files (primarily files larger than the
amount of system memory in the NFS client) over NFS Version 3 mounts.
Tuning: Use this mount option on the NFS client if the primary use of NFS is to write
very large files to the NFS server. The value basically represents the minimum
number of pages for which VMM will attempt to generate a commit operation
from the NFS client. Too low a value can result in poor throughput due to an
excessive number of commits (each of which results in synchronous writes
on the server). Too high a value can also result in poor throughput due to
the NFS client memory filling up with modified pages which can cause the
LRU daemon to be invoked to start reclaiming pages. When the lrud runs,
V3 writes essentially become synchronous because each write ends up being
accompanied by a commit. This situation can be avoided by using the numclust
and combehind options.
Improve NFS client large file writing performance
The information in this how-to scenario was tested using specific versions of AIX. The results you obtain
might vary significantly depending on your version and level of AIX.
Assume the system is running an application that sequentially writes very large files (larger than the
amount of physical memory on the machine) to an NFS-mounted file system. The file system is mounted
using NFS V3. The NFS server and client communicate over a 100 Mbps Ethernet network.
When sequentially writing a small file, the throughput averages around 10 MB per second. However, when
writing a very large file, the throughput average drops to well under 1 MB per second.
The application's large file write is filling all of the client's memory, causing the rate of transfer to the NFS
server to decrease. This happens because the client AIX system must invoke the LRUD kproc to release
some pages in memory to accommodate the next set of pages being written by the application.
Use either of the following methods to detect if you are experiencing this problem:
• While a file is being written to the NFS server, run the nfsstat command periodically (every 10
seconds) by typing the following:
nfsstat
Check the nfsstat command output. If the number of V3 commit calls is increasing nearly linearly with
the number of V3 write calls, it is likely that this is the problem.
• Use the topas command (located in the bos.perf.tools fileset) to monitor the amount of data per
second that is being sent to the NFS server by typing the following:
topas -i 1
If either of the methods listed indicate that the problem exists, the solution is to use the new mount
command option called combehind when mounting the NFS server file system on the client system. Do
the following:
1. When the file system is not active, unmount it by typing the following:
unmount /mnt
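2. Remount the file system by using the combehind option. A sketch with illustrative server and path
names:
mount -o combehind server1:/export /mnt
The numclust option can be specified in the same way (for example, -o combehind,numclust=256).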
Streamline security subroutines with password indexing
The information in this how-to scenario was tested using specific versions of AIX. The results you obtain
might vary significantly depending on your version and level of AIX.
The scenario environment consists of one 2-way system used as a mail server. Mail is received remotely
through POP3 (Post Office Protocol Version 3) and by local mail clients that log in directly on the server.
Mail is sent by using the sendmail daemon. Because of the nature of a mail server, a high number of
security subroutines are called for user authentication. After moving from a uniprocessor machine to the
2-way system, the uptime command reported a load average of about 200, compared to less than 1 on
the uniprocessor machine.
To determine the cause of the performance degradation and reduce the amount of processor time used
for security subroutines, do the following:
1. Determine which processes consume a high percentage of processor time and whether the majority
of processor time is spent in kernel or user mode by running the following command (located in the
bos.perf.tools fileset):
topas -i 1
The topas command output in our scenario indicated that the majority of processor time, about 90%,
was spent in user mode and the processes consuming the most processor time were sendmail and
pop3d. (Had the majority of processor usage been kernel time, a kernel trace would be the appropriate
tool to continue.)
2. Determine whether the user-mode processor time is spent in application code (user) or in shared
libraries (shared) by running the following command to gather data for 60 seconds:
tprof -ske -x "sleep 60"
The tprof command lists the names of the subroutines called out of shared libraries and is sorted by
the number of processor ticks spent for each subroutine. The tprof data, in this case, showed most of
the processor time in user mode was spent in the libc.a system library for the security subroutines
(and those subroutines called by them). (Had the tprof command shown that user-mode processor
time was spent mostly in application code (user), then application debugging and profiling would have
been necessary.)
3. To avoid having the /etc/passwd file scanned with each security subroutine, create an index for it by
running the following command:
mkpasswd -f
By using an indexed password file, the load average for this scenario was reduced from a value of 200 to
0.6.
Example
The following example shows how an application can query the amount of BSR memory available on
a system, allocate and attach a BSR shared memory region, and then detach and delete the region.
The listing below is a minimal sketch: the vmgetinfo (VMINFO) query, the bsr_mem_free field, and the
SHM_BSR shmctl command are assumptions based on the BSR support described here.
#include <errno.h>
#include <stdio.h>
#include <sys/shm.h>
#include <sys/vminfo.h>

int main(void)
{
    struct vminfo info = { 0 };
    void *addr;
    int id;

    /* Query the amount of BSR memory available (bsr_mem_free is an assumed field name) */
    if (vmgetinfo(&info, VMINFO, sizeof(info)) != 0 || info.bsr_mem_free == 0) {
        fprintf(stderr, "no BSR memory available\n");
        return 1;
    }
    /* Allocate a region and request BSR backing before any process attaches */
    id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id == -1) { perror("shmget() failed"); return 3; }
    if (shmctl(id, SHM_BSR, NULL)) { perror("shmctl(SHM_BSR) failed"); return 4; }
    addr = shmat(id, NULL, 0);               /* attach the BSR region */
    if (addr != (void *)-1) shmdt(addr);     /* detach */
    shmctl(id, IPC_RMID, NULL);              /* delete the region */
    return 0;
}
To run an application with the hardware-accelerated zlibNX library, you can set one of the following
environment variables:
# LD_LIBRARY_PATH=/usr/opt/zlibNX/lib:$LD_LIBRARY_PATH <application>
# LIBPATH=/usr/opt/zlibNX/lib:$LIBPATH <application>
A disadvantage of setting the LD_LIBRARY_PATH variable is that the loader attempts to find all the
required libraries in the specified directory first.
If the applications load the libz.so.1 library dynamically by using the dlopen subroutine, you must
update the application with the path to the hardware-accelerated zlib library.
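For example, a minimal sketch that loads the accelerated library explicitly through the dlopen
subroutine (the zlibVersion symbol is part of the standard zlib interface; error handling is abbreviated):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    const char *(*zv)(void);
    void *handle;

    /* Load the hardware-accelerated zlib by its full path */
    handle = dlopen("/usr/opt/zlibNX/lib/libz.so.1", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    zv = (const char *(*)(void))dlsym(handle, "zlibVersion");
    if (zv != NULL)
        printf("loaded zlib version %s\n", zv());
    dlclose(handle);
    return 0;
}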
deflateInit2
Initializes the compression operation. You can specify the following parameters with this function:
level
You can specify one of the following values for the level parameter:
Z_NO_COMPRESSION
Compression falls back to software-based compression.
Z_BEST_SPEED
Currently, this level is treated as the Z_DEFAULT_COMPRESSION level, which is the default
level.
Z_BEST_COMPRESSION
Levels 1 - 9 are treated as level 6, that is, the Z_DEFAULT_COMPRESSION level, which is the
default level.
method
You can specify one of the following values for the method parameter:
Z_DEFLATED
If the method parameter has a value other than Z_DEFLATED, compression falls back to
software-based compression.
windowBits
Indicates the size of the internal history buffer. Accepts value of the base 2 logarithm. The size
of the history buffer (window size) can be in the range 256 bytes - 32 KB. The windowBits
parameter impacts memory that is allocated for the internal input memory buffer. You can specify
values in the following range for the windowBits parameter:
8 - 15
Indicates the zlib format.
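As an illustration, a minimal sketch of initializing a stream with these parameters through the standard
zlib interface (the level and memLevel values are illustrative; windowBits 15 selects the zlib format
described above):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    z_stream strm;
    int rc;

    memset(&strm, 0, sizeof(strm));   /* zalloc, zfree, and opaque default to Z_NULL */
    /* level, method, windowBits, memLevel, strategy */
    rc = deflateInit2(&strm, Z_BEST_SPEED, Z_DEFLATED, 15, 8, Z_DEFAULT_STRATEGY);
    if (rc != Z_OK) {
        fprintf(stderr, "deflateInit2 failed: %d\n", rc);
        return 1;
    }
    /* ... deflate() calls would go here ... */
    deflateEnd(&strm);
    return 0;
}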
deflateSetDictionary
Initializes the compression dictionary with values specified in byte sequence (dictionary
parameter). The compression dictionary must consist of strings that might be encountered later in
the data that must be compressed.
deflateGetDictionary
Returns sliding dictionary (history buffer) that is maintained by the deflate function. Sliding
dictionary is a memory buffer containing uncompressed data.
deflateCopy
Copies a stream and duplicates the internal compression state. The destination stream might not be
able to access the accelerator.
deflateReset
deflateReset(z_streamp strm)
Resets the state of a stream without freeing and reallocating internal compression state.
deflateParams
Supports the deflate function in the zlibNX library based on input parameters. Compression level
does not change.
deflateTune
Refines the internal compression parameters of the deflate function. The compression operation
falls back to software-based compression.
deflateBound(z_streamp strm, uLong sourceLen)
Returns the maximum size of the compressed data after the deflate operation is performed on the
sourceLen bytes that is passed to the deflateBound function.
deflatePending
Returns the output that is generated but not yet displayed in the output of the compression operation.
deflatePrime
Inserts bits in the deflate output stream. This function can be used to start the deflate output at a
bit position within a byte.
inflateSetDictionary
Initializes the decompression dictionary from the specified decompressed byte sequence.
inflateGetDictionary
Returns the sliding dictionary (history buffer) that is maintained by the inflate function.
inflateSync(z_streamp strm)
Skips compressed data during the inflate operation until a possible full flush point is reached. If a full
flush point is not reached, the inflate operation skips all the available input data. A full flush point
refers to a location in compressed data that was generated when the deflate() function was called
with the flush parameter set to the Z_FULL_FLUSH value.
inflateCopy
Copies a stream and duplicates the internal decompression state.
inflateReset(z_streamp strm)
Resets the state of a stream without freeing the internal decompression state.
inflateReset2
The inflateReset2 function is similar to the inflateReset function, but it also allows changing
the wrap and window size values. The windowBits parameter is used similarly as it is used in the
inflateInit2 function. Size of the history buffer (window size) is static at 32 KB, but the format that
is indicated by the windowBits parameter is used for decompression.
inflatePrime
Inserts bits in the inflate input stream. This function is used to start the inflate operation at a bit
position within a byte. The compression operation falls back to software-based decompression.
inflateMark(z_streamp strm)
Marks a location for random access to data in the input stream, specified as a bit position in the
data stream. Returns two 16-bit values as the upper and lower halves of a 32-bit result.
inflateGetHeader
Requests that gzip header information be stored in the supplied structure during the inflate
operation.
inflateBackInit()
Initializes the internal stream for the decompression operation by calling the inflateBack function.
The decompression operation falls back to software-based decompression. This function is supported
only at the start of a stream.
inflateBack
Performs a raw inflate operation by using a call-back interface for input and output data. The
decompression operation falls back to software-based decompression. This function must be called
only after calling the inflateBackInit function.
inflateBackEnd
inflateBackEnd(z_streamp strm)
Frees all memory that is allocated by the inflateBackInit function. The decompression operation
falls back to software-based decompression.
Nest accelerators
The POWER9™ processor-based servers support on-chip accelerators that perform functions such
as compression, decompression, encryption, and decryption of data. One of these accelerators
implements GZIP functions, for example, GZIP compression and decompression operations.
Applications in an AIX logical partition can access the GZIP accelerator by using the zlib interface.
Starting from IBM AIX 7.2 with Technology Level 4, applications can access GZIP accelerator by using the
zlibNX library. Based on environment variables and availability of a GZIP accelerator, the zlibNX library
uses the hardware accelerated compression and decompression functions, when appropriate.
The hardware accelerators are shared between multiple logical partitions. Therefore, a flow control
mechanism is required to prevent the logical partitions from using more than the allocated amount of the
accelerator. This flow control mechanism is based on an allocation unit called credit. Credits are allocated
to each logical partition. Before the zlibNX library can send an access request to the GZIP accelerator
for an application, the zlibNX library must open a communication channel with the GZIP accelerator. An
available credit is required to open a communication channel with the GZIP accelerator. Otherwise, the
request fails.
An accelerator request on a communication channel takes up the associated credit and the credit is
restored when the accelerator operation is complete. Thus, only one operation can be active at a time on
a communication channel.
Each logical partition is allocated the following types of credits:
• Default credits: Each logical partition has a number of default credits that is based on the CPU
entitlement of the logical partition.
• Quality of service (QoS) credits: Privileged applications can reserve QoS credits for exclusive use, as
described in the subroutines that follow.
nx_config_query() subroutine
Purpose
Returns the configuration information about a specific accelerator type.
Note: Currently, NX_GZIP_TYPE is the only accelerator type that is supported. You must specify
NX_GZIP_TYPE in the accel_type parameter.
Description
The nx_config_query subroutine copies the configuration data from the requested type of accelerator
into the memory buffer that is defined by the buf_address and buf_size variables.
If an application must control the credit information for the individual units of an accelerator type, the
nx_config_query subroutine can also return the configuration data for each accelerator unit.
The structure of the data in the memory buffer is accelerator type-dependent but always starts with the
following common structure:
struct nx_config_com {
uint32_t ncc_version; /* version number */
uint32_t ncc_res1; /* Reserved - padding */
uint64_t ncc_gencount; /* Generation count */
#ifdef __64BIT__
nx_accel_unit_t *ncc_accel_buf_addr; /* unit info array address */
#else
uint32_t ncc_res2; /* Not used - must be 0 */
nx_accel_unit_t *ncc_accel_buf_addr; /* unit info array address */
#endif
uint32_t ncc_accel_buf_size; /* unit info array size in bytes */
uint32_t ncc_avail_for_use_credits; /* Total number of credits the *
* caller can potentially access *
* to send work to the accel. */
uint32_t ncc_avail_for_res_credits; /* Max number of credits the *
* caller can reserve for *
* exclusive use. *
* (0 for non-privileged callers)*/
uint32_t ncc_total_num_units; /* Total number of accelerator units */
uint32_t ncc_num_units_in_buf; /* # of units described in buffer */
uint32_t ncc_res3[17]; /* Reserved for future extension */
};
For the GZIP accelerator, this common structure is embedded in the GZIP-specific nx_gzip_config
data structure.
The <sys/nx.h> header file defines fields that facilitate access to the ngc_com structure that is
embedded in the nx_gzip_config structure.
The ngc_num_units_in_buf field contains the number of elements in the accelerator unit array. This
number can be less than the ngc_total_num_units value if the ngc_accel_buf_size value is not sufficient
for all the data.
In any structure, the avail_for_res credit value represents the number of quality of service (QoS) credits
that a privileged application can reserve for exclusive use. These values (at the accelerator type and
accelerator unit level) are 0 if the calling application is not a privileged application. The avail_for_use
credit value represents the number of credits that the calling application can access. If the calling
application is a privileged application and contains reserved credits for exclusive use, the avail_for_use
credit value is the number of QoS credits in the exclusive pool of the caller application. Otherwise, the
avail_for_use credit value is the sum of all the default credits and the balance of the QoS credits in the
shared pool that is not reserved for exclusive access.
Return values
0
The configuration of the accelerator has changed and the configuration data is copied to the memory
buffer.
1
The configuration of the accelerator has not changed and the configuration data is not copied to the
memory buffer.
-1
An error is detected. The errno variable is set to indicate the type of error:
• ENOTSUP: Nest accelerators are not available to the logical partition.
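A minimal sketch of calling this subroutine follows. The three-argument signature is an assumption
inferred from the accel_type, buf_address, and buf_size parameters described above, and the
ngc_avail_for_use_credits field name is assumed from the ncc_avail_for_use_credits member of the
embedded nx_config_com structure; only ngc_total_num_units appears verbatim in the text:

#include <stdio.h>
#include <string.h>
#include <sys/nx.h>

int main(void)
{
    struct nx_gzip_config cfg;   /* GZIP-specific wrapper around nx_config_com */
    int rc;

    memset(&cfg, 0, sizeof(cfg));
    /* Assumed signature: accelerator type, buffer address, buffer size */
    rc = nx_config_query(NX_GZIP_TYPE, &cfg, sizeof(cfg));
    if (rc == -1) {
        perror("nx_config_query() failed");
        return 1;
    }
    printf("units: %u, credits available for use: %u\n",
           (unsigned)cfg.ngc_total_num_units,
           (unsigned)cfg.ngc_avail_for_use_credits);
    return 0;
}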
nx_get_excl_access() subroutine
Purpose
Reserves a specific number of quality of service (QoS) credits for exclusive use.
Syntax
#include <sys/nx.h>
int nx_get_excl_access(nx_accel_type_t accel_type,
uint32_t flags,
int number_of_credits,
nx_unit_id_t unit_id);
Note: Currently, NX_GZIP_TYPE is the only accelerator type that is supported. You must specify
NX_GZIP_TYPE in the accel_type parameter.
Description
A privileged application can use this subroutine to reserve a specific number of QoS credits for exclusive
use. The credits can be requested from a specific accelerator unit that can be identified by its unit ID.
The unit_id value is in the configuration data of each accelerator unit section that is returned by the
nx_config_query subroutine.
A special value, NX_ANY_UNIT, can be used instead of the unit_id value if the application has no
preference as to where the credits are allocated. Another special value, NX_ALL_CREDITS, can be used
to request all the accelerator credits that are allocated to the specified accelerator unit. The special value,
NX_ALL_CREDITS, can also be used along with NX_ANY_UNIT value to request all the accelerator credits
that are allocated to the logical partition.
A privileged application that uses the zlib interface uses this subroutine to directly reserve a number of
GZIP accelerator credits. The subsequent data compression and decompression requests use this pool of
reserved QoS credits.
This subroutine checks that the calling application has either root authority or the PV_KER_NXFR
privilege; if it does not, the subroutine returns -1 and sets the errno variable to EPERM. When this
subroutine runs successfully, it returns a positive value that indicates the number of allocated credits.
If a specific number of credits are requested and sufficient QoS credits are not available to satisfy the
request, the request fails, and no credits are allocated. If the NX_ALL_CREDITS value is requested, the
subroutine returns the number of QoS credits that are allocated. If the QoS credits are not available, it
returns an error. Requesting 0 credits is also flagged as an error.
The reserved credits are released through the nx_rel_exclusive_access subroutine or released
automatically when the application exits.
Return values
>0
Indicates success. The value indicates the number of credits that are allocated for exclusive access.
0
This subroutine does not return this value.
-1
An error is detected. The errno variable is set to indicate the type of error:
• ENOTSUP: Logical partition cannot access a nest accelerator.
nx_rel_excl_access() subroutine
Purpose
Releases a specific number of reserved quality of service (QoS) credits.
Syntax
#include <sys/nx.h>
int nx_rel_excl_access(nx_accel_type_t accel_type,
uint32_t flags,
int number_of_credits,
nx_unit_id_t unit_id);
Note: Currently, NX_GZIP_TYPE is the only accelerator type that is supported. You must specify
NX_GZIP_TYPE in the accel_type parameter.
Description
A privileged application can use this subroutine to release a specific number of QoS credits that the
application had reserved for exclusive use. The arguments are the same as the nx_get_excl_access
subroutine.
Similar to the nx_get_excl_access subroutine, the special values NX_ALL_CREDITS and
NX_ANY_UNIT can be used for the nx_rel_excl_access subroutine.
Note: The nx_rel_excl_access subroutine releases the credits even if they are in use, that is, even
when the application is running a job on the accelerator. In this case, the corresponding operation is
stopped.
Return values
0
Indicates success. The requested number of credits (or all credits if the NX_ALL_CREDITS value is
specified) are released.
-1
An error is detected. The errno variable is set to indicate the type of error:
• ENOTSUP: Logical partition cannot access a nest accelerator.
• EPERM: Calling application does not have the correct privilege level.
• EINVAL: Invalid flags, invalid number of credits, or invalid unit ID.
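For example, a minimal sketch that reserves two QoS credits and then releases all reserved credits,
using the signatures shown above (the flags value of 0 is an assumption; the program must run with
root authority or the PV_KER_NXFR privilege):

#include <stdio.h>
#include <sys/nx.h>

int main(void)
{
    int ncredits;

    /* Reserve two QoS GZIP credits from any accelerator unit */
    ncredits = nx_get_excl_access(NX_GZIP_TYPE, 0, 2, NX_ANY_UNIT);
    if (ncredits == -1) {
        perror("nx_get_excl_access() failed");
        return 1;
    }
    printf("%d credits reserved\n", ncredits);

    /* ... compression work that uses the reserved pool ... */

    /* Release everything that was reserved */
    if (nx_rel_excl_access(NX_GZIP_TYPE, 0, NX_ALL_CREDITS, NX_ANY_UNIT) == -1) {
        perror("nx_rel_excl_access() failed");
        return 2;
    }
    return 0;
}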
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at
Copyright and trademark information at www.ibm.com/legal/copytrade.shtml.
INFINIBAND, InfiniBand Trade Association, and the INFINIBAND design marks are trademarks and/or
service marks of the INFINIBAND Trade Association.
The registered trademark Linux is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the
United States, other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or
its affiliates.
UNIX is a registered trademark of The Open Group in the United States and other countries.