AIX Performance Updates 010609
Tools & Tunables
AIX 5.3 TL07, TL08, TL09
AIX 6.1 TL01, TL02
Steve Nasypany [email protected] IBM Advanced Technical Support
Agenda
SMT POWER5 vs POWER6
AIX 5 vs AIX 6 Tunables Framework
VMM Tunings
AIX 5.3 Tunables Updates
Shared Ethernet
Dedicated Processor Donation
Virtual Shared Pools
AIX 5.3 TL-09
nmon in AIX
Topas VIOS/Adapter/MPIO
svmon Reports
POWER6 p575 & 595
Agenda
AIX 6.1 TL01 Workload Partitions Support
ps, ipcs, netstat, proc*, trace, vmstat, topas, tprof, filemon, netpmon, pprof, curt
Separate presentations available to cover WPAR specifics
Restricted Tunables
IO pacing
AIO
CIO
NFS biod
JFS2 nolog
Multiple Page Size Segments - svmon
iostat/topas - Filesystem and Workload Partition breakdowns (AIX 6)
AIX 6.1 TL02
topas Memory Pool and Shared Ethernet monitoring
svmon Reports
filemon Reports
mpstat/sar WPAR support
tprof Large Page and Data profiling
Generally, expect perhaps a 1% impact from running in SMT mode in micro-partitions on POWER6
Example code from Northwestern University Minebench 1.0
Shows the ratio of the test running in a micro-partition in SMT mode / ST mode
[Chart: SMT/ST runtime ratio for the Minebench ScalParc test in a micro-partition, approximately 0.994]
These tunings are universally recommended for AIX 5.3 (and AIX 5.2, but there limit cache to no more than 24 GB)
Set-and-forget: lru_file_repage = 0 protects computational memory, always steal from cache
No paging to the paging space will occur unless system memory is overcommitted (AVM > 97%)
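As a hedged illustration, these settings can be applied persistently with vmo. The minperm%/maxperm%/maxclient% values below are the commonly published AIX 5.3 recommendations and are assumptions here, not necessarily the exact numbers from this slide:

# vmo -p -o lru_file_repage=0 -o minperm%=3 -o maxperm%=90 -o maxclient%=90
# vmo -L lru_file_repage                <- verify current, default and reboot values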
lru_file_repage=0 Issues
But now my system shows ~100% memory usage
The new memory model results in free memory being consumed by cache
AIX does not actively scrub cache, as that is expensive overhead; AIX only looks for memory when it needs it
Customers do not know how to assess whether additional workloads can be added without causing physical paging
There is no trivial method for knowing how much cache is optimal or active for a given workload (options on next slide)
If the system is paging to page space with these settings, you are memory bound
First, make sure you don't have a memory leak
If you have to live with this workload, optimize your paging space
Add paging spaces and spread them out across disks
Use paging spaces of equal sizes
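For illustration only, a hedged sketch of adding equally sized paging spaces (the volume group, disk names and 64-LP size are placeholders):

# mkps -a -n -s 64 pagevg hdisk2        <- -n activates now, -a activates at each restart
# mkps -a -n -s 64 pagevg hdisk3        <- equal size, on a different disk
# lsps -a                               <- verify sizes and %Used stay balanced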
New Tunables
psm_timeout_interval = 5000
A vmo tunable that determines the timeout interval, in milliseconds, to wait for the page size management daemons to make forward progress before LRU page replacement is started. Only valid on the 64-bit kernel.
Default: 5000 ms (5 seconds). Possible values: 0 through 60,000 (1 minute).
When page size management is working to increase the number of page frames of a particular page size, LRU page replacement is delayed for that page size for up to this amount of time.
On a heavily loaded system, increasing this tunable can give the page size management daemons more time to create page frames before LRU runs. Basically, 64 KB page migrations can cause a deadlock between lrud and psmd.
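A hedged example of raising the interval on a heavily loaded system (the 10000 ms value is purely illustrative):

# vmo -o psm_timeout_interval=10000     <- value in milliseconds
# vmo -L psm_timeout_interval           <- display current, default and range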
New Tunables
JFS2 Sync Tunables (TL08)
The file system sync operation can be problematic in situations where there is very heavy random I/O activity to a large file. When a sync occurs, all reads and writes from user programs to the file are blocked. With a large number of dirty pages in the file, the time required to complete the writes to disk can be large. New JFS2 tunables are provided to relieve that situation.
New Tunables
j2_syncPageCount
Limits the number of modified pages that are scheduled to be written by sync in one pass for a file. When this tunable is set, the file system will write the specified number of pages without blocking I/O to the rest of the file. The sync call will iterate on the write operation until all modified pages have been written.
Default: 0 (off), Range: 0-65536, Type: Dynamic, Unit: 4KB pages

j2_syncPageLimit
Overrides j2_syncPageCount when a threshold is reached, to guarantee that sync will eventually complete for a given file. Not applied if j2_syncPageCount is off.
Default: 16, Range: 1-65536, Type: Dynamic, Unit: Numeric

If application response times are impacted by syncd, try j2_syncPageCount settings from 256 to 1024. Smaller values improve short-term response times, but still result in larger syncs that impact response times over larger intervals. These will likely require a lot of experimentation and detailed analysis of I/O.
Does not apply to mmap() or shmat() memory files.
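A hedged starting point for that experimentation with ioo (256 is just the low end of the suggested range):

# ioo -o j2_syncPageCount=256           <- write up to 256 4KB pages per pass before unblocking I/O
# ioo -a | grep j2_sync                 <- review both sync tunables and their current values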
New Tunables
proc_disk_stats (TL08)
There is a single process-wide structure that is updated for each I/O
The structure is protected by a single lock: pv_lock_d
The more threads doing heavy I/O, the higher the potential for lock contention
Contention should be easily visible with the splat lock analysis tool
Default behavior is not changed; turn this off when process-scope disk statistics are not required
Encountered in DB2 TPC-C benchmark tests
New Tunables
large_receive (TL08)
Shared Ethernet
The 10 Gig adapter's LRO ("large receive offload") feature is enabled by default, and this may cause problems for a system configuration where a Shared Ethernet Adapter is bridging traffic for Linux LPARs (which cannot receive packets larger than their MTU).
SEA provides its own "large_receive" attribute, defaulted to "no", which disables the feature in the underlying real adapter to avoid such problems out of the box.
The user has the choice to override this and set the SEA's attribute to "yes" to enable the large receive feature in the underlying device (if available), overriding the device's own large_receive attribute setting.
The SEA large_receive setting is dynamic as long as the adapter's large_receive was enabled at boot; otherwise the adapter has to be recycled to support the SEA change.
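As one hedged example on the VIOS (padmin shell), assuming ent8 is the SEA and no Linux LPARs are bridged behind it:

$ lsdev -dev ent8 -attr                     <- check the current large_receive setting
$ chdev -dev ent8 -attr large_receive=yes   <- enable large receive in the underlying adapter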
seastat
$ seastat -?
Usage: seastat -d <device name> -c
       seastat -d <device name> [-n | -s searchtype=value]
$ chdev -dev ent8 -attr accounting=enabled
ent8 changed
$ seastat -d ent8
=============================================================================
Advanced Statistics for SEA
Device Name: ent8
=============================================================================
MAC: A6:3C:00:09:33:04
VLAN: None
VLAN Priority: None
Hostname: js22aix.aixncc.uk.ibm.com
IP: 9.69.44.177

Transmit Statistics:                Receive Statistics:
--------------------                -------------------
Packets: 8                          Packets: 18
Bytes: 646                          Bytes: 1103
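The search form can narrow the per-client statistics; a hedged example using the IP address from the sample above (assuming ip is one of the supported searchtype values):

$ seastat -d ent8 -s ip=9.69.44.177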
# lparstat 1 3
System configuration: type=Dedicated mode=Donating smt=On lcpu=2 mem=800

%user  %sys  %wait  %idle  physc    vcsw
-----  ----  -----  -----  -----  ------
  0.1   0.4    0.0   99.5   0.68  670234
  0.0   0.2    0.0   99.8   0.68  670234
  0.0   0.2    0.0   99.8   0.68  670234
Donation causes hardware context switches
physc shows actual physical processor consumption: the number of physical processors minus donated and stolen cycles
# lparstat -d
System configuration: type=Dedicated mode=Donating smt=On lcpu=2 mem=800

%user  %sys  %wait  %idle  %idon  %bdon  %istol  %bstol
-----  ----  -----  -----  -----  -----  ------  ------
  0.1   0.2    2.1   97.7  12.79    6.8     4.8    2.75
# lparstat -dh
System configuration: type=Dedicated mode=Capped smt=On lcpu=2 mem=800

%user  %sys  %wait  %idle  %hypv  hcalls  %istol  %bstol
-----  ----  -----  -----  -----  ------  ------  ------
  0.1   0.2    2.1   97.7    0.0     391     4.8    2.75
idon, bdon: percentages of idle and busy times donated
istol, bstol: percentages of idle and busy times stolen
mpstat
automatically displays pc and lcs if donation is enabled
new -h option to show more details on hypervisor-related statistics

donation enabled
System configuration: lcpu=2 mode=Donating
cpu    pc    ilcs        vlcs  idon  bdon
  0   0.3   50327   687231635  10.2   4.5
  1   0.5   61702   684989764  10.2   4.5
ALL   0.8  112029  1372221399  20.4   9.0

donation disabled
System configuration: lcpu=2 mode=Capped
cpu    pc    ilcs        vlcs  istol  bstol
  0   0.3  503727   687231635   0.59   0.32
  1  0.41   61702   684989764   0.59   0.32
ALL  0.71  565429  1372221399   1.18   0.64

shared partition
System configuration: lcpu=2 ent=0.5 mode=Uncapped
cpu    pc    ilcs        vlcs
  0   0.6  503727   687231635
  1   0.6   61702   684989764
ALL   0.8  565429  1372221399
Partition CPU Utilization
Online Virtual CPUs: 1    Online Logical CPUs: 2
%user  %sys  %wait  %idle  %hypv  hcalls  %istl  %bstl  %idon  %bdon  vcsw
    1     1      0     98      1     200      0    2.1    3.5   10.0   1.0
[Screen callout: donated cycles (%idon, %bdon)]
Default format is hard to read with many hdisks; use the -l option for wide output
Earlier AIX 5.3 levels may report sqfull as a delta, but APAR fixes convert it to a rate, so values will be much smaller
Can't exceed queue_depth for the disk; if this is often > 0, then increase queue_depth
Service Time Goals
Reads: < 20 msecs
Writes with cache: < 2 msecs
Writes without cache: < 10 msecs
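A hedged example of checking and raising queue_depth for one disk (hdisk4 and the value 32 are placeholders; valid maximums depend on the disk and driver, and the device usually must be free, or stage the change with -P and reboot):

# iostat -Dl 5 3                            <- watch sqfull and read/write service times per disk
# lsattr -El hdisk4 -a queue_depth          <- current setting
# chdev -l hdisk4 -a queue_depth=32 -P      <- stage the change for the next boot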
All attributes of a pool can be changed dynamically
LPARs can be re-assigned to different pools dynamically
Software Requirements
eFW 3.2 or later
AIX 5.3 TL07 or later
AIX 6.1 or later
Surprisingly, many customers do not seem to be prepared for monitoring the shared pool
Make sure at least one partition on the CEC can do pool monitoring!
This is required for lparstat to see free pool resources; topas gets around it because it can collect data from remote agents and calculate the pool utilization itself
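One hedged way to confirm a partition has pool visibility: enable "Allow performance information collection" in its HMC processor properties, then check from that partition:

# lparstat -i | grep -i pool        <- shows the Shared Pool ID and pool size details
# lparstat 2 5                      <- with pool authority, the app column reports available pool processors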
'nmon' in AIX
Can be started by running the command 'nmon' or topas_nmon
Can be started by pressing ~ from the topas screen
./topas_nmon -h
Hint: topas_nmon [-h] [-s <seconds>] [-c <count>] [-f -d -t -r <name>] [-x]
Command: TOPAS-NMON
  -h            FULL help information - much more than here
Interactive-Mode: read startup banner and type: "h" once it is running
For Data-Collect-Mode (-f)
  -f            spreadsheet output format [note: default -s300 -c288]
  optional -s <seconds>  between refreshing the screen [default 2]
  -c <number>   of refreshes [default millions]
  -t            spreadsheet includes top processes
  -x            capacity planning (15 min for 1 day = -fdt -s 900 -c 96)
For Interactive-Mode
  -s <seconds>  between refreshing the screen [default 2]
  -c <number>   of refreshes [default millions]
  -g <filename> User decided Disk Groups - file = on each line: group_name <hdisk_list>
                space separated - like: rootvg hdisk0 hdisk1 hdisk2 - upto 32 groups
                hdisks can appear more than once
  -b            black and white [default is colour]
  -B            no boxes [default is show boxes]
example: topas_nmon -s 1 -c 100
Memory Throttling
Larger DIMMs will be throttled; no tools can see this
Implemented in POWER6 p575 and p595 platforms
Not expected to be a major issue, but the lack of measurement capability is a concern
AIX 6.1
AIX 6.1 TL01 Workload Partitions Support
ps, ipcs, netstat, proc*, trace, vmstat, topas, tprof, filemon, netpmon, pprof, curt
Separate presentations available to cover WPAR specifics
Restricted Tunables
IO pacing
AIO
CIO
NFS biod
JFS2 nolog
Multiple Page Size Segments - svmon
iostat/topas - Filesystem and Workload Partition breakdowns (AIX 6)
AIX 6.1 TL02
topas Memory Pool and Shared Ethernet monitoring
filemon Reports
mpstat/sar WPAR support
tprof Large Page and Data profiling
Performance Tunables
Tunables are now in two categories
Restricted Tunables
Should not be changed unless recommended by AIX development or development support
Are not shown by tuning commands unless the -F flag is used
A dynamic change will show a warning message
A permanent change must be confirmed
Permanent changes will cause an error log entry at boot time
Non-Restricted Tunables
Can have restricted tunables as dependencies
A permanent change of a restricted tunable requires confirmation from the user.
Note: The system will log changes to restricted tunables in the system error log at boot time.
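For instance, the behavior can be seen with vmo (any restricted tunable behaves the same way; lru_file_repage is just a convenient example since it is restricted in AIX 6.1):

# vmo -F -a                         <- -F includes restricted tunables in the listing
# vmo -p -o lru_file_repage=0       <- permanent change to a restricted tunable: prompts for
                                       confirmation and is logged in errpt at the next boot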
Description
RESTRICTED TUNABLES MODIFIED AT REBOOT

Probable Causes
SYSTEM TUNING

User Causes
TUNABLE PARAMETER OF TYPE RESTRICTED HAS BEEN MODIFIED

Recommended Actions
REVIEW TUNABLE LISTS IN DETAILED DATA

Detail Data
LIST OF TUNABLE COMMANDS CONTROLLING MODIFIED RESTRICTED TUNABLES AT REBOOT,
SEE FILE /etc/tunables/lastboot.log
Implementation Considerations
Best Practices
Do not apply legacy tuning, since some tunables may now be restricted
If you do an upgrade install, your old tunings will be preserved; you may wish to undo them, but we won't make you
This level of tuning has been applied to numerous AIX 5.3 customers through field support, and we are confident it was a good thing
However, we try to never change defaults in the service stream, so AIX 5.3 remains as it was
Change restricted tunables only if recommended by AIX support
New defaults
Not very aggressive; intended to keep one or a few programs from impacting system responsiveness
Values are high enough not to impact sequential write performance
maxpout = 8193
minpout = 4096
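The system-wide values live on sys0; a hedged example of inspecting and setting them (the numbers are simply the new defaults listed above):

# lsattr -El sys0 -a maxpout -a minpout
# chdev -l sys0 -a maxpout=8193 -a minpout=4096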
AIO Support
Interface Changes
All the AIO entries in the ODM and the AIO smit panels have been removed
The aioo command will no longer be shipped
All the AIO tunables have current, default, minimum and maximum values that can be viewed with ioo
The AIO kernel extension is loaded at system boot
Applications no longer fail to run because you forgot to load the kernel extension (you may applaud here)
No AIO servers are active until requests are present
Extremely low impact on memory requirements with this implementation
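A hedged sketch of looking at the new AIO interface through ioo (the grep patterns assume the aio_* and posix_aio_* tunable names):

# ioo -a | grep aio                 <- list legacy and POSIX AIO tunables with current values
# ioo -o aio_maxservers             <- display a single tunable
# ps -k | grep aio                  <- aioserver kernel processes appear only once requests arrive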
[Diagram: AIO request flow - Application -> File System -> AIO Server (no file system fast path) -> Device Driver]
In kdb, 'u <slotnumber>' and then, for each file listed there, 'file <filepointer>' gives some information
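A hedged sketch of that kdb sequence; the slot number and file pointer below are placeholders, not real values:

# kdb
(0)> u 620                      <- show the u-area for slot 620, including its open file pointers
(0)> file 0xF10001001C025600    <- show the file structure behind one of the listed file pointers
(0)> quit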
[Chart: NFS sequential read throughput, 32 biods vs 4 biods]
Overlapping more compute with network traffic through more biods greatly improves throughput
Same model as the previous chart, with the krb5p (full packet encryption) mount option
[Chart: NFS sequential read throughput with krb5p, 32 biods vs 4 biods]
PHP Wikibench
[Chart: PHP Wikibench results comparing Default, log, and nolog mount options]
ExtPage
[iostat output: avg-cpu line (% user, % sys, % idle, % iowait, physc, % entc) and per-disk statistics (Kbps, tps, Kb_read, Kb_wrtn), followed by the per-filesystem breakdown below]
================================================================================
FileSystem    KBPS    TPS   KB-R   KB-W  Open  Create  Lock
/tmp          47.0  967.0    0.0   47.0     0       0     0
/var          10.0    0.0  202.0    0.0     0       0     0
/usr           0.0    0.0    0.0    0.0     0       0     0
/              0.0    0.0    0.0    0.0     0       0     0
/home          0.0    0.0    0.0    0.0     0       0     0
/audit         0.0    0.0    0.0    0.0     0       0     0
/admin         0.0    0.0    0.0    0.0     0       0     0
/proc          0.0    0.0    0.0    0.0     0       0     0
/opt           0.0    0.0    0.0    0.0     0       0     0
Invoking mpstat inside a WPAR to view statistics for all the processors in the system
Invoking sar inside a WPAR to view all processor statistics. The red-circled CPU ID, prefixed with '*', indicates that the CPU is associated with the RSET used by the WPAR
Memory Reference and Allocation counts Memory References, Allocations summary by process Memory References by Modeled regions
Performance Projections of Memory Translation Misses by modeled regions for various page sizes
Summary section which reports the % of data access for each data region in the process
Detail by Data Structure Name and the subroutines that accessed those data structures
Trademarks
The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.
Not all common law marks used by IBM are listed on this page. Failure of a mark to appear does not mean that IBM does not use the mark, nor does it mean that the product is not actively marketed or is not significant within its relevant market. Those trademarks followed by ® are registered trademarks of IBM in the United States; all others are trademarks or common law marks of IBM in the United States.