Solaris 10 System Internals
Richard McDougall
Distinguished Engineer Sun Microsystems, Inc [email protected]
Jim Mauro
Senior Staff Engineer Sun Microsystems, Inc [email protected]
This tutorial is copyright 2006 by Richard McDougall and James Mauro. It may not be used in whole or part for commercial purposes without the express written consent of Richard McDougall and James Mauro.
[email protected] [email protected]
Copyright 2006 Richard McDougall & James Mauro Usenix '06 Boston, Massachusetts
Agenda
Session 1 - 9:00AM to 10:30AM
> Goals, non-goals and assumptions
> OpenSolaris
> Solaris 10 Kernel Overview
> Solaris 10 Features
> The Tools of the Trade
Agenda
Session 3 - 2:00PM to 3:30PM
> Processes, Threads, Priorities & Scheduling
> Performance & Observability: Load, apps & the kernel
> Processor Controls and Binding
> Resource Pools, Projects & Zones

Goals
> Correlate performance & observability to key functions
> Resource control & management framework
Non-goals
> Detailed look at core kernel algorithms
> Networking internals
Assumptions
> General familiarity with the Solaris environment
> General familiarity with operating systems concepts
OpenSolaris
An open source operating system providing for community collaboration and development
Source code released under the Common Development & Distribution License (CDDL, pronounced "cuddle")
Based on the Nevada Solaris code-base (Solaris 10+)
Core components initially, other components will follow over time
> ZFS!
90% of system activity falls into one of the above categories, for a variety of roles
> Admins, DBAs, Developers, etc...
System View
> Resource usage/utilization
> CPU, Memory, Network, IO
Process View
> Execution profile
> Where's the time being spent?
> May lead to a thread view
Amdahl's Law
In general terms, defines the expected speedup of a system when part of the system is improved As applied to multiprocessor systems, describes the expected speedup when a unit of work is parallelized
> Factors in degree of parallelization
S = 1 / (F + (1 - F) / N)

where:
S is the speedup
F is the fraction of the work that is serialized
N is the number of processors

Examples:
F = 0.5, N = 4:  S = 1 / (0.5 + 0.5/4)  = 1.6  (4 processors, 50% of the work is serialized)
F = 0.25, N = 4: S = 1 / (0.25 + 0.75/4) = 2.3  (4 processors, 25% of the work is serialized)
Tightly Integrated File System & Virtual Memory
Virtual File System
64-bit kernel
> 32-bit and 64-bit application support
Solaris 7, 8, 9, 10, ...

[Diagram: ILP32 Apps / ILP32 Libs / ILP32 Kernel / ILP32 Drivers running on 32-bit HW]
/dev/poll for scalable I/O
Modular debugging with mdb(1)
You want statistics?
[Diagram: two-level threading model — user threads ride on LWPs; each LWP is paired with a kernel thread, and the dispatcher schedules kernel threads onto processors]
Solaris 9
A Subset of the 300+ New Features

Manageability
Solaris Containers, Solaris 9 Resource Manager, IPQoS, Solaris Volume Manager (SVM), Soft Disk Partitions, Filesystem for DBMS, UFS Snapshots, Solaris Flash, Solaris Live Upgrade 2.0, Patch Manager, Product Registry, Sun ONE DS integration, Legacy directory proxy, Secure LDAP client, Solaris WBEM Services, Solaris instrumentation, FRU ID, Sun Management Center

Availability
Solaris Live Upgrade 2.0, Dynamic Reconfiguration, Sun StorEdge Traffic Manager Software, IP Multipathing, Reconfiguration Coordination Manager, Driver Fault Injection Framework, Mobile IP, Reliable NFS, TCP timers, PCI and cPCI hot-swap

Security
IPSec v4 and v6, SunScreen Firewall, Enhanced RBAC, Kerberos V5, IKE, PAM enhancements, Secure Sockets Layer (SSL), Solaris Secure Shell, Extensible password encryption, Security Toolkit, TCP Wrappers, Kernel and user-level encryption frameworks, Random number generator, SmartCard APIs

Scalability
IPv6, Thread enhancements, Memory optimization, Advanced page coloring, Memory Placement Optimization, Multiple Page Size Support, Hotspot JVM tuning, NFS performance increase, UFS Direct I/O, Dynamic System Domains, Enhanced DNLC, Solaris RSM API, J2SE 1.4 software with 64-bit and IPv6, NCA enhancements

. . . and more: Compatibility Guarantee, Java Support, Linux Compatibility, Network Services, G11N and Accessibility, GNOME Desktop
Solaris 10
The Headline Grabbers
Solaris Containers (Zones)
Solaris Dynamic Tracing (DTrace)
Predictive Self Healing
> Service Management Facility (SMF)
> Fault Management Architecture (FMA)
Process Rights Management
Premier x86 support
Optimized 64-bit Opteron support (x64)
Zettabyte File System (ZFS)
... and much, much more!
[Diagram: Solaris 10 kernel overview — the VFS layer over UFS, NFS, ProcFS, SpecFS, ..., the TCP/IP stack, and the sd/ssd disk drivers]
System Stats
acctcom - process accounting
busstat - bus hardware counters
cpustat - CPU hardware counters
iostat - IO & NFS statistics
kstat - display kernel statistics
mpstat - processor statistics
netstat - network statistics
nfsstat - nfs server stats
sar - kitchen sink utility
vmstat - virtual memory stats

Process Tracing
abitrace - trace ABI interfaces
dtrace - trace the world
mdb - debug/control processes
truss - trace functions and system calls

Process control
pgrep - grep for processes
pkill - kill processes list
pstop - stop processes
prun - start processes
prctl - view/set process resources
pwait - wait for process
preap - reap a zombie process

Kernel
dtrace - trace and monitor kernel
lockstat - monitor locking statistics
lockstat -k - profile kernel
mdb - debug live and kernel cores
Solaris 10 Dynamic Tracing - DTrace

"[expletive deleted] It's like they saw inside my head and gave me The One True Tool."
- A Slashdotter, in a post referring to DTrace

"With DTrace, I can walk into a room of hardened technologists and get them giggling."
- Bryan Cantrill, Inventor of DTrace
DTrace
Solaris Dynamic Tracing An Observability Revolution
Seamless, global view of the system from user-level thread to kernel
Not reliant on pre-determined trace points, but dynamic instrumentation
Data aggregation at source minimizes postprocessing requirements
Built for live use on production systems
DTrace
Solaris Dynamic Tracing An Observability Revolution
Ease-of-use and instant gratification engenders serious hypothesis testing
Instrumentation directed by a high-level control language (not unlike AWK or C) for easy scripting and command line use
Comprehensive probe coverage and powerful data management allow for concise answers to arbitrary questions
DTrace Components
Probes
> A point of instrumentation
> Has a name (string) and a unique probe ID (integer)
> provider:module:function:name
Providers
> DTrace-specific facilities for managing probes and the underlying instrumentation
Consumers
> A process that interacts with dtrace
> typically dtrace(1)
Using dtrace
> Command line dtrace(1)
> Scripts written in the 'D' language
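A first taste from the command line (an illustrative one-liner in the same spirit as the examples that follow, not from the original deck) — count system calls by executable name until Ctrl-C:

# dtrace -n 'syscall:::entry { @[execname] = count(); }'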
[Diagram: DTrace architecture — consumers such as dtrace(1M) sit atop the in-kernel DTrace framework, which hosts providers such as sysinfo, vminfo, proc and syscall]
DTrace
Built-in variables
> pid, tid, execname, probefunc, timestamp, zoneid, etc
Predicates
> Conditional expression before taking action
Aggregations
> process collected data at the source
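Putting these pieces together (a minimal sketch using only the built-ins above; "filebench" is just an example process name): a predicate restricts the action, and an aggregation processes the data at the source:

syscall:::entry
/execname == "filebench"/          /* predicate: only fire for this program */
{
        @calls[probefunc] = count();   /* aggregate by system call name */
}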
The D language
D is a C-like language specific to DTrace, with some constructs similar to awk(1)
Complete access to kernel C types
Complete access to statics and globals
Complete support for ANSI-C operators
Support for strings as a first-class type
We'll introduce D features as we need them...
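As a small illustration of strings and kernel types used directly (a sketch, not from the deck), this clause prints the argument string of every successful exec via the psinfo_t-typed curpsinfo built-in:

proc:::exec-success
{
        trace(curpsinfo->pr_psargs);   /* pr_psargs: the process argument string */
}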
DTrace D scripts
usenix> cat syscalls_pid.d
#!/usr/sbin/dtrace -s

dtrace:::BEGIN
{
        vtotal = 0;
}

/*
 * A complete probe clause: a probe description, a predicate, and an
 * action that sets a thread-local variable.
 */
syscall:::entry
/pid == $target/
{
        self->vtime = vtimestamp;
}

syscall:::return
/self->vtime/
{
        @vtime[probefunc] = sum(vtimestamp - self->vtime);
        vtotal += (vtimestamp - self->vtime);
        self->vtime = 0;
}

dtrace:::END
{
        normalize(@vtime, vtotal / 100);
        printa(@vtime);
}
[Sample output: each system call's share (percent) of total system call time for the target process]
DTrace Providers

Providers manage groups of probes that are related in some way
Created as part of the DTrace framework to enable dtracing key subsystems without an intimate knowledge of the kernel
> vminfo - statistics on the VM subsystem
> syscall - entry and return points for all system calls (args available at entry probes)
> sched - key events in the scheduler
> io - disk IO tracing
> sysinfo - kstat sys statistics
> mib - network stack probing
> pid - instrumenting user processes
> fbt - function boundary tracing (kernel functions); args available as named types at entry (args[0] ... args[n])
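For example, the io provider's typed arguments can be used directly (an illustrative sketch; args[0] is the bufinfo_t describing the I/O):

io:::start
{
        @bytes[execname] = sum(args[0]->b_bcount);   /* bytes of disk I/O per program */
}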
Aggregations
When trying to understand suboptimal performance, one often looks for patterns that point to bottlenecks
When looking for patterns, one often doesn't want to study each datum; one wishes to aggregate the data and look for larger trends
Traditionally, one has had to use conventional tools (e.g. awk(1), perl(1)) to post-process reams of data
DTrace supports aggregation of data as a first-class operation
Aggregations, cont.
An aggregation is the result of an aggregating function keyed by an arbitrary tuple For example, to count all system calls on a system by system call name:
dtrace -n 'syscall:::entry \
    { @syscalls[probefunc] = count(); }'
Aggregations, cont.
Aggregations need not be named Aggregations can be keyed by more than one expression For example, to count all ioctl system calls by both executable name and file descriptor:
dtrace -n 'syscall::ioctl:entry \
    { @[execname, arg0] = count(); }'
Aggregations, cont.
Some other aggregating functions:
> avg() - the average of specified expressions
> min() - the minimum of specified expressions
> max() - the maximum of specified expressions
> count() - number of times the probe fired
> quantize() - power-of-two distribution
> lquantize() - linear frequency distribution
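A short sketch combining several of these (illustrative; times are in nanoseconds) — measure read(2) latency with avg(), max() and a quantize() histogram:

syscall::read:entry
{
        self->ts = timestamp;           /* per-thread start time */
}

syscall::read:return
/self->ts/
{
        @a["avg ns"] = avg(timestamp - self->ts);
        @m["max ns"] = max(timestamp - self->ts);
        @dist = quantize(timestamp - self->ts);   /* power-of-two distribution */
        self->ts = 0;
}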
Aggregations
# dtrace -n 'syscall::write:entry { @[execname] = quantize(arg2); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C
  in.rshd
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@                               16
               2 |@@@@                                     6
               4 |@@@                                      4
               8 |                                         0
              16 |@@@@@                                    7
              32 |@@@                                      4
              64 |@@@@@@@@@@@@@@@                          23
             128 |@                                        1
             256 |                                         0

  cat
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
             512 |                                         0
            1024 |                                         0
            2048 |@@@@@@@@@@@@@                            1
            4096 |                                         0
DTrace
The Solaris Dynamic Tracing Observability Revolution
Not just for diagnosing problems
Not just for kernel engineers
Not just for service personnel
Not just for application developers
Not just for system administrators
Serious fun
Not to be missed!
Extensive support for debugging of processes
/etc/crash and adb removed
Symbol information via compressed typed data
Documentation
mdb(1)
dcmds
> ::dcmds -l for a list
> expression::dcmd, e.g. 0x300acde123::ps
walkers
> ::walkers for a list
> expression::walk <walker_name>, e.g. ::walk cpu
macros
> !ls /usr/lib/adb for a list
> expression$<macro, e.g. cpu0$<cpu
Pipelines
> expression, dcmd or walk output can be piped
> ::walk <walk_name> | ::dcmd, e.g. ::walk cpu | ::print cpu_t
Linked Lists
> address::list <type> <member>
> e.g. 0x70002400000::list page_t p_vpnext
Modules
> Modules in /usr/lib/mdb, /usr/platform/lib/mdb, etc
> mdb can use adb macros
> Developer Interface - write your own dcmds and walkers
> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH
  0 0000180c000  1b    0    0  37   no    no t-0
  1 30001b78000  1b    0    0  27   no    no t-0
  4 30001b7a000  1b    0    0  59   no    no t-0
  5 30001c18000  1b    0    0  59   no    no t-0
  8 30001c16000  1b    0    0  37   no    no t-0
  9 30001c0e000  1b    0    0  59   no    no t-1
 12 30001c06000  1b    0    0  -1   no    no t-0
 13 30001c02000  1b    0    0  27   no    no t-1

> 30001b78000::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH
  1 30001b78000  1b    0    0  -1   no    no t-3
       |
       RUNNING <--+
         READY
        EXISTS
        ENABLE

> 30001b78000::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  1 30001b78000  1b    0    0  27   no    no t-1    300132c5900 threads
       |
       RUNNING <--+
         READY
        EXISTS
        ENABLE

> 300132c5900::findstack
stack pointer for thread 300132c5900: 2a1016dd1a1
  000002a1016dd2f1 user_rtt+0x20()
Kernel Statistics
Procfs Tools
Observability (and control) for active processes through a pseudo file system (/proc) Extract interesting bits of information on running processes Some commands work on core files as well
pargs pflags pcred pldd psig pstack pmap pfiles pstop prun pwait ptree ptime preap*
*why do Harry Cooper & Ben wish they had preap?
psig
sol8$ psig $$
15481:  -zsh
HUP     caught  0
INT     blocked,caught  0
QUIT    blocked,ignored
ILL     blocked,default
TRAP    blocked,default
ABRT    blocked,default
EMT     blocked,default
FPE     blocked,default
KILL    default
BUS     blocked,default
SEGV    blocked,default
SYS     blocked,default
PIPE    blocked,default
ALRM    blocked,caught  0
TERM    blocked,ignored
USR1    blocked,default
USR2    blocked,default
CLD     caught  0
PWR     blocked,default
WINCH   blocked,caught  0
URG     blocked,default
POLL    blocked,default
STOP    default
pstack
sol8$ pstack 5591
5591:   /usr/local/mozilla/mozilla-bin
----------------- lwp# 1 / thread# 1 --------------------
 fe99a254 poll     (513d530, 4, 18)
 fe8dda58 poll     (513d530, fe8f75a8, 18, 4, 513d530, ffbeed00) + 5c
 fec38414 g_main_poll (18, 0, 0, 27c730, 0, 0) + 30c
 fec37608 g_main_iterate (1, 1, 1, ff2a01d4, ff3e2628, fe4761c9) + 7c0
 fec37e6c g_main_run (27c740, 27c740, 1, fe482b30, 0, 0) + fc
 fee67a84 gtk_main (b7a40, fe482874, 27c720, fe49c9c4, 0, 0) + 1bc
 fe482aa4 ???????? (d6490, fe482a6c, d6490, ff179ee4, 0, ffe)
 fe4e5518 ???????? (db010, fe4e5504, db010, fe4e6640, ffbeeed0, 1cf10)
 00019ae8 ???????? (0, ff1c02b0, 5fca8, 1b364, 100d4, 0)
 0001a4cc main     (0, ffbef144, ffbef14c, 5f320, 0, 0) + 160
 00014a38 _start   (0, 0, 0, 0, 0, 0) + 5c
----------------- lwp# 2 / thread# 2 --------------------
 fe99a254 poll     (fe1afbd0, 2, 88b8)
 fe8dda58 poll     (fe1afbd0, fe840000, 88b8, 2, fe1afbd0, 568) + 5c
 ff0542d4 ???????? (75778, 2, 3567e0, b97de891, 4151f30, 0)
 ff05449c PR_Poll  (75778, 2, 3567e0, 0, 0, 0) + c
 fe652bac ???????? (75708, 80470007, 7570c, fe8f6000, 0, 0)
 ff13b5f0 Main__8nsThreadPv (f12f8, ff13b5c8, 0, 0, 0, 0) + 28
 ff055778 ???????? (f5588, fe840000, 0, 0, 0, 0)
 fe8e4934 _lwp_start (0, 0, 0, 0, 0, 0)
pfiles
sol8$ pfiles $$
15481:  -zsh
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:118,0 ino:459678 uid:36413 gid:7 rdev:24,11
      O_RDWR
   1: S_IFCHR mode:0620 dev:118,0 ino:459678 uid:36413 gid:7 rdev:24,11
      O_RDWR
   2: S_IFCHR mode:0620 dev:118,0 ino:459678 uid:36413 gid:7 rdev:24,11
      O_RDWR
   3: S_IFDOOR mode:0444 dev:250,0 ino:51008 uid:0 gid:0 size:0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[328]
  10: S_IFCHR mode:0620 dev:118,0 ino:459678 uid:36413 gid:7 rdev:24,11
      O_RDWR|O_LARGEFILE
pfiles
solaris10> pfiles 26337
26337:  /usr/lib/ssh/sshd
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDWR|O_LARGEFILE
      /devices/pseudo/mm@0:null
   1: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDWR|O_LARGEFILE
      /devices/pseudo/mm@0:null
   2: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDWR|O_LARGEFILE
      /devices/pseudo/mm@0:null
   3: S_IFDOOR mode:0444 dev:279,0 ino:59 uid:0 gid:0 size:0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[93]
      /var/run/name_service_door
   4: S_IFSOCK mode:0666 dev:276,0 ino:36024 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK
        SOCK_STREAM
        SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(49880)
        sockname: AF_INET6 ::ffff:129.154.54.9  port: 22
        peername: AF_INET6 ::ffff:129.150.32.45  port: 52002
   5: S_IFDOOR mode:0644 dev:279,0 ino:55 uid:0 gid:0 size:0
      O_RDONLY FD_CLOEXEC  door to keyserv[179]
      /var/run/rpc_door/rpc_100029.1
  ....
pgrep
sol8$ pgrep -u rmc
481
480
478
482
483
484
.....
prstat(1)
top-like utility to monitor running processes Sort on various thresholds (cpu time, RSS, etc) Enable system-wide microstate accounting
> Monitor time spent in each microstate
truss(1)
Trace the system calls of a process/command
Extended to support user-level APIs (-u, -U)
Can also be used for profile-like functions (-D, -E)
Is thread-aware as of Solaris 9 (pid/lwp_id)
usenix> truss -c -p 2556
^C
syscall               seconds   calls  errors
read                     .013    1691
pread                    .015    1691
pread64                  .056     846
                     --------  ------   ----
sys totals:              .085    4228      0
usr time:                .014
elapsed:                7.030

usenix> truss -D -p 2556
/2:  0.0304 pread(11, "02\0\0\001\0\0\0\n c\0\0".., 256, 0)
/2:  0.0008 read(8, "1ED0C2 I", 4)
/2:  0.0005 read(8, " @C9 b @FDD4 EC6", 8)
/2:  0.0006 pread(11, "02\0\0\001\0\0\0\n c\0\0".., 256, 0)
/2:  0.0134 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0".., 8192,
/2:  0.0006 pread(11, "02\0\0\001\0\0\0\n c\0\0".., 256, 0)
/2:  0.0005 read(8, "D6 vE5 @", 4)
/2:  0.0005 read(8, "E4CA9A -01D7AAA1", 8)
/2:  0.0006 pread(11, "02\0\0\001\0\0\0\n c\0\0".., 256, 0)
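dtrace(1M) can gather a similar per-syscall summary without stopping the target on each call (an illustrative one-liner; pid 2556 matches the truss example above):

# dtrace -n 'syscall:::entry /pid == 2556/ { @[probefunc] = count(); }'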
lockstat(1M)
Provides for kernel lock statistics (mutex locks, reader/writer locks) Also serves as a kernel profiling tool Use -i 971 for the interval to avoid collisions with the clock interrupt, and gather fine-grained data
#lockstat -i 971 sleep 300 > lockstat.out
trapstat(1)
Solaris 9, Solaris 10 (and beyond...) Statistics on CPU traps
> Very processor architecture specific
#trapstat -t
cpu m| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
-----+-------------------------------+-------------------------------+----
  0 u|       360  0.0         0  0.0 |       324  0.0         0  0.0 | 0.0
  0 k|        44  0.0         0  0.0 |     21517  1.1       175  0.0 | 1.1
-----+-------------------------------+-------------------------------+----
  1 u|      2680  0.1         0  0.0 |     10538  0.5        12  0.0 | 0.6
  1 k|       111  0.0         0  0.0 |     11932  0.7       196  0.1 | 0.7
-----+-------------------------------+-------------------------------+----
  4 u|      3617  0.2         2  0.0 |     28658  1.3       187  0.0 | 1.5
  4 k|        96  0.0         0  0.0 |     14462  0.8       173  0.1 | 0.8
-----+-------------------------------+-------------------------------+----
  5 u|      2157  0.1         7  0.0 |     16055  0.7      1023  0.2 | 1.0
  5 k|        91  0.0         0  0.0 |     12987  0.7       142  0.0 | 0.7
-----+-------------------------------+-------------------------------+----
  8 u|      1030  0.1         0  0.0 |      2102  0.1         0  0.0 | 0.2
  8 k|       124  0.0         1  0.0 |     11452  0.6        76  0.0 | 0.6
-----+-------------------------------+-------------------------------+----
  9 u|      7739  0.3        15  0.0 |    112351  4.9       664  0.1 | 5.3
  9 k|        78  0.0         3  0.0 |     65578  3.2      2440  0.6 | 3.8
-----+-------------------------------+-------------------------------+----
 12 u|      1398  0.1         5  0.0 |      8603  0.4       146  0.0 | 0.5
 12 k|       156  0.0         4  0.0 |     13471  0.7       216  0.1 | 0.8
-----+-------------------------------+-------------------------------+----
 13 u|       303  0.0         0  0.0 |       346  0.0         0  0.0 | 0.0
 13 k|        10  0.0         0  0.0 |     27234  1.4       153  0.0 | 1.4
=====+===============================+===============================+====
 ttl |     19994  0.1        37  0.0 |    357610  2.1      5603  0.2 | 2.4
vmstat(1)
> Memory statistics
> Don't forget vmstat -p for per-page-type statistics
netstat(1)
> Network packet rates
> Use with care; it does induce a probe effect
iostat(1)
> Disk I/O statistics
> Rates (IOPS), bandwidth, service times
sar(1)
> The kitchen sink
cputrack(1)
Gather CPU hardware counters, per process
solaris> cputrack -N 20 -c pic0=DC_access,pic1=DC_miss -p 19849
   time lwp      event       pic0      pic1
  1.007   1       tick   34543793    824363
  1.007   2       tick          0         0
  1.007   3       tick 1001797338   5153245
  1.015   4       tick  976864106   5536858
  1.007   5       tick 1002880440   5217810
  1.017   6       tick  948543113   3731144
  2.007   1       tick   15425817    745468
  2.007   2       tick          0         0
  2.014   3       tick 1002035102   5110169
  2.017   4       tick  976879154   5542155
  2.030   5       tick 1018802136   5283137
  2.033   6       tick 1013933228   4072636
  ......
solaris> bc -l
824363/34543793
.02386428728310177171
((100-(824363/34543793)))
99.97613571271689822829
              unix`kcopy+0x38
              genunix`copyin_nowatch+0x48
              genunix`copyin_args32+0x45
              genunix`syscall_entry+0xcb
              unix`sys_syscall32+0xe1
                1

              unix`sys_syscall32+0xae
                1

              unix`mutex_exit+0x19
              ufs`rdip+0x368
              ufs`ufs_read+0x1a6
              genunix`fop_read+0x29
              genunix`pread64+0x1d7
              unix`sys_syscall32+0x101
                2

              unix`kcopy+0x2c
              genunix`uiomove+0x17f
              ufs`rdip+0x382
              ufs`ufs_read+0x1a6
              genunix`fop_read+0x29
              genunix`pread64+0x1d7
              unix`sys_syscall32+0x101
               13
R/W writer blocked by writer: 17 events in 30.031 seconds (1 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
   17 100% 100% 0.00   465308 0xffffffff831f3be0     ufs_getpage+0x369
-------------------------------------------------------------------------------

R/W writer blocked by readers: 55 events in 30.031 seconds (2 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
   55 100% 100% 0.00  1232132 0xffffffff831f3be0     ufs_getpage+0x369
-------------------------------------------------------------------------------

R/W reader blocked by writer: 22 events in 30.031 seconds (1 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
   18  82%  82% 0.00    56339 0xffffffff831f3be0     ufs_getpage+0x369
    4  18% 100% 0.00    45162 0xffffffff831f3be0     ufs_putpages+0x176
-------------------------------------------------------------------------------

R/W reader blocked by write wanted: 47 events in 30.031 seconds (2 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
   46  98%  98% 0.00   369379 0xffffffff831f3be0     ufs_getpage+0x369
    1   2% 100% 0.00   118455 0xffffffff831f3be0     ufs_putpages+0x176
-------------------------------------------------------------------------------
Example 2
mpstat(1)
solaris10> mpstat 2
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw  syscl usr sys wt idl
  0    3   0    10   345  219   44    0    1    3    0     28   0   0  0  99
  1    3   0     5    39    1   65    1    2    1    0     23   0   0  0 100
  2    3   0     3    25    5   22    1    1    2    0     25   0   1  0  99
  3    3   0     3    19    0   27    1    2    1    0     22   0   0  0  99
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw  syscl usr sys wt idl
  0    4   0 11565 14115  228 7614 1348 2732 3136 1229 255474  10  28  0  61
  1    0   0 10690 14411   54 7620 1564 2546 2900 1182 229899  10  28  0  63
  2    0   0 10508 14682    6 7714 1974 2568 2917 1222 256806  10  29  0  60
  3    0   0  9438 14676    0 7284 1582 2362 2622 1126 249150  10  30  0  60
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw  syscl usr sys wt idl
  0    0   0 11570 14229  224 7608 1278 2749 3218 1251 254971  10  28  0  61
  1    0   0 10838 14410   63 7601 1528 2669 2992 1258 225368  10  28  0  62
  2    0   0 10790 14684    6 7799 2009 2617 3154 1299 231452  10  28  0  62
  3    0   0  9486 14869    0 7484 1738 2397 2761 1175 237387  10  28  0  62
CPU minf mjf  xcal  intr ithr  csw icsw migr smtx  srw  syscl usr sys wt idl
  0    0   0 10016 12580  224 6775 1282 2417 2694  999 269428  10  27  0  63
  1    0   0  9475 12481   49 6427 1365 2229 2490  944 271428  10  26  0  63
  2    0   0  9184 12973    3 6812 1858 2278 2577  985 231898   9  26  0  65
  3    0   0  8403 12849    0 6382 1428 2051 2302  908 239172   9  25  0  66
...
prstat(1)
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 21487 root      603M   87M sleep   29   10   0:01:50  35% filebench/9
 21491 morgan   4424K 3900K cpu2    59    0   0:00:00 0.0% prstat/1
   427 root       16M   16M sleep   59    0   0:08:40 0.0% Xorg/1
 21280 morgan   2524K 1704K sleep   49    0   0:00:00 0.0% bash/1
 21278 morgan   7448K 1888K sleep   59    0   0:00:00 0.0% sshd/1
   489 root       12M 9032K sleep   59    0   0:03:05 0.0% dtgreet/1
 21462 root      493M 3064K sleep   59    0   0:00:01 0.0% filebench/2
   209 root     4132K 2968K sleep   59    0   0:00:13 0.0% inetd/4
   208 root     1676K  868K sleep   59    0   0:00:00 0.0% sac/1
   101 root     2124K 1232K sleep   59    0   0:00:00 0.0% syseventd/14
   198 daemon   2468K 1596K sleep   59    0   0:00:00 0.0% statd/1
   113 root     1248K  824K sleep   59    0   0:00:00 0.0% powerd/2
   193 daemon   2424K 1244K sleep   59    0   0:00:00 0.0% rpcbind/1
   360 root     1676K  680K sleep   59    0   0:00:00 0.0% smcboot/1
   217 root     1760K  992K sleep   59    0   0:00:00 0.0% ttymon/1
Total: 48 processes, 160 lwps, load averages: 1.32, 0.83, 0.43
prstat(1) Threads
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
 21495 root      603M   86M sleep   11   10   0:00:03 2.8% filebench/4
 21495 root      603M   86M sleep    3   10   0:00:03 2.8% filebench/3
 21495 root      603M   86M sleep   22   10   0:00:03 2.8% filebench/7
 21495 root      603M   86M sleep   60   10   0:00:03 2.7% filebench/5
 21495 root      603M   86M cpu1    21   10   0:00:03 2.7% filebench/8
 21495 root      603M   86M sleep   21   10   0:00:03 2.7% filebench/2
 21495 root      603M   86M sleep   12   10   0:00:03 2.7% filebench/9
 21495 root      603M   86M sleep   60   10   0:00:03 2.6% filebench/6
 21462 root      493M 3064K sleep   59    0   0:00:01 0.1% filebench/1
 21497 morgan   4456K 3924K cpu0    59    0   0:00:00 0.0% prstat/1
 21278 morgan   7448K 1888K sleep   59    0   0:00:00 0.0% sshd/1
   427 root       16M   16M sleep   59    0   0:08:40 0.0% Xorg/1
 21280 morgan   2524K 1704K sleep   49    0   0:00:00 0.0% bash/1
   489 root       12M 9032K sleep   59    0   0:03:05 0.0% dtgreet/1
   514 root     3700K 2812K sleep   59    0   0:00:02 0.0% nscd/14
Total: 48 processes, 159 lwps, load averages: 1.25, 0.94, 0.51
prstat(1) - Microstates
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 21495 root     6.1  15 0.0 0.0 0.0  51  26 1.9 11K  4K .7M   0 filebench/7
 21495 root     5.7  14 0.0 0.0 0.0  53  26 1.7  9K  4K .6M   0 filebench/3
 21495 root     5.4  13 0.1 0.0 0.0  54  26 1.8 10K  4K .6M   0 filebench/5
 21495 root     5.2  13 0.0 0.0 0.0  54  26 1.8  9K  4K .6M   0 filebench/4
 21495 root     5.2  13 0.0 0.0 0.0  55  26 1.7  9K  4K .6M   0 filebench/6
 21495 root     4.7  12 0.0 0.0 0.0  56  25 1.8  9K  4K .5M   0 filebench/9
 21495 root     4.4  11 0.0 0.0 0.0  57  26 1.6  8K  3K .5M   0 filebench/8
 21495 root     4.1  11 0.0 0.0 0.0  58  26 1.6  7K  3K .4M   0 filebench/2
 21499 morgan   0.0 0.1 0.0 0.0 0.0 0.0 100 0.0  17   2 311   0 prstat/1
   427 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  18   4  72   9 Xorg/1
   489 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  26   1  45   0 dtgreet/1
   471 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   2   6   0 snmpd/1
     7 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0   5   0 svc.startd/6
 21462 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  13   0   5   0 filebench/2
   514 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   0  47   0 nscd/23
Total: 48 processes, 159 lwps, load averages: 1.46, 1.03, 0.56
DTrace - xcalls

              SUNW,UltraSPARC-II`send_one_mondo+0x20
              SUNW,UltraSPARC-II`send_mondo_set+0x1c
              unix`xt_some+0xc4
              unix`xt_sync+0x3c
              unix`hat_unload_callback+0x6ec
              unix`bp_mapout+0x74
              genunix`biowait+0xb0
              ufs`ufs_putapage+0x3f4
              ufs`ufs_putpages+0x2a4
              genunix`segmap_release+0x300
              ufs`ufs_dirremove+0x638
              ufs`ufs_remove+0x150
              genunix`vn_removeat+0x264
              genunix`unlink+0xc
              unix`syscall_trap+0xac
            17024

              SUNW,UltraSPARC-II`send_one_mondo+0x20
              SUNW,UltraSPARC-II`send_mondo_set+0x1c
              unix`xt_some+0xc4
              unix`sfmmu_tlb_range_demap+0x190
              unix`hat_unload_callback+0x6d4
              unix`bp_mapout+0x74
              genunix`biowait+0xb0
              ufs`ufs_putapage+0x3f4
              ufs`ufs_putpages+0x2a4
              genunix`segmap_release+0x300
              ufs`ufs_dirremove+0x638
              ufs`ufs_remove+0x150
              genunix`vn_removeat+0x264
              genunix`unlink+0xc
              unix`syscall_trap+0xac
            17025
lockstat(1M)
Provides for kernel lock statistics (mutex locks, reader/writer locks) Also serves as a kernel profiling tool Use -i 971 for the interval to avoid collisions with the clock interrupt, and gather fine-grained data
#lockstat -i 971 sleep 300 > lockstat.out #lockstat -i 971 -I sleep 300 > lockstatI.out
Session 2 - Memory
Virtual Memory
Simple programming model/abstraction
Fault isolation
Security
Management of physical memory
Sharing of memory objects
Caching
Page Lists

Free List
> does not have a vnode/offset associated
> put on list at process exit
> may be always small (pre Solaris 8)

Cache List
> still have a vnode/offset
> seg_map free-behind and seg_vn executables and libraries (for reuse)
> reclaims are in vmstat "re"

[Diagram: pages of mapped files move between the cache list and the free list; a reclaim moves a cache-list page back into its mapping]
Page Scanning
Steals pages when memory is low
Uses a Least Recently Used (LRU) policy
Puts memory out to "backing store"
Kernel thread does the scanning
[Diagram: two-handed clock — one hand clears reference bits on each memory page, the other checks them; unreferenced modified pages are written to backing store]
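Whether the scanner is running at all can be confirmed with the vminfo provider's scan statistic (an illustrative one-liner; count scanner activity until Ctrl-C):

# dtrace -n 'vminfo:::scan { @["pages scanned"] = count(); }'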
Page-out Thread

[Diagram: pageout_scanner() examines pages via checkpage(); modified pages are queued by queue_io_request() onto the dirty page push list, and the page-out thread pushes them through the file system or specfs vop_putpage() routine before freeing them; unmodified pages are freed directly. schedpaging() decides how many pages to scan and how much CPU to use.]
Scanning Algorithm
Runs when free memory is lower than lotsfree
Starts scanning at slowscan (pages/sec)
Scanner runs:
> four times / second when memory is short > Awoken by page allocator if very low
Limits:
> Max # of pages /sec. swap device can handle > How much CPU should be used for scanning
scanrate = ((lotsfree - freemem) / lotsfree) x fastscan + (freemem / lotsfree) x slowscan
Scanning Parameters

Parameter        Description                                                Min       Default (Solaris 8)
lotsfree         scanner starts stealing anonymous memory pages             512K      1/64th of memory
desfree          scanner is started at 100 times/second                     minfree   1/2 of lotsfree
minfree          start scanning every time a new page is created            -         1/2 of desfree
throttlefree     page_create routine makes the caller wait until
                 free pages are available                                   -         minfree
fastscan         scan rate (pages per second) when free memory = minfree    slowscan  minimum of 64MB/s or 1/2 memory size
slowscan         scan rate (pages per second) when free memory = lotsfree   -         100
maxpgio          max number of pages per second that the swap
                 device can handle                                          ~60       60 or 90 pages per spindle
handspreadpages  number of pages between the front hand (clearing)
                 and back hand (checking)                                   1         fastscan
min_percent_cpu  CPU usage when free memory is at lotsfree                  -         4% of a single CPU
Scan Rate

[Figure: 1 GB example — the scan rate ramps linearly from slowscan (100 pages/sec) when free memory is at lotsfree (16 MB), through 8 MB and 4 MB, up to fastscan (8192 pages scanned/sec) as free memory approaches 0 MB]
Cache List: pages with a valid vnode/offset
Free List: pages with no vnode/offset

> Unmapped pages which were just released
> Non-dirty pages, not mapped, should be on the "free list"
> Pages are placed on the "tail" of the cache/free list
> Free memory = cache + free
Observability

> Free memory now contains the file system cache: higher free memory
> vmstat 'free' column is meaningful
> Easier visibility for memory shortages: scan rate != 0 means a memory shortage
> Correct defaults: no tuning required — delete all /etc/system VM parameters!
Memory Summary
Physical Memory:
# prtconf
System Configuration:  Sun Microsystems  sun4u
Memory size: 512 Megabytes
Kernel Memory:
# sar -k 1 1
SunOS ian 5.8 Generic_108528-03 sun4u    08/28/01
13:04:58  sml_mem    alloc  fail    lg_mem     alloc  fail
13:04:59 10059904  7392775     0 133349376  92888024
Free Memory:
# vmstat 3 3
 procs     memory            page             disk        faults       cpu
 r b w   swap    free  re mf pi po fr de sr f0 s0 s1 s6   in   sy   cs us sy id
 0 0 0 478680 204528   0  2  0  0  0  0  0  0  0  1  0   209 1886  724 35  5 61
 0 0 0 415184 123400   0  2  0  0  0  0  0  0  0  0  0   238  825  451  2  1 98
 0 0 0 415200 123416   0  0  0  0  0  0  0  0  3  0  0   219  788  427  1  1 98
vmstat
r = run queue length
b = processes blocked waiting for I/O
w = idle processes that have been swapped at some time
swap = free and unreserved swap in KBytes
free = free memory measured in pages
re = kilobytes reclaimed from cache/free list
mf = minor faults - the page was in memory but was not mapped
pi = kilobytes paged-in from the file system or swap device
po = kilobytes paged-out to the file system or swap device
fr = kilobytes that have been destroyed or freed
de = kilobytes freed after writes
sr = pages scanned / second
s0-s3 = disk I/Os per second for disk 0-3
in = interrupts / second
sy = system calls / second
cs = context switches / second
us = user cpu time
sy = kernel cpu time
id = idle + wait cpu time

# vmstat 5 5
 procs     memory             page             disk         faults        cpu
 r b w   swap     free  re  mf pi po fr de sr f0 s0 s1 s2   in     sy    cs us sy id
 0 0 0 46580232 337472  18 194 30  0  0  0  0  0  0  0  0 5862  81260 28143 19  7 74
 0 0 0 45311368 336280  32 249 48  0  0  0  0  0  0  0  0 6047  93562 29039 21 10 69
 0 0 0 46579816 337048  12 216 60  0  0  0  0  0 10  0  7 5742 100944 27032 20  7 73
 0 0 0 46580128 337176   3 111  3  0  0  0  0  0  0  0  0 5569  93338 26204 21  6 73
vmstat -p
swap = free and unreserved swap in KBytes
free = free memory measured in pages
re = kilobytes reclaimed from cache/free list
mf = minor faults - the page was in memory but was not mapped
fr = kilobytes that have been destroyed or freed
de = kilobytes freed after writes
sr = kilobytes scanned / second
executable pages: kilobytes in - out - freed
anonymous pages: kilobytes in - out - freed
file system pages: kilobytes in - out - freed

# vmstat -p 5 5
     memory           page          ...   anonymous        filesystem
     swap    free  re  de  sr  ...  api  apo  apf    fpi   fpo   fpf
 46715224  891296  24   0   0  ...    4    0    0     27     0     0
 46304792  897312 151   0   0  ...    1    0    0    280    25    25
 45886168  899808 118   0   0  ...    1    0    0    641     1     1
 46723376  899440  29   0   0  ...   40    0    0     60     0     0
Swapping

Scheduler/Dispatcher:
> Dramatically affects process performance
> Used when demand paging is not enough

Soft swapping:
> Avg. freemem below desfree for 30 sec.
> Look for inactive processes, at least maxslp

Hard swapping:
> Run queue >= 2 (waiting for CPU)
> Avg. freemem below desfree for 30 sec.
> Excessive paging, (pageout + pagein) > maxpgio
> Aggressive; unload kernel mods & free cache
Allocated:
> Virtual space is allocated when the first physical page is assigned to it
Swapped out:
> When a shortage occurs
> Page is swapped out by the scanner, migrated to swap storage
Swap Space

[Diagram: all virtual swap divides into reserved swap, unallocated virtual swap, and free virtual swap]
Swap Usage
Virtual Swap:
# swap -s
total: 175224k bytes unallocated + 24464k allocated = 199688k reserved,
       416336k available
Physical Swap:
Paging Activity
> Use vmstat -p to check if there are anonymous page-ins
Attribution
> Use DTrace to see which processes/files are causing paging
 sr   api   apo   apf   fpi   fpo   fpf
  0     0     0     0     0     1     1
  0     0  3238  3238    10  9391 10630
  0    12 19934 19930    95 16548 16591
  0    23 28739 28804    56   547   556
prstat -mL shows microstate execution time for each target thread/process
> DFL shows time spent waiting in major faults in anon:

sol8$ prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15625 rmc      0.1 0.7 0.0 0.0  95 0.0 0.9 3.2  1K 726  88   0 filebench/2
 15652 rmc      0.1 0.7 0.0 0.0  94 0.0 1.8 3.6  1K  1K  10   0 filebench/2
 15635 rmc      0.1 0.7 0.0 0.0  96 0.0 0.5 3.2  1K  1K   8   0 filebench/2
 15626 rmc      0.1 0.6 0.0 0.0  95 0.0 1.4 2.6  1K 813  10   0 filebench/2
 15712 rmc      0.1 0.5 0.0 0.0  47 0.0  49 3.8  1K 831 104   0 filebench/2
 15628 rmc      0.1 0.5 0.0 0.0  96 0.0 0.0 3.1  1K 735   4   0 filebench/2
 15725 rmc      0.0 0.4 0.0 0.0  92 0.0 1.7 5.7 996 736   8   0 filebench/2
 15719 rmc      0.0 0.4 0.0 0.0  40  40  17 2.9  1K 708 107   0 filebench/2
 15614 rmc      0.0 0.3 0.0 0.0  92 0.0 4.7 2.4 874 576  40   0 filebench/2
> arg0: the value by which the statistic is to be incremented. For most probes, this argument is always 1, but for some it may take other values; these probes are noted in Table 5-4.
> arg1: a pointer to the current value of the statistic to be incremented. This value is a 64-bit quantity that is incremented by the value in arg0. Dereferencing this pointer allows consumers to determine the current count of the statistic corresponding to the probe.
sol10$ dtrace -n 'anonpgin { @[execname] = count() }'
dtrace: description 'anonpgin' matched 1 probe
^C
  svc.startd            1
  sshd                  2
  ssh                   3
  dtrace                6
  vmstat               28
  filebench           913
sol10$ ./whospaging.d
Who's waiting for pagein (milliseconds):
  wnck-applet              21
  gnome-terminal           75
Who's on cpu (milliseconds):
  wnck-applet              13
  gnome-terminal           14
  metacity                 23
  Xorg                     90
  sched                  3794
Large Memory
Large Memory in Perspective 64-bit Solaris 64-bit Hardware Solaris enhancements for Large Memory Large Memory Databases Configuring Solaris for Large Memory Using larger page sizes
Virtual Machines
> One address space for all objects; JVM today is 100GB+
Scientific/Simulation/Modelling
> Oil/Gas, finite element, bioinformatics models: 500GB+
> Medium size mechanical models larger than 4GB
Solaris
Full 64-bit support (Solaris 7 and beyond)

  ILP32 Apps       LP64 Apps
  ILP32 Libs       LP64 Libs
  ILP32 Kernel     LP64 Kernel
  ILP32 Drivers    LP64 Drivers
  32-bit H/W       64-bit H/W
64-bit Solaris
LP64 Data Model
32-bit or 64-bit kernel, with 32-bit & 64-bit application support
> 64-bit on SPARC
> Solaris 10: 64-bit on AMD64 (Opteron, Athlon)
Developer Perspective
Virtually unlimited address space
> Data objects, files, large hardware devices can be mapped into virtual address space
> 64-bit data types, parameter passing
> Caching can be implemented in application, yielding much higher performance
Small Overheads
Exploiting 64-bits
Commercial: Java Virtual Machine, SAP, Microfocus Cobol, ANTS, XMS, Multigen
RDBMS: Oracle, DB2, Sybase, Informix, TimesTen
Mechanical/Design: PTC, Unigraphics, Mentor Graphics, Cadence, Synopsys, etc...
Supply Chain: I2, SAP, Manugistics
HPC: PTC, ANSYS, ABAQUS, Nastran, LS-Dyna, Fluent, etc...
Solaris 8, 2/02
> Large working sets MMU perf
> Raise 8GB limit to 128GB
> Dump performance improved
> Boot performance improved
Solaris 9
> Generic multiple page size facility and tools
Solaris 10
> Large kernel pages
Configuring Solaris

fsflush uses too much CPU on Solaris 8
> Symptom is one CPU using 100% sys

Corrective Action
> Set autoup in /etc/system
> Default is 30s; recommend setting larger
> e.g. 10x n GB of memory
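Before tuning, it is worth confirming where fsflush spends its time; a minimal profiling sketch with the profile provider (971 Hz to avoid clock-interrupt collisions, as recommended for lockstat earlier):

profile-971
/execname == "fsflush"/
{
        @[stack()] = count();   /* kernel stacks sampled while fsflush is on CPU */
}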
Databases
Exploit memory to reduce/eliminate I/O! Eliminating I/O is the easiest way to tune it... Increase cache hit rates:
> 95% means 1 out of 20 accesses results in I/O
> 99% means 1 out of 100 - a 5x reduction in I/O!
We can often fit an entire RDBMS in memory
A write-mostly I/O pattern results
64-Bit Oracle
Required to cache more than 3.75GB
Available since DBMS 8.1.7
Sun has tested up to 540GB SGA
Recommended by Oracle and Sun
Cache for everything except PQ
Pay attention to cold-start times
Solaris 9/10
> Multiple Page Size Support (MPSS)
> Optional large pages for heap/stack
> Programmatically via madvise()
> Shared library for existing binaries (LD_PRELOAD)
> Tool to observe potential gains:
# trapstat -t
> TLB Spread Exceeded: 2GB
> TSB Spread Exceeded: 8GB [@128GB w/S8U7]
Trapstat Introduction
sol9# trapstat -t 1 111
cpu m| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
-----+-------------------------------+-------------------------------+----
  0 u|         1  0.0         0  0.0 |   2171237 45.7         0  0.0 |45.7
  0 k|         2  0.0         0  0.0 |      3751  0.1         7  0.0 | 0.1
=====+===============================+===============================+====
 ttl |         3  0.0         0  0.0 |   2192238 46.2         7  0.0 |46.2
sol9# trapstat -t 1 111
cpu m| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
-----+-------------------------------+-------------------------------+----
  0 u|         1  0.0         0  0.0 |   2171237 45.7         0  0.0 |45.7
  0 k|         2  0.0         0  0.0 |      3751  0.1         7  0.0 | 0.1
=====+===============================+===============================+====
 ttl |         3  0.0         0  0.0 |   2192238 46.2         7  0.0 |46.2
AMD64
solaris10> isainfo
amd64 i386
solaris10> pagesize -a
4096
2097152
solaris10>
UltraSPARC IV
> Optimized for 8k
> Only one large page size
> 7 TLB entries for large pages
> Pick from 64k, 512k, 4M
Solaris 9 & 10
> Multiple Page Size Support (MPSS)
> Optional large pages for heap/stack
> Programmatically via madvise()
> Shared library for existing binaries (LD_PRELOAD)
> Tool to observe potential gains:
# trapstat -t
Example Program
#include <sys/types.h>
#include <stdlib.h>

const char * const_str = "My const string";   /* read-only data */
char * global_str = "My global string";       /* initialized data */
int global_int = 42;                          /* initialized data */

int
main(int argc, char * argv[])
{
        int local_int = 123;                  /* stack */
        char * s;
        int i;
        char command[1024];

        global_int = 5;
        s = (char *)malloc(14000);            /* heap allocation */
        s[0] = 'a';
        s[100] = 'b';
        s[8192] = 'c';
        return (0);
}
Virtual to Physical

[Diagram: the MMU translates the process's virtual address space (text, data, heap, libraries, stack) to physical pages]
Address Space
Process Address Space
> Process Text and Data
> Stack (anon memory) and Libraries
> Heap (anon memory)
Kernel Text and Data
Kernel Map Space (data structs, caches)
32-bit Kernel map (64-bit kernels only)
Trap table
Critical virtual memory data structures
Mapping file system cache (segmap)
[Diagram: process address space layouts]
> 32-bit sun4u: stack (below 0xEF7EA000) and libraries at the top, heap/data/text down to 0x00010000
> 64-bit sun4u: stack below 0xFFFFFFFF.7F7F0000, VA hole from 0xFFFFF7FF.FFFFFFFF down to 0x00000800.00000000, libraries, heap/data/text down to 0x00010000
> 32-bit x86: stack below 0xE0000000 (0xDFFFE000), libraries at 0xDF7F9000, heap/data/text down to 0x8048000
> 64-bit amd64: stack and libraries at the top, VA hole in the middle, heap/data/text down to 0x0
pmap -x
Sol8# /usr/proc/bin/pmap -x $$
18084:  csh
Address   Kbytes Resident Shared Private Permissions        Mapped File
00010000     144      144    136       8 read/exec          csh
00044000      16       16      -      16 read/write/exec    csh
00048000     120      104      -     104 read/write/exec    [ heap ]
FF200000     672      624    600      24 read/exec          libc.so.1
FF2B8000      24       24      -      24 read/write/exec    libc.so.1
FF2BE000       8        8      -       8 read/write/exec    libc.so.1
FF300000      16       16      8       8 read/exec          libc_psr.so.1
FF320000       8        8      -       8 read/exec          libmapmalloc.so.1
FF332000       8        8      -       8 read/write/exec    libmapmalloc.so.1
FF340000       8        8      -       8 read/write/exec    [ anon ]
FF350000     168      112     88      24 read/exec          libcurses.so.1
FF38A000      32       32      -      32 read/write/exec    libcurses.so.1
FF392000       8        8      -       8 read/write/exec    libcurses.so.1
FF3A0000       8        8      -       8 read/exec          libdl.so.1
FF3B0000     136      136    128       8 read/exec          ld.so.1
FF3E2000       8        8      -       8 read/write/exec    ld.so.1
FFBE6000      40       40      -      40 read/write/exec    [ stack ]
--------  ------   ------ ------  ------
total Kb    1424     1304    960     344
[Table: maximum heap size by Solaris release — from 2 GBytes on early 32-bit releases, through 3.75 GBytes and 3.75/3.90 GBytes, up to 16 TBytes on 64-bit UltraSPARC]
Page Faults
MMU-generated exception:

Major Page Fault:
> Failed access to VM location, in a segment
> Page does not exist in physical memory
> New page is created or copied from swap
> If addr not in a valid segment (SIG-SEGV)
[Diagram: page fault flow — seg_fault() resolves the fault against the segment; a new page is created, or the page is brought back from swapfs/swap space]
The dtrace VM provider provides a probe for each VM statistic
We can observe all VM statistics via kstat:

$ kstat -n vm
module: cpu                             instance: 0
name:   vm                              class:    misc
        anonfree                        0
        anonpgin                        0
        anonpgout                       0
        as_fault                        3180528
        cow_fault                       37280
        crtime                          463.343064
        dfree                           0
        execfree                        0
        execpgin                        442
        execpgout                       0
        fsfree                          0
        fspgin                          2103
        fspgout                         0
        hat_fault                       0
        kernel_asflt                    0
        maj_fault                       912
kthr       memory           page  disk  faults  ...
 r b w     swap    free   re  mf  ...
 0 1 0  1341844  836720   26 311  ...
 0 1 0  1341344  835300  238 934  ...
 0 1 0  1340764  833668   24 165  ...
 0 1 0  1340420  833024   24 394  ...
 0 1 0  1340068  831520   14 202  ...
The pi column in the above output denotes the number of pages paged in. The vminfo provider makes it easy to learn more about the source of these page-ins:
dtrace -n 'pgin { @[execname] = count() }'
dtrace: description 'pgin' matched 1 probe
^C
  xterm                 1
  ksh                   1
  ls                    2
  lpstat                7
  sh                   17
  soffice              39
  javaldx             103
  soffice.bin        3065
From the above, we can see that a process associated with the StarOffice Office Suite, soffice.bin, is responsible for most of the page-ins. To get a better picture of soffice.bin in terms of VM behavior, we may wish to enable all vminfo probes. In the following example, we run dtrace(1M) while launching StarOffice:
dtrace -P vminfo'/execname == "soffice.bin"/{@[probename] = count()}'
dtrace: description 'vminfo' matched 42 probes
^C
  pgout                16
  anonfree             16
  anonpgout            16
  pgpgout              16
  dfree                16
  execpgin             80
  prot_fault           85
  maj_fault            88
  pgin                 90
  pgpgin               90
  cow_fault           859
  zfod               1619
  pgfrec             8811
  pgrec              8827
  as_fault           9495
To further drill down on some of the VM behavior of StarOffice during startup, we could write the following D script:
vminfo:::maj_fault,
vminfo:::zfod,
vminfo:::as_fault
/execname == "soffice.bin" && start == 0/
{
        /*
         * This is the first time that a vminfo probe has been hit; record
         * our initial timestamp.
         */
        start = timestamp;
}

vminfo:::maj_fault,
vminfo:::zfod,
vminfo:::as_fault
/execname == "soffice.bin"/
{
        /*
         * Aggregate on the probename, and lquantize() the number of seconds
         * since our initial timestamp.  (There are 1,000,000,000 nanoseconds
         * in a second.)  We assume that the script will be terminated before
         * 60 seconds elapses.
         */
        @[probename] = lquantize((timestamp - start) / 1000000000, 0, 60);
}
[Sample output: lquantize() distributions of as_fault, maj_fault and zfod counts over the seconds since startup]
Copy-on-write

Copy-on-write remaps a pagesize address from the mapped file to anonymous memory (swap space)

[Diagram: a process mapping of a file and its libraries; on write, the affected page is remapped to anonymous memory backed by swap]
Anonymous Memory
Pages not "directly" backed by a vnode
Heap, stack and copy-on-write pages
Pages are reserved when "requested"
Pages allocated when "touched"
Anon layer:
> creates slot array for pages
> Slots point to anon structs
Swapfs layer:
> Pseudo file system for anon layer
> Provides the backing store
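Allocation on first touch can be watched via the vminfo provider's zfod (zero-fill-on-demand) statistic (an illustrative one-liner):

# dtrace -n 'vminfo:::zfod { @[execname] = count(); }'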
ISM

[Diagram: Intimate Shared Memory vs non-ISM — with ISM, processes attaching the same shared memory segment share its page mappings; without ISM, each process maintains its own]
Process/Threads Glossary
Process - The executable form of a program; an operating system abstraction that encapsulates the execution context of a program
Thread - An executable entity
User Thread - A thread within the address space of a process
Kernel Thread - A thread in the address space of the kernel
Dispatcher - The kernel subsystem that manages queues of runnable kernel threads
Scheduling Class - Kernel classes that define the scheduling parameters (e.g. priorities) and algorithms used to multiplex threads onto processors
Dispatch Queues - Per-processor sets of queues of runnable threads (run queues)
Sleep Queues - Queues of sleeping threads
Turnstiles - A special implementation of sleep queues that provide priority inheritance
Executable Files
Processes originate as executable programs that are exec'd
Executable & Linking Format (ELF)
> Standard executable binary file Application Binary Interface (ABI) format
> Two standard components
  > Platform independent
  > Platform dependent (SPARC, x86)
> Defines both the on-disk image format, and the in-memory image
> ELF file components defined by
  > ELF header
  > Program Header Table (PHT)
  > Section Header Table (SHT)
PHT
> Array of Elf_Phdr structures
SHT
> Array of Elf_Shdr structures
ELF Files
ELF on-disk object created by the link-editor at the tail end of the compilation process (although we still call it an a.out by default...)
ELF objects can be statically linked or dynamically linked
> Compiler "-B static" flag, default is dynamic
> Statically linked objects have all references resolved and bound in the binary (libc.a)
> Dynamically linked objects rely on the run-time linker, ld.so.1, to resolve references to shared objects at run time (libc.so.1)
> Static linking is discouraged, and not possible for 64-bit binaries
[Example: selected ELF header fields — ei_data: ELFDATA2MSB, e_version: EV_CURRENT]
borntorun> pldd $$ 495: ksh /usr/lib/libsocket.so.1 /usr/lib/libnsl.so.1 /usr/lib/libc.so.1 /usr/lib/libdl.so.1 /usr/lib/libmp.so.2 /usr/platform/sun4u/lib/libc_psr.so.1 /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.2 borntorun>
Solaris Process

[Diagram: the proc_t and its relationships — the a.out vnode, address space (as_t mapping memory pages via page_t), credentials (cred_t), session (sess_t), lineage pointers, signal management, /proc support, resource usage and microstate accounting; one or more LWP/kernel-thread pairs, each with hardware context and per-scheduling-class data; active processes are linked on the practive list]
Process Structure
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba fctl nca lofs nfs random sppp crypto ptm logindmux cpc ]
> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000001 fffffffffbc1ce80 sched
R      3      0      0      0      0 0x00020001 ffffffff880838f8 fsflush
R      2      0      0      0      0 0x00020001 ffffffff88084520 pageout
R      1      0      0      0      0 0x42004000 ffffffff88085148 init
R  21344      1  21343  21280   2234 0x42004000 ffffffff95549938 tcpPerfServer
...
> ffffffff95549938::print proc_t
{
    p_exec = 0xffffffff9285dc40
    p_as = 0xffffffff87c776c8
    p_cred = 0xffffffff8fdeb448
    p_lwpcnt = 0x6
    p_zombcnt = 0
    p_tlist = 0xffffffff8826bc20
    .....
    u_ticks = 0x16c6f425
    u_comm = [ "tcpPerfServer" ]
    u_psargs = [ "/export/home/morgan/work/solaris_studio9/bin/tcpPerfServer 9551 9552" ]
    u_argc = 0x3
    u_argv = 0x8047380
    u_envp = 0x8047390
    u_cdir = 0xffffffff8bf3d7c0
    u_saved_rlimit = [ {
        rlim_cur = 0xfffffffffffffffd
        rlim_max = 0xfffffffffffffffd
    } ]
    ......
    fi_nfiles = 0x3f
    fi_list = 0xffffffff8dc44000
    fi_rlist = 0
}
p_model = 0x100000
p_rctls = 0xffffffffa7cbb4c8
p_dtrace_probes = 0
p_dtrace_count = 0
p_dtrace_helpers = 0
p_zone = zone0
Creation
> SIDL state
> exec(2) overlays newly created process with executable image

State Transitions
> Typically runnable (SRUN), running (SONPROC) or sleeping (SSLEEP)

Termination
> SZOMB state
> implicit or explicit exit(), signal (kill), fatal error
Process Creation
Traditional UNIX fork/exec model
> fork(2) - replicate the entire process, including all threads
> fork1(2) - replicate the process, only the calling thread
> vfork(2) - replicate the process, but do not dup the address space
> The new child borrows the parent's address space, until exec()

main(int argc, char *argv[])
{
        pid_t pid;

        pid = fork();
        if (pid == 0)
                /* in the child */
                exec();
        else if (pid > 0)
                /* in the parent */
                wait();
        else
                /* fork failed */
}
fork(2) in Solaris 10
Solaris 10 unified the process model
> libthread merged with libc
> threaded and non-threaded processes look the same

fork(2) now has fork1(2) behaviour
forkall(2) added for applications that require a fork to replicate all the threads in the process
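The unified model is easy to observe with the proc provider (a sketch, not from the deck; args[0] is the child's psinfo_t):

proc:::create
{
        printf("%s (pid %d) created pid %d\n",
            execname, pid, args[0]->pr_pid);   /* pr_pid: the new child's pid */
}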
Inherited across fork(): project ID, session membership, real UID & GID, current working directory, resource limits, processor binding, times, etc, ...
Watching Forks
D script for watching fork(2)
#!/usr/sbin/dtrace -qs

syscall::forkall:entry
{
        @fall[execname] = count();
}

syscall::fork1:entry
{
        @f1[execname] = count();
}

syscall::vfork:entry
{
        @vf[execname] = count();
}

dtrace:::END
{
        printf("forkall\n");
        printa(@fall);
        printf("fork1\n");
        printa(@f1);
        printf("vfork\n");
        printa(@vf);
}
Example run
# ./watchfork.d
^C
forkall

fork1
  start-srvr          1
  bash                3
  4cli                6

vfork
State Transitions
[Diagram: kernel thread states — fork() creates a thread in IDL; it moves to RUN, is placed on a processor by swtch() and preempted back to RUN; it can be PINNED by an intr, STOPPED by pstop(1) and resumed by prun(1); pthread_exit() leads to ZOMBIE, which is reaped to FREE]
Microstates
Fine-grained state tracking for processes/threads
> Off by default in Solaris 8 and Solaris 9 > On by default in Solaris 10
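The transitions underlying these microstates can also be watched directly with the sched provider (an illustrative sketch totalling on-CPU time per program):

sched:::on-cpu
{
        self->on = timestamp;
}

sched:::off-cpu
/self->on/
{
        @oncpu[execname] = sum(timestamp - self->on);   /* nanoseconds on CPU */
        self->on = 0;
}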
Solaris 8 ptools
/usr/bin/pflags [ -r ] [ pid | core ] ...
/usr/bin/pcred [ pid | core ] ...
/usr/bin/pmap [ -rxlF ] [ pid | core ] ...
/usr/bin/pldd [ -F ] [ pid | core ] ...
/usr/bin/psig pid ...
/usr/bin/pstack [ -F ] [ pid | core ] ...
/usr/bin/pfiles [ -F ] pid ...
/usr/bin/pwdx [ -F ] pid ...
/usr/bin/pstop pid ...
/usr/bin/prun pid ...
/usr/bin/pwait [ -v ] pid ...
/usr/bin/ptree [ -a ] [ [ pid | user ] ... ]
/usr/bin/ptime command [ arg ... ]
/usr/bin/pgrep [ -flnvx ] [ -d delim ] [ -P ppidlist ] [ -g pgrplist ]
    [ -s sidlist ] [ -u euidlist ] [ -U uidlist ] [ -G gidlist ]
    [ -J projidlist ] [ -t termlist ] [ -T taskidlist ] [ pattern ]
/usr/bin/pkill [ -signal ] [ -fnvx ] [ -P ppidlist ] [ -g pgrplist ]
    [ -s sidlist ] [ -u euidlist ] [ -U uidlist ] [ -G gidlist ]
    [ -J projidlist ] [ -t termlist ] [ -T taskidlist ] [ pattern ]
Solaris 9 / 10 ptools
/usr/bin/pflags [-r] [pid | core] ...
/usr/bin/pcred [pid | core] ...
/usr/bin/pldd [-F] [pid | core] ...
/usr/bin/psig [-n] pid...
/usr/bin/pstack [-F] [pid | core] ...
/usr/bin/pfiles [-F] pid...
/usr/bin/pwdx [-F] pid...
/usr/bin/pstop pid...
/usr/bin/prun pid...
/usr/bin/pwait [-v] pid...
/usr/bin/ptree [-a] [pid | user] ...
/usr/bin/ptime command [arg...]
/usr/bin/pmap -[xS] [-rslF] [pid | core] ...
/usr/bin/pgrep [-flvx] [-n | -o] [-d delim] [-P ppidlist]
    [-g pgrplist] [-s sidlist] [-u euidlist] [-U uidlist]
    [-G gidlist] [-J projidlist] [-t termlist] [-T taskidlist]
    [pattern]
/usr/bin/pkill [-signal] [-fvx] [-n | -o] [-P ppidlist]
    [-g pgrplist] [-s sidlist] [-u euidlist] [-U uidlist]
    [-G gidlist] [-J projidlist] [-t termlist] [-T taskidlist]
    [pattern]
/usr/bin/plimit [-km] pid... {-cdfnstv} soft,hard... pid...
/usr/bin/ppgsz [-F] -o option[,option] cmd | -p pid...
/usr/bin/prctl [-t [basic | privileged | system]] [-e | -d action]
    [-rx] [-n name [-v value]] [-i idtype] [id...]
/usr/bin/preap [-F] pid
/usr/bin/pargs [-aceFx] [pid | core] ...
Tracing
Trace user signals and system calls - truss
> Traces by stopping and starting the process
> Can trace system calls, inline or as a summary
> Can also trace shared libraries and a.out
Linker/library interposing/profiling/tracing
> LD_ environment variables enable link debugging
> man ld.so.1
> Using the LD_PRELOAD env variable
Kernel Tracing
> lockstat, tnf, kgmon
User Threads
The programming abstraction for creating multithreaded programs
> Parallelism
> POSIX and UI thread APIs
  > thr_create(3THR)
  > pthread_create(3THR)
> Synchronization
  > Mutex locks, reader/writer locks, semaphores, condition variables
User level thread synchronization - threads sleep at user level (process private only)
Concurrency via set_concurrency() and bound LWPs
T1 Multilevel Model
Unbound Thread Implementation
> > > > > > > >
User Level scheduling Unbound threads switched onto available lwps Threads switched when blocked on sync object Thread temporary bound when blocked in system call Daemon lwp to create new lwps Signal direction handled by Daemon lwp Reaper thread to manage cleanup Callout lwp for timers
[Figure: T1 multilevel model - user threads multiplexed over LWPs, each LWP paired with a kernel thread]
T1 Multilevel Model
Pros:
> Fast user thread create and destroy
> Allows a many-to-few thread model, to minimize the number of kernel threads and LWPs
> Uses minimal kernel memory
> No system call required for synchronization
> Process-private synchronization only
> Can have thousands of threads
> Fast context-switching
Cons:
> Complex, tricky programming model with respect to achieving good scalability - need to bind or use set_concurrency()
> Signal delivery
> Compute-bound threads do not surrender the LWP, leading to excessive CPU consumption and potential starvation of other threads
> Complex to maintain (for Sun)
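On Solaris 8, the alternate single-level thread library could be selected per process without relinking, and pldd(1) confirms which library was loaded (a sketch; ./app is a stand-in for your binary):

sol8$ LD_LIBRARY_PATH=/usr/lib/lwp ./app &     # use the alternate 1:1 libthread
sol8$ pldd $! | grep libthread
/usr/lib/lwp/libthread.so.1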
[Figure: user/kernel view of a process - multiple LWPs, each paired with its own kernel thread]
of a thread that is sleeping
> Threads that rely on fairness of scheduling/CPU could end up ping-ponging, at the expense of another thread which has work to do
> One out of every 16 queuing operations will put a thread at the end of the queue, to prevent starvation
> The maximum number of stacks the library retains after threads exit, for re-use when more threads are created, is 10
pae1> truss -p 2975/3
/3:     close(5)                                        = 0
/3:     open("/space1/3", O_RDWR|O_CREAT, 0666)         = 5
/3:     lseek(5, 0, SEEK_SET)                           = 0
/3:     write(5, " U U U U U U U U U U U U".., 1056768) = 1056768
/3:     lseek(5, 0, SEEK_SET)                           = 0
/3:     read(5, " U U U U U U U U U U U U".., 1056768)  = 1056768
/3:     close(5)                                        = 0
/3:     open("/space1/3", O_RDWR|O_CREAT, 0666)         = 5
/3:     lseek(5, 0, SEEK_SET)                           = 0
/3:     write(5, " U U U U U U U U U U U U".., 1056768) = 1056768
Thread Microstates
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   918 rmc      0.2 0.4 0.0 0.0 0.0 0.0  99 0.0  27   2  1K   0 prstat/1
   919 mauroj   0.1 0.4 0.0 0.0 0.0 0.0  99 0.1  44  12  1K   0 prstat/1
   907 root     0.0 0.1 0.0 0.0 0.0 0.0  97 3.1 121   2  20   0 filebench/2
   913 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0  15   2 420   0 filebench/2
   866 root     0.0 0.0 0.0 0.0 0.0 0.0  96 4.1  44  41 398   0 filebench/2
   820 root     0.0 0.0 0.0 0.0 0.0 0.0  95 5.0  43  42 424   0 filebench/2
   814 root     0.0 0.0 0.0 0.0 0.0 0.0  95 5.0  43  41 424   0 filebench/2
   772 root     0.0 0.0 0.0 0.0 0.0 0.0  96 3.6  46  39 398   0 filebench/2
   749 root     0.0 0.0 0.0 0.0 0.0 0.0  96 3.7  45  41 398   0 filebench/2
   744 root     0.0 0.0 0.0 0.0 0.0 0.0  95 4.7  47  39 398   0 filebench/2
   859 root     0.0 0.0 0.0 0.0 0.0 0.0  95 4.9  44  41 424   0 filebench/2
   837 root     0.0 0.0 0.0 0.0 0.0 0.0  96 4.0  43  43 405   0 filebench/2
[snip]
   787 root     0.0 0.0 0.0 0.0 0.0 0.0  95 4.5  43  41 424   0 filebench/2
   776 root     0.0 0.0 0.0 0.0 0.0 0.0  95 4.8  43  42 398   0 filebench/2
   774 root     0.0 0.0 0.0 0.0 0.0 0.0  96 4.2  43  40 398   0 filebench/2
   756 root     0.0 0.0 0.0 0.0 0.0 0.0  96 3.8  44  41 398   0 filebench/2
   738 root     0.0 0.0 0.0 0.0 0.0 0.0  96 4.4  43  42 398   0 filebench/2
   735 root     0.0 0.0 0.0 0.0 0.0 0.0  96 3.9  47  39 405   0 filebench/2
   734 root     0.0 0.0 0.0 0.0 0.0 0.0  96 4.3  44  41 398   0 filebench/2
   727 root     0.0 0.0 0.0 0.0 0.0 0.0  96 4.4  43  43 398   0 filebench/2
   725 root     0.0 0.0 0.0 0.0 0.0 0.0  96 4.4  43  43 398   0 filebench/2
Total: 257 processes, 3139 lwps, load averages: 7.71, 2.39, 0.97
Watching Threads
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
 29105 root     5400K 3032K sleep   60    0   0:00:00 1.3% pkginstall/1
 29051 root     5072K 4768K cpu0    49    0   0:00:00 0.8% prstat/1
   202 root     3304K 1256K sleep   59    0   0:00:07 0.3% nscd/23
 25947 root     5160K  608K sleep   59    0   0:00:05 0.2% sshd/1
 23078 root       20M 1880K sleep   59    0   0:00:58 0.2% lupi_zones/1
 25946 rmc      3008K  624K sleep   59    0   0:00:02 0.2% ssh/1
 23860 root     5248K  688K sleep   59    0   0:00:06 0.2% sshd/1
 29100 root     1272K  976K sleep   59    0   0:00:00 0.1% mpstat/1
 24866 root     5136K  600K sleep   59    0   0:00:02 0.0% sshd/1
   340 root     2504K  672K sleep   59    0   0:11:14 0.0% mibiisa/2
 23001 root     5136K  584K sleep   59    0   0:00:04 0.0% sshd/1
   830 root     2472K  600K sleep   59    0   0:11:01 0.0% mibiisa/2
   829 root     2488K  648K sleep   59    0   0:11:01 0.0% mibiisa/2
     1 root     2184K  400K sleep   59    0   0:00:01 0.0% init/1
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/13
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/12
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/11
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/10
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/9
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/8
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/7
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/6
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/5
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/4
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/3
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/2
   202 root     3304K 1256K sleep   59    0   0:00:00 0.0% nscd/1
   126 daemon   2360K    8K sleep   59    0   0:00:00 0.0% rpcbind/1
   814 root     1936K  280K sleep   59    0   0:00:00 0.0% sac/1
    64 root     2952K    8K sleep   59    0   0:00:00 0.0% picld/5
    64 root     2952K    8K sleep   59    0   0:00:00 0.0% picld/4
    64 root     2952K    8K sleep   59    0   0:00:00 0.0% picld/3
    64 root     2952K    8K sleep   59    0   0:00:00 0.0% picld/2
    64 root     2952K    8K sleep   59    0   0:00:00 0.0% picld/1
    61 daemon   3640K    8K sleep   59    0   0:00:00 0.0% kcfd/3
    61 daemon   3640K    8K sleep   59    0   0:00:00 0.0% kcfd/2
    61 daemon   3640K    8K sleep   59    0   0:00:00 0.0% kcfd/1
    55 root     2416K    8K sleep   59    0   0:00:00 0.0% syseventd/14
    55 root     2416K    8K sleep   59    0   0:00:00 0.0% syseventd/13
    55 root     2416K    8K sleep   59    0   0:00:00 0.0% syseventd/12
    55 root     2416K    8K sleep   59    0   0:00:00 0.0% syseventd/11
Total: 125 processes, 310 lwps, load averages: 0.50, 0.38, 0.40
Solaris Scheduling
Solaris implements a central dispatcher, with multiple scheduling classes
> Scheduling classes determine the priority range of kernel threads on the system-wide (global) scale, and the scheduling algorithms applied
> Each scheduling class references a dispatch table
> Values used to determine time quantums and priorities
> Admin interface to tune thread scheduling
Scheduling Classes
Traditional Timeshare (TS) class
> Attempt to give every thread a fair shot at execution time
System (SYS)
> Only available to the kernel, for OS kernel threads
Realtime (RT)
> Highest priority scheduling class
> Will preempt kernel (SYS) class threads
> Intended for realtime applications
> Bounded, consistent scheduling latency
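For example, priocntl(1) can launch a command in, or move an existing process into, the RT class (root privileges required; the priority and pid values here are arbitrary):

# priocntl -e -c RT -p 10 ./rtapp          # execute a command in the RT class at RT priority 10
# priocntl -s -c RT -p 10 -i pid 1234      # move pid 1234 into the RT class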
Fair Share Scheduler (FSS)
> Same priority range as the TS/IA classes
> CPU resources are divided into shares
> Shares are allocated (to projects/tasks) by the administrator
> Scheduling decisions are made based on shares allocated and used, not dynamic priority changes
[Figure: global (system-wide) priority range - interrupt priorities sit above the RT class (global priorities 100-159); SYS spans 60-99; TS, IA, FX and FSS share 0-59, each exposing a -60 to +60 user priority range]
Scheduling Classes
Use dispadmin(1M) and priocntl(1)
# dispadmin -l
CONFIGURED CLASSES
==================
SYS (System Class)
TS  (Time Sharing)
FX  (Fixed Priority)
IA  (Interactive)
FSS (Fair Share)
RT  (Real Time)
# priocntl -l
CONFIGURED CLASSES
==================
SYS (System Class)
TS (Time Sharing)
        Configured TS User Priority Range: -60 through 60
FX (Fixed priority)
        Configured FX User Priority Range: 0 through 60
IA (Interactive)
        Configured IA User Priority Range: -60 through 60
FSS (Fair Share)
        Configured FSS User Priority Range: -60 through 60
RT (Real Time)
        Maximum Configured RT Priority: 59
#
Scheduling Classes
The kernel maintains an array of sclass structures for each loaded scheduling class
> References the scheduling class's init routine, class functions structure, etc
> Each kernel thread references its class-specific data structure
> Different threads in the same process can be in different scheduling classes
Scheduling class operations vectors and CL_XXX macros allow a single, central dispatcher to invoke scheduling-class specific functions
Dispatch queues
> Per-CPU queues of runnable threads, ordered by thread priority
> Queue occupation represented via a bitmap
> For realtime threads, a system-wide kernel preempt queue is maintained
> Realtime threads are placed on this queue, not the per-CPU queues
> If processor sets are configured, a kernel preempt queue exists for each processor set
Dispatch tables
> Per-scheduling class parameter tables
> Time quantums and priorities
> Tuneable via dispadmin(1M)
[Figure: per-CPU dispatch queues - each cpu structure references an array of dispq_t entries (dq_first, dq_last, dq_runcnt), each anchoring a linked list of runnable kernel threads (kthread_t)]
# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait      # PRIORITY LEVEL
       200         0        50           0        50         #       0
       200         0        50           0        50         #       1
       .........
       160         0        51           0        51         #      10
       160         1        51           0        51         #      11
       ..........
       120        10        52           0        52         #      20
       120        11        52           0        52         #      21
       .........
        80        20        53           0        53         #      30
        80        21        53           0        53         #      31
       ..........
        40        30        55           0        55         #      40
        40        31        55           0        55         #      41
       ...........
        20        49        59       32000        59         #      59
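The active table can be dumped, edited, and loaded back with dispadmin(1M); a sketch (ts.cfg is an arbitrary file name):

# dispadmin -c TS -g -r 1000 > ts.cfg      # dump the TS table with quanta in milliseconds
# vi ts.cfg                                # adjust ts_quantum and friends
# dispadmin -c TS -s ts.cfg                # load the modified table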
FX
> Time quantum only
> For each possible priority
FSS
> Time quantum only
> Just one, not defined for each priority level
SYS
> No dispatch table
> Not needed, no rules apply
INT
> Not really a scheduling class
> Warm affinity
> Depth and priority of existing runnable threads
> Solaris 9 added Memory Placement Optimization (MPO): when enabled, it will keep a thread in its defined locality group (lgroup)
if (thread is bound to CPU-n) && (pri < kpreemptpri)
        place thread on CPU-n dispatch queue
if (thread is bound to CPU-n) && (pri >= kpreemptpri)
        place thread on CPU-n dispatch queue
if (thread is not bound) && (pri < kpreemptpri)
        place thread on a CPU dispatch queue
if (thread is not bound) && (pri >= kpreemptpri)
        place thread on cp_kp_queue
Thread Selection
The kernel dispatcher implements a select-and-ratify thread selection algorithm
> disp_getbest() - find the highest priority runnable thread, and select it for execution
> disp_ratify() - commit to the selection; clear the CPU preempt flags, and make sure another thread of higher priority did not become runnable
> If one did, place the selected thread back on a queue, and try again
> Try to get a warm cache
> rechoose_interval kernel parameter (see the mdb sketch below)
> Default is 3 clock ticks
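The current value can be read from the live kernel with mdb(1); a sketch:

# echo "rechoose_interval/D" | mdb -k
rechoose_interval:
rechoose_interval:              3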
Thread Preemption
Two classes of preemption
> User preemption
> A higher priority thread became runnable, but it's not a realtime thread
> Flagged via cpu_runrun in the CPU structure
> Next clock tick, you're outta here
> Kernel preemption
> A realtime thread became runnable - even OS kernel threads will get preempted
> Poke the CPU (cross-call) and preempt the running thread now
> Note that threads that use up their time quantum are evicted via the preempt mechanism
> Monitor via the icsw column in mpstat(1)
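A quick way to see which processes are being preempted is the DTrace sched provider; a one-liner sketch:

# dtrace -n 'sched:::preempt { @[execname] = count(); }'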
Thread Execution
Run until
> A preemption occurs
> Transition from S_ONPROC to S_RUN
> Placed back on a run queue
> A blocking system call is issued
> e.g. read(2)
> Transition from S_ONPROC to S_SLEEP
> Placed on a sleep queue
> Done and exit
> Clean up
> Interrupt to the CPU you're running on
> Pinned for the interrupt thread to run
> Unpinned to continue
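The on-cpu/off-cpu transitions above can be watched directly with the DTrace sched provider; a sketch showing the distribution of time threads spend on a CPU before leaving it:

# dtrace -n 'sched:::on-cpu { self->ts = timestamp; }
    sched:::off-cpu /self->ts/ { @[execname] = quantize(timestamp - self->ts); self->ts = 0; }'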
Context Switching
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   74   2  998   417  302  450   18   45  114    0  1501   56   7   0  37
  1  125   3  797   120  102 1107   16   58  494    0  1631   41  16   0  44
  4  209   2  253   114  100  489   12   45   90    0  1877   56  11   0  33
  5  503   7 2448   122  100  913   21   53  225    0  2626   32  21   0  48
  8  287   3   60   120  100  771   20   35  122    0  1569   50  12   0  38
  9   46   1   51   115   99  671   16   20  787    0   846   81  16   0   3
 12  127   2  177   117  101  674   14   27  481    0   881   74  12   0  14
 13  375   7  658  1325 1302  671   23   49  289    0  1869   48  16   0  37
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  733   399  297  548   10    8  653    0   518   80  11   0   9
  1  182   4   45   117  100  412   16   34   49    0   904   54   8   0  38
  4  156   4  179   108  102 1029    6   46  223    0  1860   15  16   0  70
  5   98   1   53   110  100  568    9   19  338    0   741   60   9   0  31
  8   47   1   96   111  101  630    6   22  712    0   615   56  13   0  31
  9  143   4  127   116  102 1144   11   42  439    0  2443   33  15   0  52
 12  318   0  268   111  100  734    9   30   96    0  1455   19  12   0  69
 13   39   2   16   938  929  374    8    9  103    0   756   69   6   0  25
#!/usr/sbin/dtrace -Zqs

/* inv_cnt: all involuntary context switches    */
/* tqe_cnt: time quantum expiration count       */
/* hpp_cnt: higher-priority preempt count       */
/* csw_cnt: total number of context switches    */

dtrace:::BEGIN
{
        inv_cnt = 0; tqe_cnt = 0; hpp_cnt = 0; csw_cnt = 0;
        printf("%-16s %-16s %-16s %-16s\n","TOTAL CSW","ALL INV","TQE_INV","HPP_INV");
        printf("==========================================================\n");
}

sysinfo:unix:preempt:inv_swtch
{
        inv_cnt += arg0;
}

sysinfo:unix::pswitch
{
        csw_cnt += arg0;
}

fbt:TS:ts_preempt:entry
/ ((tsproc_t *)args[0]->t_cldata)->ts_timeleft <= 1 /
{
        tqe_cnt++;
}

fbt:TS:ts_preempt:entry
/ ((tsproc_t *)args[0]->t_cldata)->ts_timeleft > 1 /
{
        hpp_cnt++;
}

fbt:RT:rt_preempt:entry
/ ((rtproc_t *)args[0]->t_cldata)->rt_timeleft <= 1 /
{
        tqe_cnt++;
}

fbt:RT:rt_preempt:entry
/ ((rtproc_t *)args[0]->t_cldata)->rt_timeleft > 1 /
{
        hpp_cnt++;
}

tick-1sec
{
        printf("%-16d %-16d %-16d %-16d\n",csw_cnt,inv_cnt,tqe_cnt,hpp_cnt);
        inv_cnt = 0; tqe_cnt = 0; hpp_cnt = 0; csw_cnt = 0;
}
solaris10> ./csw.d
TOTAL CSW        ALL INV          TQE_INV          HPP_INV
==========================================================
1544             63               24               40
3667             49               35               14
4163             59               34               26
3760             55               29               26
3839             71               39               32
3931             48               33               15
^C
solaris10> ./threads &
[2] 19913
solaris10> ./csw.d
TOTAL CSW        ALL INV          TQE_INV          HPP_INV
==========================================================
3985             1271             125              1149
5681             1842             199              1648
5025             1227             151              1080
9170             520              108              412
4100             390              84               307
2487             174              74               99
1841             113              64               50
6239             170              74               96
^C
1440             155              68               88
Sleep & Wakeup
> Threads sleep on a synchronization object via the kernel cv_xxx() functions
> The condition variable is set, and the thread is placed on a sleep queue
> Wakeup may be directed to a specific thread, or all threads waiting on the same event or resource
> One or more threads moved from the sleep queue to a run queue
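Wakeup activity can be observed with the DTrace sched provider; a sketch that counts which processes (execname, the waker) are waking threads of which target processes (pr_fname comes from the probe's psinfo_t argument):

# dtrace -n 'sched:::wakeup { @[execname, stringof(args[1]->pr_fname)] = count(); }'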
Priority inversion
> A higher priority thread is prevented from running because a lower priority thread is holding a lock the higher priority thread needs
> Blocking chains can form when mid-priority threads get in the mix
Priority inheritance
> If a resource is held, ensure all the threads in the blocking chain run at the requesting thread's priority, or better
> All lower priority threads inherit the priority of the requestor
Processor Controls
Processor controls provide for segregation of workload(s) and resources
Processor status, state, management and control
> Kernel linked list of CPU structs, one for each CPU
> Bundled utilities
> psradm(1M)
> psrinfo(1M)
> Processors can be taken offline
> Kernel will not schedule threads on an offline CPU
> The kernel can be instructed not to bind device interrupts to processor(s)
> Or move them if bindings exist
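Typical invocations (a sketch; the CPU id and pid here are arbitrary):

# psrinfo -v        # state, type and clock of each processor
# psradm -f 1       # take processor 1 offline
# psradm -n 1       # bring processor 1 back online
# psradm -i 1       # keep processor 1 online but exempt it from interrupts
# pbind -b 1 1234   # bind pid 1234 to processor 1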
Processor Sets
Partition CPU resources for segregating workloads, applications and/or interrupt handling
Dynamic
> Create, bind, add, remove, etc, without reboots
Once a set is created, the kernel will only schedule threads onto the set if they have been explicitly bound to it
> And those threads will only ever be scheduled on CPUs in the set
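A sketch of the psrset(1M) workflow (the CPU ids, set id, and pid are arbitrary):

# psrset -c 2 3       # create a processor set from CPUs 2 and 3; prints the new set id
# psrset -b 1 1234    # bind pid 1234 (and its LWPs) to set 1
# psrset -e 1 ./app   # run a command inside set 1
# psrset -i           # show the current set configuration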
Timeshare - no partitioning
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0   18   0  777  412  303  88   38   24   43   0   173  73   0  0  27
  1   30   0   13  124  101  86   34   16   44   0   181  91   0  0   9
  4   22   0    4  131  112  69   31   15   37   0    84  98   0  0   2
  5   26   0    7  116  100  59   26   10   44   0    76  99   1  0   0
  8   24   0    6  121  100  64   33   16   33   0   105  96   2  0   2
  9   22   0    5  116  100  63   27   11   39   0    73  96   2  0   2
 12   20   0    4  119  101  76   26   18   29   0    70  86   0  0  14
 13   20   0   13  115  100  72   26   14   40   0    80  84   2  0  14
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0   26   0  761  407  301  45   28   14   43   0    80  87   0  0  13
  1   18   0    5  116  101  86   27   23   35   1    73  89   0  0  11
  4   24   0    7  124  110  64   29   12   30   0    60  99   1  0   0
  5   14   0   22  115  101  82   30   23   45   0    97  71   2  0  27
  8   28   0    7  113  100  61   24   11   42   0    69  94   4  0   2
  9   24   0    5  116  101  75   25   22   41   0    83  78   5  0  17
 12   34   0    8  119  101  71   28   18   29   0    63  90   8  0   2
 13   20   0    8  122  100  74   33   17   33   0    71  76   5  0  19
[Figure: storage stack layering - Files over the File System, over Virtual Disks and Virtual Device Blocks]
[Figure: the kernel FOP (vnode operations) layer dispatches file operations to the underlying file system type - UFS, SPECFS, NFS, PROCFS - which in turn drive the sd/ssd disk drivers or the network]
[Figure: file system reads and writes pass through segmap, the file segment driver (seg_map), while mapped file segments use the vnode segment driver (seg_vn)]
segmap acts as a level-1 cache in front of the level-2 page cache; its hit ratio can be measured with kstat -n segmap. mmap()'d files bypass the segmap cache and are backed directly by the page cache and disk storage.
[Figure: file system I/O paths - segmap entry points (_getmap(), _pagecreate(), _release()) and the getpage()/putpage() routines feed the block I/O subsystem via bread()/bwrite() and pvn_readdone()/pvn_writedone(); noncached (direct) I/O bypasses the page cache via bmap_write() straight to the sd/ssd drivers]
Filesystem performance
Attribution
> How much is my application being slowed by I/O?
> i.e. How much faster would my app run if I/O were optimized?
Accountability
> What is causing I/O device utilization?
> i.e. What user is causing this disk to be hot?
Tuning/Optimizing
> Tuning for sequential, random I/O and/or meta-data intensive applications
> Measure the time a process spends waiting for I/O as a percentage of its execution time
> i.e. My app spent 80% of its execution time waiting for I/O
> The inverse is the potential speedup: e.g. 80% of time waiting equates to a potential 5x speedup
[Figure: timeline - executing 20s, waiting 80s]
Etruss
> Uses microstates to estimate I/O as wait time
> http://www.solarisinternals.com
sol10$ ./iowait.d 639
^C
Time breakdown (milliseconds):
  <on cpu>            2478
  <I/O wait>          6326
I/O wait breakdown (milliseconds):
  file1
  file2
  file4
  file3
  file5
  file7
  . . .
Solaris iostat
# iostat -xnz
                            extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  687.8    0.0 38015.3     0.0  0.0  1.9    0.0    2.7   0 100 c0d0
wait:   number of threads queued for I/O
actv:   number of threads performing I/O
wsvc_t: average time spent waiting on the queue
asvc_t: average time performing I/O
%w:     time spent with threads waiting for I/O - only useful if one thread is running on the entire machine
%b:     device utilization - only useful if the device can do just 1 I/O at a time (invalid for arrays etc...)
-C: report disk statistics by controller
-l n: limit the number of disks to n
-m: display mount points (most useful with -p)
-r: display data in comma-separated format
-s: suppress state change messages
-z: suppress entries with all zero values
-T d|u: display a timestamp in date (d) or unix time_t (u) format
Practical Implications
> Virtual memory caches files; the cache is dynamic
> Minimum I/O size is the page size
> Read/modify/write may occur on sub-page-size writes
vmstat -p

swap = free and unreserved swap in KBytes
free = free memory measured in pages
re   = kilobytes reclaimed from cache/free list
mf   = minor faults - the page was in memory but was not mapped
fr   = kilobytes that have been destroyed or freed
de   = kilobytes freed after writes
sr   = kilobytes scanned / second
executable pages:  kilobytes in - out - freed
anonymous pages:   kilobytes in - out - freed
file system pages: kilobytes in - out - freed

# vmstat -p 5 5
     memory           page          ...
     swap     free   re   mf  fr  de  sr
 46715224  891296   24  ...   0   0   0
 46304792  897312  151  ...  25   0   0
 45886168  899808  118  ...   1   0   0
 46723376  899440   29  ...   0   0   0
sol10# vmstat -p 3
     memory           page
     swap     free   re   mf  fr  de
  1057528   523080   22  105   0   0
   776904   197472    0   12   0   0
   776904   195752    0    0   0   0
   776904   194100    0    0   0   0

sol10# ./pagingflow.d
 0 => pread64
 0   | pageio_setup:pgin
 0   | pageio_setup:pgpgin
 0   | pageio_setup:maj_fault
 0   | pageio_setup:fspgin
 0   | bdev_strategy:start
 0   | biodone:done
 0 <= pread64
sol10# ./fspaging.d
Event         Device
get-page
getpage-io    cmdk0
get-page
getpage-io    cmdk0
get-page
getpage-io    cmdk0
get-page
sol10# ./fspaging.d
Event         Device
put-page
putpage-io    cmdk0
other-io      cmdk0
put-page
putpage-io    cmdk0
other-io      cmdk0
put-page
putpage-io    cmdk0
other-io      cmdk0
put-page
putpage-io    cmdk0
other-io      cmdk0
put-page
putpage-io    cmdk0
other-io      cmdk0
put-page
Buffer Cache
> Holds the meta-data of the file system: direct/indirect blocks, inodes, etc...
Random I/O
Attempt to cache as much as possible
> The best I/O is the one you don't have to do
> Eliminate physical I/O
> Add more RAM to expand caches
> Cache at the highest level
> Cache in the app if we can
> In Oracle if possible
[Figure: page cache flow - file system pages are reclaimed from the cachelist or allocated from the freelist]
Tuning segmap
By default, segmap is sized at 12% of physical memory
> Effectively sets the minimum amount of file system cache on the system by caching in segmap, over and above the dynamically-sized cachelist
On Solaris 8/9/10
> If the system memory is used primarily as a cache, cross calls (mpstat xcal) can be reduced by increasing the size of segmap via the system parameter segmap_percent (12 by default)
> segmap_percent = 100 is like Solaris 7 without priority paging, and will cause a paging storm
> Must keep segmap_percent at a reasonable value to prevent paging pressure on applications, e.g. 50%
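The tunable is set in /etc/system and takes effect at the next reboot; a sketch (50 is just the example value from above):

* /etc/system: size segmap at 50% of physical memory
set segmap_percent=50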
Tuning segmap_percent
There are kstat statistics for segmap hit rates
> Estimate hit rate as (get_reclaim+get_use) / getmap
# kstat -n segmap
module: unix                            instance: 0
name:   segmap                          class:    vm
        crtime                          17.299814595
        fault                           17361
        faulta                          0
        free                            0
        free_dirty                      0
        free_notfree                    0
        get_nofree                      0
        get_reclaim                     67404
        get_reuse                       0
        get_unused                      0
        get_use                         83
        getmap                          71177
        pagecreate                      757
        rel_abort                       0
        rel_async                       3073
        rel_dontneed                    3072
        rel_free                        616
        rel_write                       2904
        release                         67658
        snaptime                        583596.778903492
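Applying the estimate to the numbers above: (get_reclaim + get_use) / getmap = (67404 + 83) / 71177 ≈ 0.95, i.e. roughly a 95% segmap hit rate. The same counters can be pulled individually in parseable form:

# kstat -p unix:0:segmap:get_reclaim unix:0:segmap:get_use unix:0:segmap:getmap
unix:0:segmap:get_reclaim       67404
unix:0:segmap:get_use   83
unix:0:segmap:getmap    71177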
Asynchronous I/O
An API for a single-threaded process to launch multiple outstanding I/Os
> Multi-threaded programs could just use multiple threads
> Oracle databases use this extensively
> See aio_read(), aio_write() etc...
> File system: emulated with user-level LWPs issuing standard pread/pwrite system calls
> RAW disk: I/Os are passed into the kernel via kaio(), and then managed via task queues in the kernel
> Moderately faster than the user-level LWP emulation
[Figure: database I/O through the file system - 512-byte to 1MB log writes, 1k+ DB writes, and 1k+ DB reads flow between the database cache and the Solaris file system cache]
Direct I/O
[Table: feature availability - Direct I/O (Solaris 2.6+; concurrent Direct I/O in Solaris 8 2/01), Logging (Solaris 7+), Async I/O; Oracle 7.x through 8.1.5: yes; 8.1.7 and 9i: new option]
Sequential I/O
Disk performance fundamentals
> Disk seek latency will dominate for random I/O
> ~5ms per seek
> A typical disk will do ~200 random I/Os per second
> 200 x 8k = 1.6MB/s
> Seekless (sequential) transfers are typically capable of ~50MB/s
> Requires I/O sizes of 64k+
File System
> Ensure the file system groups I/Os and does read ahead
> A well-tuned fs will group small app I/Os into large physical I/Os
> e.g. UFS cluster size
IO Framework
> Ensure large I/Os can pass through
> The system parameter maxphys sets the largest I/O size
Volume Manager
> md_maxphys for SVM, or equiv for Veritas
Sequential on UFS
Sequential mode is detected by 2 adjacent operations
> e.g. read 8k, read 8k
UFS uses clusters to group reads/writes
> UFS maxcontig parameter, units are 8k
> maxcontig becomes the I/O size for sequential I/O
> Cluster size defaults to 1MB on Sun FCAL
> 56k on x86, 128k on SCSI
> Auto-detected from the SCSI driver's default
> Set by default at newfs time (can be overridden)
> e.g. Set cluster to 1MB for optimal sequential perf...
> Check size with mkfs -m, set with tunefs -a
# mkfs -m /dev/dsk/c0d0s0
mkfs -F ufs -o nsect=63,ntrack=32,bsize=8192,fragsize=1024,cgsize=49,free=1,rps=60,
nbpi=8143,opt=t,apc=0,gap=0,nrpos=8,maxcontig=7,mtb=n /dev/dsk/c0d0s0 14680512
# tunefs -a 128 /dev/rdsk/...
Average extent size: 22769 Blocks
Note: The filestat command can be found at http://www.solarisinternals.com
Sequential on UFS
Cluster Read
> When sequential access is detected, read ahead the entire cluster
> Subsequent reads will hit in the cache
> Sequential blocks will not pollute the cache by default
> i.e. sequential reads will be freed sooner
> Sequential reads go to the head of the cachelist by default
> Set the system parameter cache_read_ahead=1 if all reads should be cached
Cluster Write
> When sequential access is detected, writes are deferred until the cluster is full
UFS write throttle
Deferred writes are flushed asynchronously
> The throttle blocks writers to prevent filling memory with pending async writes
Solaris 8 defaults
> Block when 384k of unwritten cache
> Set ufs_HW=<bytes>
> Resume when 256k of unwritten cache
> Set ufs_LW=<bytes>
Solaris 9+ defaults
> Block when >16MB of unwritten cache
> Resume when <8MB of unwritten cache
Direct I/O
Introduced in Solaris 2.6 Bypasses page cache
> Zero copy: DMA from controller to user buffer
But
> No caching! Avoid unless the application caches
> No read ahead - the application must do its own
Direct I/O
Enabling direct I/O
> Direct I/O is a global setting, per file or filesystem
> Mount option
# mount -o forcedirectio /dev/dsk... /mnt
> Library call: directio(fd, DIRECTIO_ON | DIRECTIO_OFF) - see directio(3C)
lreads  = logical reads to the UFS via directio
lwrites = logical writes to the UFS via directio
preads  = physical reads to media
pwrites = physical writes to media
Krd     = kilobytes read
Kwr     = kilobytes written
Direct I/O settings apply per file, so enabling it for one application has side effects for others using the same files
> e.g. broken backup utils doing small I/Os will hurt due to lack of prefetch
Practical Properties
> Creating files in tmpfs uses RAM just like a process
> Uses swap just like a process's anonymous memory
> Overcommit will cause anon paging
Best Practices
> Don't put large files in /tmp
[Figure: tmpfs - read() and write() operate on anonymous memory obtained and released via anon_alloc()/anon_free()]
sol8# mount -F tmpfs swap /mnt
sol8# mkfile 100m /mnt/100m
sol9# mdb -k
> ::memstat
Page Summary                Pages
--------------------------  ----------------
Kernel                      31592
Anon                        59318
Exec and libs               22786
Page cache                  27626
Free (cachelist)            77749
Free (freelist)             38603
Total                       257674

sol8# umount /mnt
sol9# mdb -k
> ::memstat
Page Summary                Pages               MB  %Tot
--------------------------  ----------------  ----  ----
Kernel                      31592               123   12%
Anon                        59311               231   23%
Exec and libs               22759                88    9%
Page cache                  2029                  7    1%
Free (cachelist)            77780               303   30%
Free (freelist)             64203               250   25%
> File system snapshots
> Enhanced logging w/ Direct I/O
> Concurrent Direct I/O - 90% of RAW disk performance
> Enhanced directory lookup - file create times in large directories significantly improved
> Creating file systems
> Faster newfs(1M) (1TB was ~20 hours)
Solaris 9
> Scalable logging (for File Servers) 12/02
> Postmark white paper
> >1TB file systems (16TB) 8/03
> Integration with Live Upgrade 5/03
> >1TB volumes 5/03
> >1TB devices/EFI support 11/03
> Dynamic Reconfiguration support 11/03
Future
> Cluster-ready Volume Manager > Disk Set Migration: Import/Export > Volume Creation Service
[Table: feature availability comparison by release - Solaris (SVM/UFS) vs. VxVM and VxFS]
Summary
Solaris continues to evolve in both performance and resource management innovations
Observability tools and utilities continue to get better
Resource management facilities provide for improved overall system utilization and SLA management
Resources
http://www.solarisinternals.com http://www.sun.com/solaris http://www.sun.com/blueprints http://www.sun.com/bigadmin http://docs.sun.com
> "What's New in the Solaris 10 Operating Environment"
http://blogs.sun.com http://sun.com/solaris/fcc/lifecycle.html
Solaris 10
Jim Mauro
[email protected]