OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
Guest lecture at Brown University's Computer Science Operating Systems class, CS167, by Matt Ahrens, co-creator of ZFS. Introduction by professor Tom Doeppner. Recording, March 2017: https://youtu.be/uJGkyMxdNFE
Topics:
- Data structures and algorithms used by ZFS snapshots
- Overview of ZFS on-disk structure
- Data structures used for ZFS space allocation
- RAID-Z compared with traditional RAID-4/5/6
Class website: http://cs.brown.edu/courses/cs167/
Linux performance tuning & stabilization tips (mysqlconf2010) - Yoshinori Matsunobu
This document provides tips for optimizing Linux performance and stability when running MySQL. It discusses managing memory and swap space, including keeping hot application data cached in RAM. Direct I/O is recommended over buffered I/O to fully utilize memory. The document warns against allocating too much memory or disabling swap completely, as this could trigger the out-of-memory killer to crash processes. Backup operations are noted as a potential cause of swapping, and adjusting swappiness is suggested.
Lawrence Livermore National Laboratory uses Linux clusters with large amounts of storage and processing power for advanced simulation and data-intensive work. They have implemented the ZFS filesystem on Linux to meet their storage needs, as existing Linux filesystems did not provide sufficient scalability, data integrity, or online manageability. ZFS on Linux required changes to interfaces and memory management to work within the Linux kernel but retains the core ZFS functionality. It is now stable, high performing, and in active use at LLNL and other organizations.
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud - Patrick McGarry
Cisco Cloud Services provides an OpenStack platform to Cisco SaaS applications using a worldwide deployment of Ceph clusters storing petabytes of data. The initial Ceph cluster design experienced major stability problems as the cluster grew past 50% capacity. Strategies were implemented to improve stability including client IO throttling, backfill and recovery throttling, upgrading Ceph versions, adding NVMe journals, moving the MON levelDB to SSDs, rebalancing the cluster, and proactively detecting slow disks. Lessons learned included the importance of devops practices, sharing knowledge, rigorous testing, and balancing performance, cost and time.
This document describes SXFS, an encrypted distributed filesystem that allows for easy and secure file sharing. Some key points:
- SXFS uses client-side encryption with AES 256 and file deduplication to securely store and transfer files.
- It provides fault tolerance and scalability by backing the encrypted filesystem with the distributed SX object storage. Additional nodes can be added to increase speed and storage capacity.
- Setup involves installing SXFS on clients and servers, creating a user and volume, and mounting the encrypted filesystem on clients for easy access to shared files.
This document summarizes BlueStore, a new storage backend for Ceph that provides faster performance compared to the existing FileStore backend. BlueStore manages metadata and data separately, with metadata stored in a key-value database (RocksDB) and data written directly to block devices. This avoids issues with POSIX filesystem transactions and enables more efficient features like checksumming, compression, and cloning. BlueStore addresses consistency and performance problems that arose with previous approaches like FileStore and NewStore.
Filesystem Showdown: What a Difference a Decade Makes - Perforce
In the last 10 years, Ext4 has risen in prominence, ReiserFS has fallen by the wayside, ZFS has been ported to Linux, XFS keeps plugging along, and there's a new kid: Btrfs. NTFS has evolved, too. It's now 2016. How do these filesystems stack up against each other? Does it really make that much of a difference? We’ll show you the results of standard, consistent tests across platforms (Linux vs. Windows) and filesystems to see if the differences are worth choosing one over the other. For simplicity's sake, the tests are performed on identical hardware with out-of-the-box settings.
[OpenInfra Days Korea 2018] (Track 3) - CephFS with OpenStack Manila based on... - OpenStack Korea Community
This document discusses CephFS with OpenStack Manila based on Bluestore and erasure coding. It provides an overview of CephFS and its support in OpenStack Manila for shared file systems. It also describes how Bluestore is the default storage backend in Ceph and supports erasure coding. The benefits of erasure coding over replication for storage are outlined. Finally, it dives deeper into concepts like MDS architecture and high availability in CephFS.
Ceph Day Melbourne - Scale and performance: Servicing the Fabric and the Work... - Ceph Community
The document discusses scale and performance challenges in providing storage infrastructure for research computing. It describes Monash University's implementation of the Ceph distributed storage system across multiple clusters to provide a "fabric" for researchers' storage needs in a flexible, scalable way. Key points include:
- Ceph provides software-defined storage that is scalable and can integrate with other systems like OpenStack.
- Multiple Ceph clusters have been implemented at Monash of varying sizes and purposes, including dedicated clusters for research data storage.
- The infrastructure provides different "tiers" of storage with varying performance and cost characteristics to meet different research needs.
- Ongoing work involves expanding capacity and upgrading hardware to improve performance
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise - Red_Hat_Storage
This document summarizes performance testing of OpenStack with Cinder volumes on Ceph storage. It tested scaling performance with increasing instance counts on a 4-node and 8-node Ceph cluster. Key findings include:
- Large file sequential write performance peaked with a single instance per server due to data striping across OSDs. Read performance peaked at 32 instances per server.
- Large file random I/O performance scaled linearly with increasing instances up to the maximum tested (512 instances).
- Small file operations showed good scaling up to 32 instances per server for creates and reads, but lower performance for renames and deletes.
- Performance tuning like tuned profiles, device readahead, and Ceph journal configuration improved both
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix - The Linux Foundation
This document summarizes Felipe Franciosi's presentation on scaling Xen's aggregate storage performance. It discusses measuring storage performance, the state of the art technologies including grant mapping, persistent grants and tapdisk, and achieving aggregate measurements over 10GB/s using very fast local storage. It also outlines areas for further improvement such as increasing single-VBD performance and enabling many-VBD configurations to perform better by avoiding data copies.
CephFS performance testing was conducted on a Jewel deployment. Key findings include:
- Single MDS performance is limited by its single-threaded design; operations reached CPU limits
- Improper client behavior can cause MDS OOM issues by exceeding inode caching limits
- Metadata operations like create, open, update showed similar performance, reaching 4-5k ops/sec maximum
- Caching had a large impact on performance when the working set exceeded cache size
DataStax: Extreme Cassandra Optimization: The Sequel - DataStax Academy
Al has been using Cassandra since version 0.6 and has spent the last few months doing little else but tune Cassandra clusters. In this talk, Al will show how to tune Cassandra for efficient operation using multiple views into system metrics, including OS stats, GC logs, JMX, and cassandra-stress.
This introduces the components of the SUSE Linux Enterprise High Availability Extension product used to build highly available storage (ha-lvm/drbd/iscsi/nfs, clvm, ocfs2, cluster-raid1).
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers - Ceph Community
The document discusses a presentation about Ceph on all-flash storage using InfiniFlash systems to break performance barriers. It describes how Ceph has been optimized for flash storage and how InfiniFlash systems provide industry-leading performance of over 1 million IOPS and 6-9GB/s of throughput using SanDisk flash technology. The presentation also covers how InfiniFlash can provide scalable performance and capacity for large-scale enterprise workloads.
Agenda:
The Linux kernel has multiple "tracers" built-in, with various degrees of support for aggregation, dynamic probes, parameter processing, filtering, histograms, and other features. Starting from the venerable ftrace, introduced in kernel 2.6, all the way through eBPF, which is still under development, there are many options to choose from when you need to statically instrument your software with probes, or diagnose issues in the field using the system's dynamic probes. Modern tools include SystemTap, Sysdig, ktap, perf, bcc, and others. In this talk, we will begin by reviewing the modern tracing landscape -- ftrace, perf_events, kprobes, uprobes, eBPF -- and what insight into system activity these tools can offer. Then, we will look at specific examples of using tracing tools for diagnostics: tracing a memory leak using low-overhead kmalloc/kfree instrumentation, diagnosing a CPU caching issue using perf stat, probing network and block I/O latency distributions under load, or merely snooping user activities by capturing terminal input and output.
Speaker:
Sasha is the CTO of Sela Group, a training and consulting company based in Israel that employs over 400 developers world-wide. Most of Sasha's work revolves around performance optimization, production debugging, and low-level system diagnostics, but he also dabbles in mobile application development on iOS and Android. Sasha is the author of two books and three Pluralsight courses, and a contributor to multiple open-source projects. He blogs at http://blog.sashag.net.
XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C... - The Linux Foundation
As users demand greater scalability from Citrix XenServer, the transmission of performance data from guests via xenstore is increasingly becoming a bottleneck. Future use of service domains is likely to make this problem worse. A simple, efficient way of transmitting time-varying datasets between userspace components in different domains is required. This talk will propose a lock-free mechanism to allow interdomain reporting of performance data without relying on continuous xenstore usage, and describe how it fits into the XAPI toolstack.
An Updated Performance Comparison of Virtual Machines and Linux Containers - Kento Aoyama
The document compares the performance of virtual machines (KVM) and Linux containers (Docker) by running benchmarks that test CPU, memory, network, and file I/O performance. It finds that Docker containers perform comparably to native Linux for most benchmarks, while KVM virtual machines have higher overhead and perform worse than Docker containers or native Linux for several tests, especially those involving CPU, random memory access, and file I/O. The study provides a useful comparison of the performance of these two virtualization technologies.
This presentation provides an overview of the Dell PowerEdge R730xd server performance results with Red Hat Ceph Storage. It covers the advantages of using Red Hat Ceph Storage on Dell servers with their proven hardware components that provide high scalability, enhanced ROI cost benefits, and support of unstructured data.
Richard Wareing presented on using XFS realtime subvolumes to improve GlusterFS metadata performance. Traditional solutions like page caching are limited, while dedicated metadata stores add complexity. XFS realtime subvolumes combine benefits by storing metadata on SSDs for improved performance without changing GlusterFS core. Facebook is working on kernel patches to optimize realtime allocation and integration. The presentation addressed strengths and weaknesses of GlusterFS and opportunities to improve scaling and code quality.
Corosync and Pacemaker
A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system.
The components of a cluster are usually connected to each other through fast local area networks ("LAN"), with each node (computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of convergence of a number of computing trends including the availability of low cost microprocessors, high speed networks, and software for high performance distributed computing.
Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.
Computer clusters have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world, such as IBM's Sequoia.
Scylla Summit 2022: Making Schema Changes Safe with Raft - ScyllaDB
ScyllaDB adopted Raft as a consensus protocol in order to dramatically improve our operational aspects as well as provide strong consistency to the end-user. This talk will explain how Raft behaves in Scylla Open Source 5.0 and introduce the first end-user visible major improvement: schema changes. Learn how cluster configuration resides in Raft, providing consistent cluster assembly and configuration management. This makes bootstrapping safer and provides reliable disaster recovery when you lose the majority of the cluster.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and... - Danielle Womboldt
Optimizing Ceph performance by leveraging Intel Optane and 3D NAND TLC SSDs. The document discusses using Intel Optane SSDs as journal/metadata drives and Intel 3D NAND SSDs as data drives in Ceph clusters. It provides examples of configurations and analysis of a 2.8 million IOPS Ceph cluster using this approach. Tuning recommendations are also provided to optimize performance.
Using Recently Published Ceph Reference Architectures to Select Your Ceph Con... - Patrick McGarry
This document discusses using recently published Ceph reference architectures to select a Ceph configuration. It provides an inventory of existing reference architectures from Red Hat and SUSE. It previews highlights from an upcoming Intel and Red Hat Ceph reference architecture paper, including recommended configurations and hardware. It also describes an Intel all-NVMe Ceph benchmark configuration for MySQL workloads. In summary, reference architectures provide guidelines for building optimized Ceph solutions based on specific workloads and use cases.
DTrace was used to diagnose and address performance problems with an NFS server running OpenZFS. DTrace probes were added to measure NFS operation latency and identify where CPU time was being spent off-CPU. This revealed that sync writes were taking over 1 second in some cases due to throttling, and the ZFS write lock was a bottleneck. The write throttle was re-written and inefficiencies removed from the locking, dramatically improving performance. The key lessons were to identify the real problem, not just reproductions, iterate with the right tools and questions, and don't hide problems from customers.
Real-time in the real world: DIRT in production - bcantrill
This document discusses the challenges of building and debugging DIRT (data-intensive real-time) applications in production. It provides examples from the mobile push-to-talk app Voxer, which is described as a canonical DIRT app. Specific issues covered include application restarts inducing latency bubbles, dropped TCP connections causing latency outliers, and identifying sources of slow disk I/O. Tools like DTrace are highlighted as being essential for instrumentation and problem diagnosis in DIRT apps.
ZFConf 2011: What Sphinx is, why you need it at all, and how to use it with... - ZFConf Conference
The document provides an overview of Sphinx, an open source search engine. It discusses how Sphinx can handle large volumes of data faster than alternatives like MySQL. It also summarizes how to install Sphinx, configure indexes, perform indexing and searching, and how to scale Sphinx across multiple servers. Upcoming new features in version 2.0 are also briefly mentioned.
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
Presented at LISA18: https://www.usenix.org/conference/lisa18/presentation/babrou
This is a technical dive into how we used eBPF to solve real-world issues uncovered during an innocent OS upgrade. We'll see how we debugged a 10x CPU increase in Kafka after a Debian upgrade and what lessons we learned. We'll go from high-level effects like increased CPU, to flame graphs showing us where the problem lies, to tracing timers and function calls in the Linux kernel.
The focus is on tools that operational engineers can use to debug performance issues in production. This particular issue happened at Cloudflare on a Kafka cluster doing 100Gbps of ingress and many multiples of that in egress.
Exploring Parallel Merging In GPU Based Systems Using CUDA C - Rakib Hossain
We present a program implemented to execute an adaptive merge sort algorithm in parallel on a GPU-based system. The parallel implementation is used to get better runtime performance than a serial implementation by executing independent operations in parallel across the large number of cores in the GPU. Results from the parallel implementation of the algorithm are given and compared with the serial implementation on a runtime basis. The parallel version is implemented with the CUDA platform on a system based on an NVIDIA GPU (GTX 650).
SPCA2014 Advanced SharePoint Troubleshooting (Hessing) - NCCOMMS
This document provides an overview of advanced SharePoint troubleshooting techniques presented by Donald Hessing, a principal consultant and Microsoft Certified Master in SharePoint. It discusses tools and techniques for investigating performance issues such as Fiddler, LogParser, and analyzing IIS logs, Windows event logs, and performance counters on SharePoint servers and SQL servers. It also provides guidance on validating server hardware configurations including disks, network bandwidth, and virtualization settings.
Oracle Database In-Memory Option in Action - Tanel Poder
The document discusses Oracle Database In-Memory option and how it improves performance of data retrieval and processing queries. It provides examples of running a simple aggregation query with and without various performance features like In-Memory, vector processing and bloom filters enabled. Enabling these features reduces query elapsed time from 17 seconds to just 3 seconds by minimizing disk I/O and leveraging CPU optimizations like SIMD vector processing.
In Memory Database In Action by Tanel Poder and Kerry Osborne - Enkitec
The document discusses Oracle Database In-Memory option and how it improves performance of data retrieval and processing queries. It provides examples of running a simple aggregation query with and without various performance features like In-Memory, vector processing and bloom filters enabled. Enabling these features reduces query elapsed time from 17 seconds to just 3 seconds by minimizing disk I/O and leveraging CPU optimizations like SIMD vector processing.
1404 app dev series - session 8 - monitoring & performance tuning - MongoDB
This document discusses MongoDB monitoring tools and key metrics. It provides an overview of tools like mongostat, the MongoDB shell, MMS, and mtools for monitoring operations per second, memory usage, page faults, and other metrics. It also discusses using logs to analyze query performance and disk saturation. The importance of monitoring queued readers/writers, page faults, background flush processes, memory usage, locks, and other core metrics is highlighted.
The document provides an overview of using OpenDaylight (ODL) to implement software defined networking (SDN) and service function chaining (SFC) to solve networking problems. It discusses two approaches to bypassing deep packet inspection (DPI) using ODL: 1) Configuring flows on a switch via the ODL RESTCONF API and 2) Using ODL's service function chaining (SFC) application. Both approaches are demonstrated to reduce latency by avoiding sending traffic through a second DPI appliance.
The document discusses diagnosing and mitigating MySQL performance issues. It describes using various operating system monitoring tools like vmstat, iostat, and top to analyze CPU, memory, disk, and network utilization. It also discusses using MySQL-specific tools like the MySQL command line, mysqladmin, mysqlbinlog, and external tools to diagnose issues like high load, I/O wait, or slow queries by examining metrics like queries, connections, storage engine statistics, and InnoDB logs and data written. The agenda covers identifying system and MySQL-specific bottlenecks by verifying OS metrics and running diagnostics on the database, storage engines, configuration, and queries.
Troubleshooting Complex Performance issues - Oracle SEG$ contention - Tanel Poder
From Tanel Poder's Troubleshooting Complex Performance Issues series - an example of Oracle SEG$ internal segment contention due to some direct path insert activity.
Oracle Architecture document discusses:
1. The cost of an Oracle Enterprise Edition license is $47,500 per processor.
2. It provides an overview of key Oracle components like the instance, database, listener and cost based optimizer.
3. It demonstrates how to start an Oracle instance, check active processes, mount and open a database, and query it locally and remotely after starting the listener.
One of the great challenges of monitoring any large cluster is how much data to collect and how often to collect it. Those responsible for managing the cloud infrastructure want to see everything collected centrally, which places limits on how much and how often. Developers, on the other hand, want to see as much detail as they can at as high a frequency as reasonable without impacting the overall cloud performance.
To address what seems to be conflicting requirements, we've chosen a hybrid model at HP. Like many others, we have a centralized monitoring system that records a set of key system metrics for all servers at the granularity of 1 minute, but at the same time we do fine-grained local monitoring on each server of hundreds of metrics every second so when there are problems that need more details than are available centrally, one can go to the servers in question to see exactly what was going on at any specific time.
The tool of choice for this fine-grained monitoring is the open source tool collectl, which additionally has an extensible API. It is through this API that we've developed a Swift monitoring capability to capture not only the number of gets, puts, etc. every second, but also, using collectl's colmux utility, display these in a top-like format to see exactly what all the object and/or proxy servers are doing in real time.
We've also developed a second capability that allows one to see what the virtual machines are doing on each compute node in terms of CPU, disk, and network traffic. This data can also be displayed in real time with colmux.
This talk will briefly introduce the audience to collectl's capabilities but more importantly show how it's used to augment any existing centralized monitoring infrastructure.
Speakers
Mark Seger
The document discusses reverse engineering the firmware of Swisscom's Centro Grande modems. It identifies several vulnerabilities found, including a command overflow issue that allows complete control of the device by exceeding the input buffer, and multiple buffer overflow issues that can be exploited to execute code remotely by crafting specially formatted XML files. Details are provided on the exploitation techniques and timeline of coordination with Swisscom to address the vulnerabilities.
This document provides an in-depth overview of the LMS (Global Cache Service) process in Oracle RAC databases. It discusses how LMS uses pollsys system calls and sockets to listen for incoming messages. It also examines the workload distribution across LMS processes and how LMS applies undo blocks to construct consistent read (CR) buffers. Session-level statistics and tools like snapper.sql are demonstrated to analyze LMS workload and performance.
Node is used to build a reverse proxy to provide secure access to internal web resources and sites for mobile clients within a large enterprise. Performance testing shows the proxy can handle over 1000 requests per second with latency under 1 second. Code quality analysis tools like Plato and testing frameworks like Jest are useful for maintaining high quality code. Scalability is achieved through auto-scaling virtual machine instances with a load balancer and configuration management.
2. ZFS Was Slow, Is Faster
Adam Leventhal, CTO Delphix
@ahl
3. My Version of ZFS History
• 2001-2005 The 1st age of ZFS: building the behemoth
– Stability, reliability, features
• 2006-2008 The 2nd age of ZFS: appliance model and open source
– Completing the picture; making it work as advertised; still more features
• 2008-2010 The 3rd age of ZFS: trial by fire
– Stability in the face of real workloads
– Performance in the face of real workloads
4. The 1st Age of OpenZFS
• All the stuff Matt talked about, yes:
– Many platforms
– Many companies
– Many contributors
• Performance analysis on real and varied customer workloads
5. A note about the data
• The data you are about to see is real
• The names have been changed to protect the innocent (and guilty)
• It was mostly collected with DTrace
• We used some other tools as well: lockstat, mpstat
• You might wish I had more / different data – I do too
15. ZFS Write Throttle
• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay
16. ZFS Write Throttle
• Keep transactions to a reasonable size – limit outstanding data
• Target a fixed time (1-5 seconds on most systems)
• Figure out how much we can write in that time
• Don’t accept more than that amount of data in a txg
• When we get to 7/8ths of the limit, insert a 10ms delay
WTF!?
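To make the old behavior concrete, here is a minimal C sketch of the throttle described on slides 15-16. The structure, names, and the recalibration step are assumptions for illustration, not the actual OpenZFS code; only the per-txg limit, the 7/8ths threshold, and the flat 10ms delay come from the slides.

/* Illustrative sketch of the *old* write throttle described above; not the actual OpenZFS code. */
#include <stdint.h>
#include <unistd.h>    /* usleep() */

struct txg_state {
    uint64_t write_limit;    /* bytes we think we can sync in the target time */
    uint64_t dirty_bytes;    /* bytes already accepted into the open txg */
};

/*
 * Accept or reject an incoming write of 'size' bytes.  Returns 0 if the
 * write fits in the open txg, -1 if the caller must wait for the next one.
 */
static int
throttle_write(struct txg_state *txg, uint64_t size)
{
    /* At 7/8ths of the limit, every write eats a flat 10ms delay. */
    if (txg->dirty_bytes > txg->write_limit / 8 * 7)
        usleep(10 * 1000);

    /* Past the limit, don't accept more data into this txg. */
    if (txg->dirty_bytes + size > txg->write_limit)
        return (-1);

    txg->dirty_bytes += size;
    return (0);
}

/*
 * After each sync, re-estimate the limit from observed bandwidth and the
 * target sync time (1-5 seconds on most systems).
 */
static void
throttle_recalibrate(struct txg_state *txg, uint64_t bytes_synced,
    uint64_t sync_ms, uint64_t target_ms)
{
    if (sync_ms > 0)
        txg->write_limit = bytes_synced * target_ms / sync_ms;
    txg->dirty_bytes = 0;
}

Presumably the "WTF!?" points at the two sharp edges visible even in this sketch: write latency is either negligible or a full 10ms with nothing in between, and the limit itself swings with whatever the previous sync happened to measure.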
26. IO Problems
• The choice of IO queue depth was crucial
– Where did the default of 10 come from?!
– Balance between latency and throughput
• Shared IO queue for reads and writes
– Maybe this makes sense for disks… maybe…
• The wrong queue depth caused massive queuing within ZFS
– “What do you mean my SAN is slow? It looks great to me!”
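A rough sketch of the shared-queue behavior slide 26 criticizes; the queue depth of 10 comes from the slide, while the types and names are invented for illustration. Reads and writes draw from one budget of outstanding IOs, so a burst of asynchronous writes can hold all ten slots and park reads in a pending list inside ZFS while the device or SAN underneath reports light load.

/* Illustrative sketch of the old shared per-vdev IO queue; not actual OpenZFS code. */
#include <stddef.h>

#define OLD_QUEUE_DEPTH 10              /* the historical default discussed above */

typedef struct io {
    struct io *next;
    int        is_write;
} io_t;

typedef struct vdev_queue {
    int   active;                       /* reads and writes share this one count */
    io_t *pending_head;
    io_t *pending_tail;                 /* everything over the limit waits here */
} vdev_queue_t;

static void issue_to_device(io_t *io) { (void)io; }   /* stub */

/*
 * One queue, one depth limit, no distinction by IO type: ten in-flight
 * async writes are enough to park every read in the pending list, so the
 * queuing builds up inside ZFS while the device below looks lightly loaded.
 */
static void
vdev_queue_io(vdev_queue_t *vq, io_t *io)
{
    if (vq->active < OLD_QUEUE_DEPTH) {
        vq->active++;
        issue_to_device(io);
        return;
    }
    io->next = NULL;
    if (vq->pending_tail != NULL)
        vq->pending_tail->next = io;
    else
        vq->pending_head = io;
    vq->pending_tail = io;
}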
27. New IO Scheduler
• Choose a limit on the “dirty” (modified) data on the system
• As more accumulates, schedule more concurrent IOs
• Limits per IO type
• If we still can’t keep up, start to limit the rate of incoming data
• Chose defaults as close to the old behavior as possible
• Much more straightforward to measure and tune
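A hedged sketch of the scheme slide 27 describes. The constants and function names are placeholders (the real OpenZFS tunables and code differ); what it illustrates is the shape of the design: a fixed cap on dirty data, write concurrency that scales with how full that cap is, and a delay on incoming writes that grows smoothly as dirty data approaches the cap instead of a flat 10ms step.

/* Illustrative sketch of the new scheme; constants and names are placeholders. */
#include <stdint.h>

#define DIRTY_MAX           (4ULL << 30)  /* assumed cap on dirty data */
#define DELAY_THRESHOLD_PCT 60            /* assumed point where delays begin */
#define ASYNC_WRITE_MIN     1             /* concurrent async writes, floor */
#define ASYNC_WRITE_MAX     10            /* concurrent async writes, ceiling */

/*
 * More dirty data -> schedule more concurrent async write IOs, so the pool
 * writes harder as it falls behind (simple linear interpolation).
 */
static int
async_writes_to_issue(uint64_t dirty)
{
    if (dirty >= DIRTY_MAX)
        return (ASYNC_WRITE_MAX);
    return (ASYNC_WRITE_MIN +
        (int)((ASYNC_WRITE_MAX - ASYNC_WRITE_MIN) * dirty / DIRTY_MAX));
}

/*
 * If the pool still can't keep up, push back on incoming writes: no delay
 * below the threshold, then a delay that grows smoothly as dirty data
 * approaches the cap, rather than a flat 10ms cliff.
 */
static uint64_t
write_delay_us(uint64_t dirty)
{
    uint64_t threshold = DIRTY_MAX * DELAY_THRESHOLD_PCT / 100;

    if (dirty <= threshold)
        return (0);
    if (dirty >= DIRTY_MAX)
        return (10000);                   /* saturate; purely illustrative */
    return (1000 * (dirty - threshold) / (DIRTY_MAX - dirty));
}

Because the delay is a smooth function of one observable quantity (dirty data), it is easier to measure and tune than the old bandwidth-estimate-plus-10ms behavior, which is the point of the last bullet above.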
32. Name that lock!
> 0xffffff0d4aaa4818::whatis
ffffff0d4aaa4818 is ffffff0d4aaa47fc+20, allocated from taskq_cache
> 0xffffff0d4aaa4818-20::taskq
ADDR             NAME              ACT/THDS  Q'ED   MAXQ  INST
ffffff0d4aaa47fc zio_write_issue      0/ 24     0  26977     -
33. Lock Breakup
• Broke up the taskq lock for write_issue
• Added multiple taskqs, randomly assigned
• Recently hit a similar problem for read_interrupt
• Same solution
• Worth investigating taskq stats
• A dynamic taskq might be an interesting experiment
• Other lock contention issues resolved
• Still more need additional attention
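The taskq breakup can be pictured roughly like this; the types, the count of eight taskqs, and the dispatch function are invented for illustration, but the idea matches the slide: several taskqs per IO stage, each with its own lock, with dispatchers assigned randomly so they rarely collide.

/* Illustrative sketch of splitting one contended taskq into several; not the actual code. */
#include <stdlib.h>
#include <pthread.h>

#define WRITE_ISSUE_TASKQS 8                /* assumed count, for illustration */

typedef void (task_func_t)(void *);

typedef struct taskq {
    pthread_mutex_t tq_lock;                /* the lock that was being fought over */
    /* ... work list and worker threads omitted ... */
} taskq_t;

static taskq_t write_issue_tq[WRITE_ISSUE_TASKQS];

static void
write_issue_taskqs_init(void)
{
    for (int i = 0; i < WRITE_ISSUE_TASKQS; i++)
        pthread_mutex_init(&write_issue_tq[i].tq_lock, NULL);
}

/* Enqueue one task under that taskq's own lock (body elided). */
static void
taskq_dispatch(taskq_t *tq, task_func_t *fn, void *arg)
{
    pthread_mutex_lock(&tq->tq_lock);
    (void)fn; (void)arg;                    /* ... append to work list, wake a worker ... */
    pthread_mutex_unlock(&tq->tq_lock);
}

/*
 * Dispatch a write-issue task: pick one of the taskqs at random so that
 * concurrent dispatchers rarely collide on the same tq_lock.
 */
static void
zio_write_issue_dispatch(task_func_t *fn, void *arg)
{
    taskq_dispatch(&write_issue_tq[rand() % WRITE_ISSUE_TASKQS], fn, arg);
}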
40. What about all space_map_*() functions?
space_map_truncate             33 times        6ms (  0%)
space_map_load_wait          1721 times        7ms (  0%)
space_map_sync               3766 times      210ms (  0%)
space_map_unload              135 times     1268ms (  0%)
space_map_free              21694 times     4280ms (  1%)
space_map_vacate             3643 times    45891ms ( 12%)
space_map_seg_compare    13124822 times    55423ms ( 14%)
space_map_add              580809 times    79868ms ( 21%)
space_map_remove           514181 times    81682ms ( 21%)
space_map_walk               2081 times   120962ms ( 32%)
spa_sync                        1 times   374818ms (100%)
42. Spacemaps and Metaslabs
• Two things going on here:
– 30,000+ segments per spacemap
– Building the perfect spacemap – close enough would work
– Doing a bunch of work that we can clever our way out of
• Still much to be done:
– Why 200 metaslabs per LUN?
– Allocations can still be very painful
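For a sense of where the space_map_seg_compare and space_map_add/remove time on slide 40 goes, here is a simplified, illustrative sketch (not the OpenZFS space map code) of an offset-ordered segment tree: with tens of thousands of segments per spacemap, every allocation and free walks this tree, and coalescing adjacent free segments adds further lookups.

/* Illustrative sketch of an offset-ordered segment tree; not the OpenZFS space map code. */
#include <stdint.h>
#include <stddef.h>

typedef struct seg {                    /* one free segment: [start, end) */
    uint64_t start;
    uint64_t end;
    struct seg *left, *right;
} seg_t;

/*
 * Comparator keeping segments ordered by offset.  With ~30,000 segments per
 * space map, every add/remove performs O(log n) of these, and a sync pass
 * does hundreds of thousands of adds and removes -- roughly where the
 * 13 million seg_compare calls in the earlier profile come from.
 */
static int
seg_compare(const seg_t *a, const seg_t *b)
{
    if (a->start < b->start)
        return (-1);
    if (a->start > b->start)
        return (1);
    return (0);
}

/* Insert a free segment by walking the tree (rebalancing and coalescing omitted). */
static void
seg_add(seg_t **root, seg_t *s)
{
    while (*root != NULL) {
        /* A real implementation would also coalesce with adjacent segments,
         * which costs additional lookups and removes. */
        root = (seg_compare(s, *root) < 0) ? &(*root)->left : &(*root)->right;
    }
    s->left = s->right = NULL;
    *root = s;
}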
43. The Next Age of OpenZFS
• General purpose and purpose-built OpenZFS products
• Used for varied and demanding uses
• Data-driven discoveries
– Write throttle needed rethinking
– Metaslabs / spacemaps / allocation is fertile ground
– Performance nose-dives around 85% of pool capacity
– Lock contention impacts high-performance workloads
• What’s next?
– More workloads; more data!
– Feedback on recent enhancements
– Connect allocation / scrub to the new IO scheduler
– Consider data-driven, adaptive algorithms within OpenZFS