Tuning RHEL For Databases
Sanjay Rao, Principal Software Engineer, Red Hat. May 06, 2011
Bare metal
Tools
Reading Graphs
- Proactive or Reactive
- Understand the trade-offs
- No silver bullet
- You get what you pay for
What To Tune
- I/O
- Memory
- CPU
- Network
This session will cover I/O and Memory extensively.
Multiple HBAs
How To
Deadline
- Two queues per device, one for reads and one for writes
- I/Os dispatched based on time spent in the queue
CFQ
- Per-process queues
- Each process queue gets a fixed time slice (based on process priority)
Noop
- Simple FIFO queue with basic request merging; minimal CPU overhead
Boot-time
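A rough sketch of both ways to select the elevator; the grub kernel line and the sdb device name are placeholders, not from the original slides:

    # boot time: set the default elevator on the kernel command line in grub.conf
    #   kernel /vmlinuz-2.6.32-... ro root=... elevator=deadline
    # runtime: change the elevator for a single device (sdb is a placeholder)
    cat /sys/block/sdb/queue/scheduler             # available schedulers, current one in brackets
    echo deadline > /sys/block/sdb/queue/scheduler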
[Figure: CFQ vs. Deadline on a multipath device - throughput by I/O size (8K-64K) for sequential and random reads and writes, with % difference (CFQ vs. Deadline)]
[Figure: I/O scheduler throughput by I/O size (8K-64K) for sequential and random reads and writes, with % difference]
[Figure: transactions per minute vs. user count (10U-60U) for the Deadline, CFQ, and Noop elevators]
DSS Workload
Comparison CFQ vs Deadline
Oracle DSS Workload (with different thread count)
[Figure: Oracle DSS workload - elapsed time at parallel degree 16 and 32]
Direct I/O
- Avoid double caching
- Predictable performance
- Reduce CPU overhead

Asynchronous I/O
- Eliminate synchronous I/O stalls
- Critical for I/O-intensive applications

Configure read ahead
- Database (parameters to configure read ahead)
- Block devices (blockdev --getra / --setra; see the sketch below)
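A minimal sketch of checking and setting block-device read-ahead with blockdev; the device path is a placeholder, and read-ahead is expressed in 512-byte sectors:

    blockdev --getra /dev/sdb          # show current read-ahead (in 512-byte sectors)
    blockdev --setra 8192 /dev/sdb     # set read-ahead to 8192 sectors (4 MB)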
[Figure: transactions per minute]
- Separate files by I/O type (data, logs, undo, temp)
  - OLTP: data files / logs
  - DSS: data files / temp files
- Use low-latency / high-bandwidth devices for hot spots
[Figure: transactions per minute vs. user count (10U, 40U, 80U)]
[Figure: transactions per minute - 4Gb FC vs. Fusion-io]
[Figure: elapsed time - 4Gb FC vs. Fusion-io]
[Figure: transactions per minute - 2 FC arrays vs. Fusion-io]
Different measurement metrics
What To Tune
Memory Tuning
- NUMA is required for scaling
- RHEL 5 / 6 are completely NUMA aware
- Additional performance gains by enforcing NUMA placement
- numactl: CPU and memory pinning (see the sketch below)
- taskset: CPU pinning
- cgroups (RHEL 6 only)
- libvirt: CPU pinning for KVM guests
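A minimal sketch of the pinning tools listed above; the node number, CPU list, and db_start command are placeholders:

    numactl --hardware                               # show the NUMA topology
    numactl --cpunodebind=0 --membind=0 ./db_start   # bind the DB to node 0's CPUs and memory
    taskset -c 0-7 ./db_start                        # alternative: pin to CPUs 0-7 only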
- 2M pages vs. the standard 4K Linux page
- Virtual-to-physical page map is 512 times smaller
- TLB can map more physical memory, resulting in fewer misses
- Traditional huge pages are always pinned
- Transparent Huge Pages in RHEL 6
- Most databases support huge pages
- How to configure huge pages (16G): see the sketch below
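One way to reserve 16G of 2M huge pages (8192 pages); the numbers are illustrative:

    echo 8192 > /proc/sys/vm/nr_hugepages                # reserve 8192 x 2M = 16G at runtime
    echo "vm.nr_hugepages = 8192" >> /etc/sysctl.conf    # make the reservation persistent
    grep Huge /proc/meminfo                              # verify HugePages_Total / HugePages_Free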
[Figure: transactions per minute - default 4K pages vs. huge pages]
[Figure: transactions per minute - non-NUMA, NUMA, non-NUMA with huge pages, and NUMA with huge pages]
- Huge page allocation takes place uniformly across NUMA nodes
- Make sure that database shared segments are sized to fit
- Workaround: allocate huge pages / start the DB / de-allocate huge pages
Example: 128G physical memory, 4 NUMA nodes
- Huge pages 80G: 20G in each NUMA node
  - A 24G DB shared segment using huge pages fits (pages come from all nodes)
  - A 24G DB shared segment using NUMA and huge pages does not fit in one node's 20G
- Huge pages 100G: 25G in each NUMA node, so the 24G segment fits even when bound to a single node
Per-node huge page counts can be checked as shown below.
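A quick way to see how the reserved huge pages are spread across nodes (assuming 2M pages, hence the hugepages-2048kB path):

    cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages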
Drop unused cache
- Frees unused memory held by the file cache (command below)
- If the DB relies on the file cache, you may notice a slowdown
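A sketch of dropping caches from the command line; only clean cache is dropped, so flush dirty pages first:

    sync                                  # flush dirty pages to disk
    echo 1 > /proc/sys/vm/drop_caches     # drop the page cache
    echo 3 > /proc/sys/vm/drop_caches     # drop the page cache plus dentries and inodes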
CPU Tuning
CPU performance
How To
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor best of both worlds cron jobs to configure the governor mode ktune (RHEL5) tuned-adm profile server-powersave (RHEL6)
[Figure: elapsed time with the performance, ondemand, and powersave governors]
vmstat output during test:

    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b  swpd      free   buff  cache si so     bi     bo    in    cs us sy id wa st
     7 12  5884 122884416 485900 734376  0  0 184848  39721  9175 37669  4  1 89  6  0
     7 12  5884 122885024 485900 734376  0  0 217766  27468  9904 42807  4  2 87  6  0
     2  0  5884 122884928 485908 734376  0  0 168496  45375  6294 27759  4  1 90  5  0
     7 11  5884 122885056 485912 734372  0  0 178790  40969  9433 38140  4  1 90  5  0
     1 15  5884 122885176 485920 734376  0  0 248283  19807  7710 37788  5  2 86  7  0
Network Tuning
Network Performance
- Separate networks for different functions
- If on the same network, use arp_filter to prevent ARP flux (persistence shown below):
  echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
10GigE
- Supports RDMA with the RHEL 6 High Performance Networking package
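To make the arp_filter setting persistent across reboots, a sketch using the standard sysctl mechanism:

    echo "net.ipv4.conf.all.arp_filter = 1" >> /etc/sysctl.conf
    sysctl -p    # apply the settings immediately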
Database Performance
Application tuning
Cache = none
- I/O from the guest is not cached on the host

Cache = writethrough
- I/O from the guest is cached and written through on the host
- Works well on large systems (lots of memory and CPU)
- Potential scaling problems with multiple guests (host CPU used to maintain the cache)
- Can lead to swapping on the host
How to configure the I/O cache per disk: on the qemu command line or via libvirt (example below)
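A sketch of setting the cache mode per disk on the qemu-kvm command line; the db_lun path is a placeholder, and libvirt exposes the same options in the disk driver element:

    # other guest options omitted; db_lun is a placeholder device
    qemu-kvm ... -drive file=/dev/mapper/db_lun,if=virtio,cache=none,aio=native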
[Figure: 1 guest vs. 4 guests]
Configurable per device:
- Virt-Manager: drop-down option under Advanced Options
- Libvirt xml file: <driver name='qemu' type='raw' cache='writethrough' io='native'/>
[Figure: transactions per minute vs. user count (10U, 20U)]
Configurable per device (only via the xml configuration file):
- Libvirt xml file: <driver name='qemu' type='raw' cache='writethrough' io='native'/>
[Figure: transactions per minute - 1, 2, and 4 guests]
[Figure: elapsed time - 1, 2, and 4 guests]
[Figure: transactions per minute - 4 guests (24 vCPU, 56G) with and without NUMA pinning]
VirtIO
- VirtIO drivers for network
- Bypass the qemu layer (vhost-net)

Device assignment (PCI pass-through)
- Bypass the host and pass the PCI device to the guest
- Can be passed to only one guest

SR-IOV
- Pass-through to the guest
- Can be shared among multiple guests
- Limited hardware support
[Figure: latency (usecs)]
Monitoring tools: top, vmstat, ps, iostat, netstat, sar, perf
Kernel tools: /proc, sysctl, AltSysRq
Networking: ethtool, ifconfig
Profiling: oprofile, strace, ltrace, systemtap, perf
I/O
- Choose the right elevator
- Eliminate hot spots
- Direct I/O or Asynchronous I/O
- Virtualization: caching

Memory
- NUMA
- Huge Pages
- Swapping
- Managing caches

CPU
Network
Wrap Up: Virtualization
- VirtIO drivers
- aio (native)
- NUMA
- Cache options (none, writethrough)
- Network (vhost-net)