Optimizing Servers for High-Throughput and Low-Latency at Dropbox

Alexey Ivanov
Software Engineer at Dapper Labs

■ Previously: Traffic, Networking, and Databases @Dropbox
■ Performance: Hardware. OS. Application. RUM.

Optimizing (web-)Servers 5 Years Later…
This is an updated version of the nginx.conf’17 talk.
Changelog:
■ New hardware features are available. AMD EPYCs and ARM64 are a thing.
■ New Linux kernel features, especially around observability.
■ The nginx-specific focus is replaced with a generic HTTP server/client focus.
● (Most of the clients and servers nowadays are HTTP- or HTTP/2-based)

High-level vs Low-level Optimizations
The biggest performance gains usually come from high-level optimizations:
load balancing, algorithms, data structures, and (especially) business logic.
A few examples from large-scale production systems:
■ The lower the variance in backend load, the better.
● Applying “Two Random Choices” load balancing greatly reduced latencies.
■ The fastest code is “no code”.
● E.g. at Dropbox we pre-compressed static files for the web, so we spent 0% CPU on compression while
maintaining the best possible compression ratio.
■ Algorithm improvements.
● Switching from zlib to brotli saved us both CPU and storage.
■ Data locality improvements.
● Switching from B+tree to LSM-based storage improved compression efficiency and reduced
database sizes by ~2.5x.
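
As a rough illustration of the “Two Random Choices” idea mentioned above, here is a minimal Python sketch. It is not the Dropbox implementation: the function names, the backend count, and the simplification that load only ever grows (requests never finish) are all assumptions made for the example.

```python
import random

def pick_two_choices(loads, rng):
    """Sample two distinct backends and return the index of the less loaded one."""
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

def pick_random(loads, rng):
    """Baseline: pick a backend uniformly at random."""
    return rng.randrange(len(loads))

def simulate(n_backends, n_requests, chooser, seed=0):
    """Assign requests one by one and return the final per-backend load.

    Simplification (assumption): requests never complete, so load only grows.
    """
    rng = random.Random(seed)
    loads = [0] * n_backends
    for _ in range(n_requests):
        loads[chooser(loads, rng)] += 1
    return loads

if __name__ == "__main__":
    for chooser in (pick_random, pick_two_choices):
        loads = simulate(10, 10_000, chooser)
        print(chooser.__name__, "spread:", max(loads) - min(loads))
```

Comparing the spread (max minus min load) between the two choosers shows the variance reduction the slide refers to.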

CPU and Memory
Generally, picking the newest processor is the best choice since it will have the
most hardware offloads:
■ AVX2, BMI, ADX, AVX-512, AES-NI, SHA-NI (x86)
● (Symmetric/Asymmetric encryption, signatures, hashing, MACs)
■ PMUL, PMULL2, SHA256H, SHA3 (ARMv8.2+)
● (finite field arithmetic, hashing, MACs)
Many things that were previously prohibitively expensive are now almost
free thanks to hardware offloads: mTLS, cryptographic hashing, storage encryption.
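
One way to see which of these offloads a given Linux machine exposes is to look at the flags in /proc/cpuinfo. The sketch below is only an illustration: the `WANTED` lists are my own selection, and the flag spellings (for example `aes` for AES-NI, `sha_ni` for SHA-NI, `pmull` on ARM64) are the names Linux usually prints, which can differ between kernel versions.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU found in /proc/cpuinfo text.

    x86 kernels use a "flags" line, ARM64 kernels use "Features".
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()

# Illustrative selection only (assumption), not an exhaustive list.
WANTED = {
    "x86": ["avx2", "bmi2", "adx", "avx512f", "aes", "sha_ni"],
    "arm64": ["pmull", "aes", "sha2", "sha3"],
}

def report(flags, arch):
    """Map each wanted feature name to True/False for this CPU."""
    return {name: name in flags for name in WANTED[arch]}

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except OSError:
        flags = set()  # not Linux, or /proc is unavailable
    print(report(flags, "x86"))
```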

CPU and Memory (Cont’d)
What if the budget is limited? Rules of thumb:
■ Low-latency: single NUMA node, bigger caches, disabled SMT, more GHz,
more memory channels.
■ High-throughput: more cores, enabled SMT, more memory.
Frequently, in production, high CPU usage does not mean a CPU bottleneck but a
“CPU pipeline stall” problem, i.e. a cache, TLB, or memory-bandwidth limitation.

github.com/andikleen/pmu-tools
# toplev.py -l1 --single-thread --force-events ./app
BE Backend_Bound: 60.34%
This category reflects slots where no uops are being
delivered due to a lack of required resources for
accepting more uops in the Backend of the pipeline...

github.com/andikleen/pmu-tools
# toplev.py -l3 --single-thread --force-events ./app
BE Backend_Bound: 60.42%
BE/Mem Backend_Bound.Memory_Bound: 32.23%
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 32.44%
This metric represents how often CPU was stalled without
missing the L1 data cache...
BE/Core Backend_Bound.Core_Bound: 45.93%
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 45.93%
This metric represents cycles fraction application was
stalled due to Core computation issues (non divider-
related)...

NICs
Relevant only for real hardware, not clouds.
■ 25 Gbit/s or faster; older NICs are likely to have miscellaneous bottlenecks.
■ Open-source drivers, small firmware, active community.
● In case (or, more likely, when) issues occur.

Pressure Stall Information (PSI)
“PSI provides for the first time a canonical way to see resource pressure increases
as they develop, with new pressure metrics for three major resources—
memory, CPU, and IO.”
Source: https://facebookmicrosites.github.io/psi/docs/overview

PSI: global and Per-cgroup (v2)
$ cat /proc/pressure/io
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
$ cat /sys/fs/cgroup/cg1/io.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
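
The pressure files shown above are plain text, so they are easy to scrape into a metrics pipeline. A minimal Python sketch of a parser follows; the function name and the non-zero numbers in the sample are made up for illustration.

```python
def parse_psi(text):
    """Parse the contents of a PSI file into {"some": {...}, "full": {...}}.

    avg10/avg60/avg300 are percentages; total is cumulative stall time in microseconds.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result

if __name__ == "__main__":
    # Made-up sample in the same format as /proc/pressure/io.
    sample = (
        "some avg10=1.50 avg60=0.30 avg300=0.05 total=123456\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    )
    print(parse_psi(sample)["some"]["avg10"])
```

The same function works for the per-cgroup files (e.g. io.pressure), since they use the same format.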

Understanding
Software Dynamics
by Richard L. Sites

Kernel Optimizations
The best Linux optimization is a recent kernel version. New kernel versions bring
improvements to networking, memory management, I/O, and the rest of the Linux
subsystems.
But most importantly, they bring improvements to observability tooling.

CPU and Memory
After you’ve picked the best CPU for your workload, you’ll need to utilize it to the
max:
■ For Intel/AMD you would want to use the intel_pstate or amd-pstate driver.
● If you want to be more energy efficient you may consider using the schedutil governor. Use
performance otherwise.
■ Set NUMA affinity for your application.
■ Use transparent huge pages.
● Be careful here: this may reduce performance on some workloads.
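
Before changing any of these settings it helps to check what the machine is currently using. The following Python sketch reads the usual sysfs locations for the cpufreq driver, the governor, and the transparent huge pages mode. The helper names are hypothetical, and the paths are the common ones on recent kernels but may be absent on some systems (VMs, containers), in which case `None` is printed.

```python
from pathlib import Path

def selected_option(text):
    """Return the bracketed entry of a sysfs choice file, e.g. 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

def read_first(path):
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

if __name__ == "__main__":
    cpufreq = "/sys/devices/system/cpu/cpu0/cpufreq"
    print("driver:  ", read_first(f"{cpufreq}/scaling_driver"))
    print("governor:", read_first(f"{cpufreq}/scaling_governor"))
    thp = read_first("/sys/kernel/mm/transparent_hugepage/enabled")
    print("THP:     ", selected_option(thp) if thp else None)
```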

Networking
The main goal of low-level tuning is to parallelize packet processing, add affinities,
increase buffer sizes, and enable hardware offloads.
■ ethtool is your friend here: # of queues, ring buffers, offloads, coalescing.
● -L, -G, -K, -C, etc.
● -S is your friend to keep track of drops/misses/errors/overruns/etc.
■ Mellanox and Intel cards come with set_irq_affinity/mlnx_affinity.
● Do not forget to turn off irqbalance.
■ After RSS is enabled it is generally a good idea to turn on XPS and xps_rxqs.
■ Avoid RPS. RFS can also have negative consequences.
■ For low latency: try to stay within the NUMA node that the PCIe NIC is attached to.
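
IRQ affinity and XPS are configured with CPU bitmasks written to files such as /proc/irq/<n>/smp_affinity and /sys/class/net/<dev>/queues/tx-<n>/xps_cpus. The sketch below only builds and prints such masks; it does not write anything. The one-CPU-per-queue layout, the interface name `eth0`, and the CPU list are assumptions for illustration, and vendor scripts such as set_irq_affinity may choose a different layout.

```python
def cpu_mask(cpus):
    """Format CPU ids as a kernel bitmap string: comma-separated 32-bit hex groups."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    groups = []
    while True:
        groups.append(bits & 0xFFFFFFFF)
        bits >>= 32
        if not bits:
            break
    head, *rest = reversed(groups)  # most significant group first
    return ",".join([f"{head:x}"] + [f"{g:08x}" for g in rest])

def one_cpu_per_queue(node_cpus, n_queues):
    """Hypothetical layout: queue i is served by the i-th CPU of the NIC's NUMA node."""
    return {q: cpu_mask([node_cpus[q % len(node_cpus)]]) for q in range(n_queues)}

if __name__ == "__main__":
    # "eth0" and the CPU list are placeholders, not values from the talk.
    for queue, mask in one_cpu_per_queue([0, 2, 4, 6], 4).items():
        print(f"echo {mask} > /sys/class/net/eth0/queues/tx-{queue}/xps_cpus")
```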

Networking (Cont’d)
The main goal of high-level tuning is to remove transport-level bottlenecks.
■ Enabling BBR congestion control is generally a good idea.
■ Enabling FQ scheduler w/ pacing is always a good idea.
■ Your friends here are RUM metrics and
ss -n --extended --info or getsockopt(TCP_INFO/TCP_CC_INFO)

iproute2
$ ss -tie
…
ts sack bbr rto:220 rtt:16.139/10.041 ato:40 mss:1448 pmtu:1500 rcvmss:1269
advmss:1428 cwnd:106 ssthresh:52 bytes_sent:9070462 bytes_retrans:3375
bytes_acked:9067087 bytes_received:5775 segs_out:6327 segs_in:551
data_segs_out:6315 data_segs_in:12
bbr:(bw:99.5Mbps,mrtt:1.912,pacing_gain:1,cwnd_gain:2) send 76.1Mbps
lastsnd:9896 lastrcv:10944 lastack:9864 pacing_rate 98.5Mbps delivery_rate
27.9Mbps delivered:6316 busy:3020ms rwnd_limited:2072ms(68.6%) retrans:0/5
dsack_dups:5 rcv_rtt:16.125 rcv_space:14400 rcv_ssthresh:65535 minrtt:1.907
…
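
Output like the above can be turned into per-connection metrics with a small amount of parsing. The Python sketch below is an assumption-laden illustration, not an official parser: the `ss` text format is not a stable interface, and the function names are hypothetical. It extracts `key:value` tokens plus the space-separated rate fields and then pulls out the `rwnd_limited` percentage, which in the sample above suggests the connection spent most of its busy time limited by the receiver window.

```python
import re

RATE_KEYS = ("send", "pacing_rate", "delivery_rate")

def parse_ss_info(text):
    """Collect key:value tokens and 'key value' rate fields from `ss -i` output."""
    info = {}
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok in RATE_KEYS and i + 1 < len(tokens):
            info[tok] = tokens[i + 1]
        elif ":" in tok and not tok.startswith("bbr:("):
            key, _, value = tok.partition(":")
            info[key] = value
    return info

def rwnd_limited_pct(info):
    """Percentage of busy time spent receive-window limited (0.0 if absent)."""
    match = re.search(r"\(([\d.]+)%\)", info.get("rwnd_limited", ""))
    return float(match.group(1)) if match else 0.0

if __name__ == "__main__":
    sample = (
        "ts sack bbr rto:220 rtt:16.139/10.041 cwnd:106 "
        "pacing_rate 98.5Mbps delivery_rate 27.9Mbps "
        "busy:3020ms rwnd_limited:2072ms(68.6%) retrans:0/5"
    )
    info = parse_ss_info(sample)
    print(info["rtt"], info["cwnd"], rwnd_limited_pct(info))
```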

Sysctl Cargo Culting
It is impossible to talk about network tuning w/o mentioning sysctls. Here are a
couple of relatively safe ones.
■ net.ipv4.tcp_slow_start_after_idle=0
● Should be safe if FQ w/ pacing is enabled.
■ net.ipv4.tcp_mtu_probing=1
● A must-have on the edge (along with a slightly reduced advmss).
■ net.ipv4.tcp_rmem, net.ipv4.tcp_wmem
● Should be big enough for connections to not be rcv/snd window limited.
■ net.ipv4.tcp_notsent_lowat=262144
● Or even lower if HTTP/2 prioritization is used.
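
Rather than copying values blindly, it can be useful to first compare them with what a host is actually running. Here is a minimal Python sketch that reads sysctls through /proc/sys and reports the differences. The `RECOMMENDED` dict simply mirrors the single-valued settings listed above, the helper names are hypothetical, and the dotted-name-to-path conversion is a simplification that would break for keys containing interface names with dots.

```python
from pathlib import Path

# Mirrors the single-valued settings from the list above.
RECOMMENDED = {
    "net.ipv4.tcp_slow_start_after_idle": "0",
    "net.ipv4.tcp_mtu_probing": "1",
    "net.ipv4.tcp_notsent_lowat": "262144",
}

def read_sysctl(name, root="/proc/sys"):
    """Return the current value with whitespace normalized, or None if unreadable."""
    path = Path(root, *name.split("."))
    try:
        return " ".join(path.read_text().split())
    except OSError:
        return None

def diff_sysctls(expected, root="/proc/sys"):
    """Return {name: (current, wanted)} for every setting that differs."""
    diffs = {}
    for name, wanted in expected.items():
        current = read_sysctl(name, root)
        if current != wanted:
            diffs[name] = (current, wanted)
    return diffs

if __name__ == "__main__":
    for name, (current, wanted) in diff_sysctls(RECOMMENDED).items():
        print(f"{name}: current={current} recommended={wanted}")
```

The `root` parameter exists only so the check can be pointed at a test directory instead of the live system.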

Systems
Performance
BPF
Performance Tools
by Brendan Gregg

Compiler Flags, Toolchains, and Runtimes
Keeping your compiler/runtime up-to-date is generally a good idea.
■ Compiler upgrade, -O2, and -mtune can visibly affect performance.
● You can also try keeping -march/GOAMD64 in sync with your (cloud) hardware.
■ Link time optimization (LTO) can give a measurable perf boost.
■ Runtime upgrade can frequently give you single to double digit perf
improvements.
● For example, Go runtime upgrades frequently deliver memory/cpu usage improvements.
■ (Toolchain upgrades are also great from the security perspective)
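
To keep -march/GOAMD64 in sync with the hardware you first need to know which x86-64 microarchitecture level the fleet supports (GOAMD64 accepts v1 through v4). The Python sketch below guesses the level from CPU flags. The required-flag sets are my reading of the x86-64 psABI levels translated into the names Linux prints in /proc/cpuinfo (for example `pni` for SSE3 and `abm` for LZCNT); treat them as an assumption and verify against your toolchain before relying on the result.

```python
# Assumed mapping of psABI levels to /proc/cpuinfo flag names; verify before use.
LEVELS = [
    ("v2", {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    ("v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    ("v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]

def x86_64_level(flags):
    """Return the highest level whose flags, and all lower levels' flags, are present."""
    level = "v1"
    for name, required in LEVELS:
        if not required <= flags:
            break
        level = name
    return level

def read_flags(path="/proc/cpuinfo"):
    """Return the flag set of the first CPU, or an empty set if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.partition(":")[2].split())
    except OSError:
        pass
    return set()

if __name__ == "__main__":
    # Only meaningful on x86-64 Linux; other platforms will report v1.
    print("GOAMD64 candidate:", x86_64_level(read_flags()))
```

In a mixed fleet the safe choice is the lowest level reported across all machines that will run the binary.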

Profile-guided Optimization and Beyond
Most compilers are capable of PGO based on `perf record` profiles.
■ Clang has AutoFDO.
■ Golang would likely have Feedback-Guided Optimization in 1.20.
You can go beyond compile-time optimization and use a post-link optimizer:
■ Facebook’s BOLT is now a part of LLVM:
https://github.com/llvm/llvm-project/tree/main/bolt

Libraries
Any modern application consists of a myriad of libraries. Most servers nowadays
would have allocator, TLS, compression, and serialization libraries. These are the
main candidates for tuning. For example, in the case of C/C++ servers:
■ Keeping libraries up-to-date is important.
● It doesn’t matter whether the CPU supports AVX2 if your library can’t use it.
■ Changing the malloc implementation is an option.
● Both jemalloc and tcmalloc have excellent tuning guides.
■ BoringSSL can (mostly) be used as a drop-in replacement for OpenSSL.
● Often switching from RSA to ECDSA, or from AES to ChaCha (or back) can improve perf.
■ zlib has multiple performance-oriented forks.
● Intel, Cloudflare, zlib-ng.
● Sometimes more efficient algorithms like brotli or zstd can be used instead.
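
When evaluating a compression library or setting, the two numbers that matter are the compression ratio and the CPU time spent. The Python sketch below measures both for the standard-library zlib binding at a few levels. It is only a measurement harness under assumptions: the payload is synthetic, the function name is hypothetical, and brotli or zstd would need third-party packages that are not used here.

```python
import time
import zlib

def measure(payload, level):
    """Return (compression ratio, seconds spent compressing) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Sanity check: the data must survive a round trip.
    if zlib.decompress(compressed) != payload:
        raise ValueError("round trip failed")
    return len(payload) / len(compressed), elapsed

if __name__ == "__main__":
    # Synthetic payload (assumption); use real responses or files for real decisions.
    payload = b'{"user": "example", "items": [1, 2, 3], "ok": true}\n' * 20_000
    for level in (1, 6, 9):
        ratio, seconds = measure(payload, level)
        print(f"level={level} ratio={ratio:.1f} time={seconds * 1000:.2f}ms")
```

Swapping in a different library only requires replacing the compress/decompress calls, which keeps comparisons between candidates consistent.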

Designing
Data-Intensive
Applications
by Martin Kleppmann

Site
Reliability
Engineering
Chapter 19. Load Balancing at the Frontend
Chapter 20. Load Balancing in the Datacenter
Chapter 21. Handling Overload
Chapter 22. Addressing Cascading Failures

Alexey Ivanov
rbtz@dapperlabs.com
@SaveTheRbtz
