Linux kernel tracing superpowers in the cloud

Andrea Righi
andrea@betterservers.com
@arighi

Who am I?
● Andrea Righi
● Performance engineer @ BetterServers.com
● My main activities
  ● Linux kernel stuff
  ● Virtualization
  ● Storage
  ● Cloud computing

Agenda
● Overview
● Profiling Technologies
● Examples
● Q/A

Premature optimization anti-methodology
https://imgs.xkcd.com/comics/optimization.png

Drunk-man anti-methodology
● Tune random things until the problem goes away

Blame someone else anti-methodology
● Find a component X that you are not responsible for and redirect problems to component X

Problem-solving methodology
● Observe
● Measure
● Optimize
● Rinse and repeat...

CPU sampling vs tracing
● Sampling
  ● A periodic timer interrupt collects the current program counter, function address and the entire stack backtrace
● Tracing
  ● Record the times and invocations of specific events
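A minimal user-space sketch of the difference, assuming a POSIX system with setitimer(2). All names here (current_fn, spin, run_sampling_demo) are made up for illustration, and a plain variable stands in for the program counter that a real profiler would read from the interrupted context. The SIGPROF handler plays the role of the sampling interrupt, while the calls[] counters play the role of trace hooks.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

enum { FN_IDLE, FN_HOT, FN_COLD, FN_COUNT };

/* Hypothetical stand-in for the program counter: which "function" runs now. */
volatile sig_atomic_t current_fn = FN_IDLE;
/* Sampling: how many timer ticks landed in each function (statistical). */
volatile sig_atomic_t samples[FN_COUNT];
/* Tracing: how many times each function was entered (exact). */
unsigned long calls[FN_COUNT];

/* Sampling "interrupt": record where the program is right now. */
static void on_sample(int signo)
{
    (void)signo;
    samples[current_fn]++;
}

/* Burn CPU on behalf of function `fn`, with a trace hook at entry. */
void spin(int fn, unsigned long iters)
{
    volatile unsigned long sink = 0;
    int prev = current_fn;

    assert(fn >= 0 && fn < FN_COUNT);
    calls[fn]++;                 /* tracing: one record per invocation */
    current_fn = fn;
    for (unsigned long i = 0; i < iters; i++)
        sink += i;
    current_fn = prev;
}

long total_samples(void)
{
    long total = 0;

    for (int i = 0; i < FN_COUNT; i++)
        total += samples[i];
    return total;
}

/* Run a 9:1 hot/cold workload until about 50 samples were taken.
 * Returns the number of samples, or -1 if the timer could not be set up. */
long run_sampling_demo(void)
{
    struct sigaction sa;
    struct itimerval every_ms = { { 0, 1000 }, { 0, 1000 } };
    struct itimerval off = { { 0, 0 }, { 0, 0 } };

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;
    if (setitimer(ITIMER_PROF, &every_ms, NULL) != 0)
        return -1;

    /* The round limit only guards against a timer that never fires. */
    for (int round = 0; round < 2000 && total_samples() < 50; round++) {
        spin(FN_HOT, 900000);
        spin(FN_COLD, 100000);
    }

    setitimer(ITIMER_PROF, &off, NULL);   /* stop sampling */
    return total_samples();
}
```

After run_sampling_demo() returns, calls[FN_HOT] and calls[FN_COLD] are equal (tracing reports how often something happened), while most samples would be expected to fall in samples[FN_HOT] (sampling reports where the time went).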

Generic performance analysis tools
● uptime → system uptime and load average
● top → generic overall system stats
● vmstat 1 → system/memory stats over time
● mpstat -P ALL 1 → per-CPU load balance
● pidstat 1 → per-process usage
● iostat -kxd 1 → disk I/O
● free -m → memory usage
● sar -n DEV 1 → network I/O
● dmesg | tail → most recent kernel messages

perf
● perf is a powerful multi-tool and profiler
  ● Interval sampling
  ● CPU performance counter events
  ● user + kernel sampling and tracing
  ● event filtering
● perf top → best tool to get an idea of what’s going on in the system

Visualizing traces: flame graphs
● CPU flame graphs
  ● x-axis: sample population
  ● y-axis: stack depth
  ● Wider boxes = More samples = More time spent on CPU
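As a rough sketch of where box widths come from: flame graph tooling typically works on "folded" stacks, one line per unique stack with frames joined by ';' and followed by the number of samples. The helper names and the sample stacks below are invented for illustration, not taken from any real profile.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One unique stack ("frame;frame;frame") and how many samples hit it. */
struct folded {
    const char *stack;
    unsigned count;
};

static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the sampled stacks in place and collapse identical ones.
 * `out` must have room for n entries; returns the number of unique stacks. */
size_t fold_stacks(const char **stacks, size_t n, struct folded *out)
{
    size_t uniq = 0;

    qsort(stacks, n, sizeof(*stacks), cmp_str);
    for (size_t i = 0; i < n; i++) {
        if (uniq > 0 && strcmp(out[uniq - 1].stack, stacks[i]) == 0) {
            out[uniq - 1].count++;
        } else {
            out[uniq].stack = stacks[i];
            out[uniq].count = 1;
            uniq++;
        }
    }
    return uniq;
}

/* Box width as a fraction of the x-axis: samples for this stack / all samples. */
double width_fraction(unsigned count, size_t total)
{
    assert(total > 0);
    return (double)count / (double)total;
}

/* Print the folded form, one "stack count" pair per line. */
void print_folded(const struct folded *f, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("%s %u\n", f[i].stack, f[i].count);
}
```

With five samples, three of them "main;compute" and two "main;read_file;sys_read", the compute box would span 3/5 of the width: wider only because it was sampled more often.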

strace
● strace(1): system call tracer in Linux
● It uses the ptrace() system call, which pauses the target process on each syscall so that the tracer can read its state
● And it does this twice: when the syscall begins and when it ends!

strace overhead
### Regular execution ###
righiandr@Dell:~$ dd if=/dev/zero of=/dev/null bs=1 count=500k
512000+0 records in
512000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0,201641 s, 2,5 MB/s
### Strace execution (tracing a syscall that is never called) ###
righiandr@Dell:~$ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
512000+0 records in
512000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 11,7989 s, 43,4 kB/s
+++ exited with 0 +++
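A back-of-the-envelope reading of the two runs above, as a small sketch. It assumes each 1-byte record costs one read() and one write(), so about 1,024,000 syscalls in total, and it ignores the few syscalls dd makes at startup. The function names are illustrative; the timings are copied from the output (the decimal commas there come from the locale).

```c
#include <assert.h>
#include <stdio.h>

/* Elapsed times reported by the two dd runs above, in seconds. */
#define BASELINE_S 0.201641
#define TRACED_S   11.7989
/* Assumption: one read() plus one write() per 1-byte record. */
#define SYSCALLS   (2.0 * 512000.0)

/* How many times slower the traced run is. */
double slowdown(double baseline_s, double traced_s)
{
    assert(baseline_s > 0.0);
    return traced_s / baseline_s;
}

/* Extra time added to each syscall by tracing, in microseconds. */
double extra_us_per_syscall(double baseline_s, double traced_s, double syscalls)
{
    assert(syscalls > 0.0);
    return (traced_s - baseline_s) / syscalls * 1e6;
}

void print_strace_overhead(void)
{
    printf("slowdown: %.1fx\n", slowdown(BASELINE_S, TRACED_S));
    printf("extra cost: %.1f us per syscall\n",
           extra_us_per_syscall(BASELINE_S, TRACED_S, SYSCALLS));
}
```

Under these assumptions the traced run is roughly 58 times slower, which works out to about 11 microseconds added to every syscall, even though the only syscall being traced (accept) is never called.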

Tracepoint
● A tracepoint is special code statically placed in your program (the programmer defines where to put the tracepoint)
● Anyone who wants to see when the tracepoint is hit and extract data can “enable” or “activate” the tracepoint through a specific interface
● Two elements are required:
  ● Tracepoint definition (placed in a header file)
  ● Tracepoint statement (in C code)

Tracepoint example
TRACE_EVENT(ext4_free_inode,
	TP_PROTO(struct inode *inode),
	TP_ARGS(inode),
	TP_STRUCT__entry(
		__field(	dev_t,	dev	)
		__field(	ino_t,	ino	)
		__field(	uid_t,	uid	)
		__field(	gid_t,	gid	)
		__field(	__u64,	blocks	)
		__field(	__u16,	mode	)
	),
	TP_fast_assign(
		__entry->dev	= inode->i_sb->s_dev;
		__entry->ino	= inode->i_ino;
		__entry->uid	= i_uid_read(inode);
		__entry->gid	= i_gid_read(inode);
		__entry->blocks	= inode->i_blocks;
		__entry->mode	= inode->i_mode;
	),
	TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long) __entry->ino, __entry->mode,
		  __entry->uid, __entry->gid, __entry->blocks)
);
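One way to "enable" such a tracepoint from user space is to write 1 to the event's enable file in tracefs. The sketch below assumes tracefs is reachable at /sys/kernel/debug/tracing (the path used elsewhere in these slides) and that the caller has root privileges; the helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Assumption: tracefs is mounted under debugfs. */
#define TRACEFS "/sys/kernel/debug/tracing"

/* Build ".../events/<subsystem>/<event>/enable" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
int tracepoint_enable_path(char *buf, size_t size,
                           const char *subsystem, const char *event)
{
    int n;

    assert(buf && subsystem && event);
    n = snprintf(buf, size, "%s/events/%s/%s/enable",
                 TRACEFS, subsystem, event);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Write "1" (on) or "0" (off) to the enable file.
 * Returns 0 on success, -1 on any error. */
int tracepoint_set(const char *subsystem, const char *event, int on)
{
    char path[256];
    ssize_t written;
    int fd;

    if (tracepoint_enable_path(path, sizeof(path), subsystem, event) != 0)
        return -1;
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;   /* not root, tracefs not mounted, or no such event */
    written = write(fd, on ? "1" : "0", 1);
    close(fd);
    return written == 1 ? 0 : -1;
}
```

For the event defined above, tracepoint_set("ext4", "ext4_free_inode", 1) would turn it on; each hit should then show up in the trace file under the same directory, formatted according to TP_printk.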

Kprobes (Kernel probes)
● Trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit
● How does it work?
  ● Make a copy of the probed instruction and replace the original instruction with a breakpoint instruction (int3 on x86)
  ● When the breakpoint is hit, a trap occurs, the CPU registers are saved and control passes to the Kprobes pre-handler
  ● The saved instruction is executed in single-step mode
  ● The Kprobes post-handler is executed
  ● The rest of the original function is executed
● The same mechanism can be applied to user-space
  ● uprobes

Kprobe example: stack trace
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

static const char function_name[] = "schedule_timeout";

/* Pre-handler: runs right before the probed instruction is executed */
static int my_handler(struct kprobe *p, struct pt_regs *regs)
{
	dump_stack();
	/* regs->di holds the first function argument on x86-64 */
	printk(KERN_INFO "%s called %s(%d)\n",
	       current->comm, function_name, (int)regs->di);
	return 0;
}

static struct kprobe my_kp = {
	.pre_handler = my_handler,
	.symbol_name = function_name,
};

static int __init my_kprobe_init(void)
{
	return register_kprobe(&my_kp);
}

static void __exit my_kprobe_exit(void)
{
	unregister_kprobe(&my_kp);
}

module_init(my_kprobe_init);
module_exit(my_kprobe_exit);
MODULE_LICENSE("GPL");

Example: kprobe / uprobe
● Example (kprobe)
$ sudo ./bin/kprobe 'p:do_sys_open filename=+0(%si):string'
$ sudo ./bin/kprobe 'p:SyS_execve filename=+0(%di):string'
● Example (uprobe)
$ sudo ./bin/uprobe 'r:bash:readline +0($retval):string'
$ sudo ./bin/uprobe 'p:/lib/x86_64-linux-gnu/libc-2.23.so:system +0(%di):string'
$ sudo ./bin/uprobe 'p:/lib/x86_64-linux-gnu/libc-2.23.so:malloc size=%di'
● Tracing format
$ sudo cat /sys/kernel/debug/tracing/trace
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION

More advanced examples
● Access a complex data structure via kprobe and perf probe:
$ sudo -i perf probe --vmlinux=/home/righiandr/linux/vmlinux \
    -nv 'netif_receive_skb skb->dev->name'
...
Writing event: p:probe/netif_receive_skb _text+7991520 name=+0(+16(%di))
...
$ sudo ./bin/kprobe 'p:netif_receive_skb name=+0(+16(%di)):string'

Tracing overhead
● strace: high overhead
● tracepoints: low overhead
● kprobes/uprobes: very low overhead

eBPF: definition
● eBPF: a highly efficient virtual machine that lives in the kernel
● Ingo Molnar described eBPF as
  ● “One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively”

eBPF history
● Initially it was BPF: Berkeley Packet Filter
  ● It has its roots in BSD in the very early 1990s
  ● Originally designed as a mechanism for fast filtering of network packets
  ● Initially used in Linux by tcpdump to implement the filtering “engine” behind its complex command-line syntax
● Linux introduced eBPF: extended Berkeley Packet Filter (3.18, December 2014)
  ● More efficient / more generic than the original BPF
● Kernel 4.9: eBPF programs can be attached to perf_events
  ● Timed samples can now run BPF programs!

eBPF as a VM
● Example assembly of a simple eBPF filter
  ● Load 16-bit quantity from offset 12 in the packet to the accumulator (ethernet type)
  ● Compare the value to see if the packet is an IP packet
  ● If the packet is IP, return TRUE (packet is accepted)
  ● otherwise return 0 (packet is rejected)
● Only 4 VM instructions to filter IP packets!
    ldh [12]
    jeq #ETHERTYPE_IP, l1, l2
l1: ret #TRUE
l2: ret #0
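To make the "VM" idea concrete, here is a toy interpreter for just the three instructions used above (written in the classic BPF assembler style). The struct layout and names are invented for this sketch and are not the kernel's instruction encoding; TRUE is taken to be 1, and jump targets are expressed as "instructions to skip", as in classic BPF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IP 0x0800u   /* EtherType value for IPv4 */

/* Hypothetical opcodes covering only the listing above. */
enum toy_op { OP_LDH_ABS, OP_JEQ_K, OP_RET_K };

struct toy_insn {
    enum toy_op op;
    uint32_t k;       /* packet offset (ldh) or constant (jeq, ret) */
    uint8_t jt, jf;   /* instructions to skip when jeq is true / false */
};

/* ldh [12]; jeq #ETHERTYPE_IP, l1, l2; l1: ret #TRUE; l2: ret #0 */
const struct toy_insn ip_filter[] = {
    { OP_LDH_ABS, 12,           0, 0 },
    { OP_JEQ_K,   ETHERTYPE_IP, 0, 1 },
    { OP_RET_K,   1,            0, 0 },
    { OP_RET_K,   0,            0, 0 },
};

/* Run a program against one packet; the return value is the verdict
 * (non-zero = accept). Out-of-bounds loads reject the packet. */
uint32_t toy_bpf_run(const struct toy_insn *prog, size_t n,
                     const uint8_t *pkt, size_t len)
{
    uint32_t acc = 0;   /* the accumulator */
    size_t pc = 0;

    assert(prog && pkt);
    while (pc < n) {
        const struct toy_insn *in = &prog[pc++];

        switch (in->op) {
        case OP_LDH_ABS:   /* load a big-endian 16-bit value at offset k */
            if ((size_t)in->k + 2 > len)
                return 0;
            acc = ((uint32_t)pkt[in->k] << 8) | pkt[in->k + 1];
            break;
        case OP_JEQ_K:     /* conditional relative jump */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case OP_RET_K:     /* finish with verdict k */
            return in->k;
        }
    }
    return 0;   /* fell off the end of the program: reject */
}
```

Feeding it an Ethernet frame whose bytes 12 and 13 are 0x08 0x00 yields 1 (accepted); any other EtherType, or a frame shorter than 14 bytes, yields 0. The in-kernel JIT described later replaces this interpreter loop with native instructions.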

eBPF context
● eBPF is not specific to any particular context
  ● packet filtering: context is a packet
  ● tracing: context is a snapshot of processor registers when the tracepoint is hit
● JIT:
  ● every BPF instruction is mapped to an x86 instruction sequence
  ● accumulator and index registers are stored directly in the processor’s registers
  ● program is placed in vmalloc() space and executed directly when a context is processed

How to write an eBPF filter
● A filter can be written in C
  ● GCC backend as well as LLVM backend
● Compiler generates eBPF bytecode which resides in an ELF file
● Load the program into the kernel by using the bpf() syscall
/*
 * tracing filter example to print events
 * for loopback device only if attached to
 * netif_receive_skb()
 */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/bpf.h>
#include <trace/bpf_trace.h>

void filter(struct bpf_context *ctx)
{
	char devname[4] = "lo";
	struct net_device *dev;
	struct sk_buff *skb = 0;

	skb = (struct sk_buff *)ctx->regs.si;
	dev = bpf_load_pointer(&skb->dev);
	if (bpf_memcmp(dev->name, devname, 2) == 0) {
		char fmt[] = "skb %p dev %p \n";
		bpf_trace_printk(fmt, sizeof(fmt),
				 (long)skb, (long)dev, 0);
	}
}
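The loading step can be sketched directly against the bpf() syscall. This is only an illustration under several assumptions: the kernel headers providing <linux/bpf.h> are installed, the kernel is 3.18 or later, and the caller is allowed to load BPF programs (often root only). To stay self-contained it loads a hand-assembled two-instruction program ("r0 = 0; exit") as a socket filter rather than the compiled tracing filter above; the names tiny_prog and load_tiny_prog are made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* "r0 = 0; exit": about the smallest program the verifier will accept. */
const struct bpf_insn tiny_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Ask the kernel to verify and load tiny_prog.
 * `log` may be NULL; otherwise it receives the verifier's messages.
 * Returns a file descriptor for the loaded program, or -1 with errno set. */
int load_tiny_prog(char *log, unsigned int log_size)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (uint64_t)(uintptr_t)tiny_prog;
    attr.insn_cnt  = sizeof(tiny_prog) / sizeof(tiny_prog[0]);
    attr.license   = (uint64_t)(uintptr_t)"GPL";
    if (log && log_size > 0) {
        attr.log_buf   = (uint64_t)(uintptr_t)log;
        attr.log_size  = log_size;
        attr.log_level = 1;
    }
    /* glibc has no bpf() wrapper, so go through syscall(2). */
    return (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

A caller would typically pass a log buffer of a few kilobytes, check for a negative return (for example EPERM when not privileged) and close() the descriptor when done. A real tracing filter would be loaded the same way, with a tracing program type such as BPF_PROG_TYPE_KPROBE and the instructions taken from the compiled ELF file.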

Source: https://www.goodreads.com/author_blog_posts/14131100-linux-4-9-s-efficient-bpf-profiler

Parasite thread injection
● Concept of parasite thread injection introduced in Linux 3.4 (via PTRACE_SEIZE)
● Attach to the target pid without stopping it and become a “parasite” thread of pid
● Original goal: freeze and restore TCP connections during checkpoint/restart
● Example
  ● python-pyrasite: injecting code into running Python programs

References
● Brendan Gregg blog
  ● http://brendangregg.com/blog/
● BCC tools
  ● https://github.com/iovisor/bcc
● Perf-tools
  ● https://github.com/brendangregg/perf-tools
● Perf-labs
  ● https://github.com/brendangregg/perf-labs
● Linux documentation
  ● http://lxr.linux.no/linux/Documentation/trace
  ● http://lxr.linux.no/linux/Documentation/kprobes.txt
● The BSD Packet Filter: A New Architecture for User-level Packet Capture, S. McCanne and V. Jacobson
  ● http://www.tcpdump.org/papers/bpf-usenix93.pdf
● Linux Weekly News
  ● http://lwn.net

Thanks
● @arighi
● andrea@betterservers.com
