Linux kernel tracing superpowers in the cloud
Andrea Righi
andrea@betterservers.com
@arighi
Who am I?
● Andrea Righi
● Performance engineer @ BetterServers.com
● My main activities
  ● Linux kernel stuff
  ● Virtualization
  ● Storage
  ● Cloud computing
Agenda
● Overview
● Profiling Technologies
● Examples
● Q/A
Overview
https://imgs.xkcd.com/comics/optimization.png
Premature optimization anti-methodology
Drunk-man anti-methodology
● Tune random things until the problem goes away
Blame someone else anti-methodology
● Find a component X that you are not responsible for and redirect problems to component X
Problem-solving methodology
● Observe
● Measure
● Optimize
● Rinse and repeat...
CPU sampling vs tracing
● Sampling
  ● Fire a periodic timer interrupt that records the current program counter, the function address and the entire stack backtrace
● Tracing
  ● Record the times and invocations of specific events
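The difference can be sketched with a small simulation. The timeline below is made up for illustration (function names and durations are assumptions, not measurements): tracing sees every invocation with its exact duration, while sampling only sees whatever happens to be on the CPU when the timer fires.

```python
from collections import Counter

# Hypothetical timeline of what runs on one CPU: (function, start_ms, end_ms).
TIMELINE = [
    ("parse", 0, 30),
    ("compress", 30, 90),
    ("parse", 90, 95),
    ("write", 95, 100),
]


def trace(timeline):
    """Tracing: record every invocation and its exact duration."""
    calls, time_ms = Counter(), Counter()
    for func, start, end in timeline:
        calls[func] += 1
        time_ms[func] += end - start
    return calls, time_ms


def sample(timeline, period_ms):
    """Sampling: look at the CPU every period_ms and note what is running."""
    hits = Counter()
    end_of_run = timeline[-1][2]
    for t in range(0, end_of_run, period_ms):
        for func, start, end in timeline:
            if start <= t < end:
                hits[func] += 1
                break
    return hits


calls, time_ms = trace(TIMELINE)
hits = sample(TIMELINE, period_ms=10)
print("traced calls:", dict(calls))
print("traced time (ms):", dict(time_ms))
print("samples (10 ms period):", dict(hits))
```

In this toy run the 5 ms `write` call falls between two samples and never shows up in the sampled profile, and sampling cannot tell that `parse` was invoked twice. Tracing captures both facts, at the price of doing work on every event.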
Generic performance analysis tools
● uptime → system lifetime and load average
● top → generic overall system stat
● vmstat 1 → system/memory stat by time
● mpstat -P ALL 1 → CPU load balancing
● pidstat 1 → process usage
● iostat -kxd 1 → disk I/O
● free -m → memory usage
● sar -n DEV 1 → network I/O
● dmesg | tail → most recent kernel messages
Sampling technologies
perf
● perf is a powerful multi-tool and profiler
  ● Interval sampling
  ● CPU performance counter events
  ● user + kernel sampling and tracing
  ● event filtering
● perf top → best tool to get an idea of what’s going on in the system
Visualizing traces: flame graphs
● CPU flame graphs
  ● x-axis
    ● sample population
  ● y-axis
    ● stack depth
  ● Wider boxes = more samples = more time spent on CPU
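A minimal sketch of how stack samples turn into box widths. The samples are invented for illustration, and the "folded" one-line-per-stack form is the kind of input flame graph scripts typically consume; the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical stack samples, outermost frame first.
SAMPLES = [
    ("main", "read_file", "sys_read"),
    ("main", "read_file", "sys_read"),
    ("main", "parse"),
    ("main", "parse", "strlen"),
    ("main", "read_file", "sys_read"),
]


def fold(samples):
    """Collapse identical stacks into 'frame;frame;frame' -> sample count."""
    return Counter(";".join(stack) for stack in samples)


def frame_widths(samples):
    """Width of each box: the share of samples whose stack passes through it.

    A box is identified by its full path from the root, so the same function
    reached through different callers gets separate boxes.
    """
    total = len(samples)
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += 1
    return {path: count / total for path, count in widths.items()}


for folded, count in sorted(fold(SAMPLES).items()):
    print(folded, count)

for path, width in sorted(frame_widths(SAMPLES).items()):
    print(f"depth {len(path)}: {path[-1]:<10} {width:.0%} of samples")
```

The depth of a path is its position on the y-axis and its share of samples is its width on the x-axis. Position along the x-axis carries no time ordering, only population.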
Tracing technologies
strace
● strace(1): system call tracer in Linux
● It uses the ptrace() system call, which pauses the target process at each syscall so that the tracer can read its state
● And it does this twice: when the syscall begins and when it ends!
strace overhead
### Regular execution ###
righiandr@Dell:~$ dd if=/dev/zero of=/dev/null bs=1 count=500k
512000+0 records in
512000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0,201641 s, 2,5 MB/s
### Strace execution (tracing a syscall that is never called) ###
righiandr@Dell:~$ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
512000+0 records in
512000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 11,7989 s, 43,4 kB/s
+++ exited with 0 +++
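A quick back-of-the-envelope calculation on the two runs above. The only assumption is that, with bs=1, dd issues one read() and one write() per record, and that the handful of other syscalls is negligible.

```python
# Figures copied from the two dd runs above.
RECORDS = 512_000            # "512000+0 records in/out"
PLAIN_SECONDS = 0.201641     # regular execution
STRACE_SECONDS = 11.7989     # under strace -eaccept

# Assumption: one read() plus one write() per 1-byte record.
syscalls = 2 * RECORDS

slowdown = STRACE_SECONDS / PLAIN_SECONDS
extra_per_syscall_us = (STRACE_SECONDS - PLAIN_SECONDS) / syscalls * 1e6

print(f"slowdown: {slowdown:.1f}x")                              # 58.5x
print(f"added cost per syscall: {extra_per_syscall_us:.1f} us")  # 11.3 us
```

So even though the traced syscall (accept) is never called, every read() and write() pays roughly 11 µs for the two ptrace stops.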
Tracepoint
● A tracepoint is special code statically placed in your program (the programmer defines where to put the tracepoint)
● If someone wants to see when the tracepoint is hit and extract data, they can “enable” or “activate” the tracepoint using a specific interface
● Two elements are required:
  ● Tracepoint definition (placed in a header file)
  ● Tracepoint statement (in C code)
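For kernel tracepoints, one such interface is the per-event `enable` file in the tracing filesystem. The sketch below builds that path and writes to it. It assumes tracefs is reachable under /sys/kernel/debug/tracing (the location used elsewhere in these slides), takes the ext4_free_inode event from the ext4 subsystem as its example, and uses helper names that are purely illustrative.

```python
from pathlib import Path

# Assumption: tracefs is mounted under debugfs at this location.
TRACING_DIR = Path("/sys/kernel/debug/tracing")


def tracepoint_enable_file(subsystem, event, tracing_dir=TRACING_DIR):
    """Return the control file that switches one tracepoint on or off."""
    return tracing_dir / "events" / subsystem / event / "enable"


def set_tracepoint(subsystem, event, enabled, tracing_dir=TRACING_DIR):
    """Write '1' or '0' to the control file (needs root on a real system)."""
    control = tracepoint_enable_file(subsystem, event, tracing_dir)
    control.write_text("1" if enabled else "0")
    return control


# Only print the path here; actually enabling the event requires root.
print(tracepoint_enable_file("ext4", "ext4_free_inode"))
```

Once an event is enabled, its records show up in the `trace` file of the same directory.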
Tracepoint example
TRACE_EVENT(ext4_free_inode,
	TP_PROTO(struct inode *inode),

	TP_ARGS(inode),

	TP_STRUCT__entry(
		__field(	dev_t,	dev		)
		__field(	ino_t,	ino		)
		__field(	uid_t,	uid		)
		__field(	gid_t,	gid		)
		__field(	__u64,	blocks		)
		__field(	__u16,	mode		)
	),

	TP_fast_assign(
		__entry->dev	= inode->i_sb->s_dev;
		__entry->ino	= inode->i_ino;
		__entry->uid	= i_uid_read(inode);
		__entry->gid	= i_gid_read(inode);
		__entry->blocks	= inode->i_blocks;
		__entry->mode	= inode->i_mode;
	),

	TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long) __entry->ino, __entry->mode,
		  __entry->uid, __entry->gid, __entry->blocks)
);
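The TP_printk format string fixes the text layout of each event, so the output is easy to post-process. A minimal parsing sketch follows; the sample payload is made up purely to exercise the regular expression, and `parse_free_inode` is a hypothetical helper.

```python
import re

# Mirrors the TP_printk format string of ext4_free_inode shown above.
EVENT_RE = re.compile(
    r"dev (?P<major>\d+),(?P<minor>\d+) ino (?P<ino>\d+) "
    r"mode 0(?P<mode>[0-7]+) uid (?P<uid>\d+) gid (?P<gid>\d+) "
    r"blocks (?P<blocks>\d+)"
)


def parse_free_inode(text):
    """Turn one formatted ext4_free_inode event into a dict of integers."""
    match = EVENT_RE.search(text)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()
              if key != "mode"}
    fields["mode"] = int(match["mode"], 8)  # the mode is printed in octal (%o)
    return fields


# Made-up event payload, not captured from a real system.
sample_event = "dev 8,1 ino 131074 mode 0100644 uid 1000 gid 1000 blocks 8"
print(parse_free_inode(sample_event))
```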
Kprobes (Kernel probes)
● Trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit
● How does it work?
  ● Make a copy of the probed instruction and replace the original instruction with a breakpoint instruction (int3 on x86)
  ● When the breakpoint is hit, a trap occurs, the CPU's registers are saved and control passes to the Kprobes pre-handler
  ● The saved instruction is executed in single-step mode
  ● The Kprobes post-handler is executed
  ● The rest of the original function is executed
● The same mechanism can be applied to user space
  ● uprobes
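The sequence above can be mimicked with a toy model in user space. This is only an analogy under loud assumptions: "kernel text" is a Python list of callables, the breakpoint is a marker string, and `ToyKprobe` and `run` are hypothetical names that have nothing to do with the real kernel API.

```python
# Toy model: "kernel text" is a list of callables, one per instruction.
BREAKPOINT = "int3"


class ToyKprobe:
    def __init__(self, text, addr, pre_handler=None, post_handler=None):
        self.text, self.addr = text, addr
        self.pre_handler, self.post_handler = pre_handler, post_handler
        self.saved = None

    def register(self):
        # Copy the probed instruction, then overwrite it with a breakpoint.
        self.saved = self.text[self.addr]
        self.text[self.addr] = BREAKPOINT

    def unregister(self):
        # Put the original instruction back.
        self.text[self.addr] = self.saved


def run(text, probes, regs):
    """Execute every instruction, trapping into a probe on a breakpoint."""
    for addr, insn in enumerate(text):
        if insn == BREAKPOINT:
            probe = probes[addr]
            if probe.pre_handler:
                probe.pre_handler(probe, regs)     # trap -> pre-handler
            probe.saved(regs)                      # single-step the saved copy
            if probe.post_handler:
                probe.post_handler(probe, regs)    # then the post-handler
        else:
            insn(regs)                             # normal execution
    return regs


def load(regs):
    regs["ax"] = 20


def double(regs):
    regs["ax"] *= 2


def add_two(regs):
    regs["ax"] += 2


log = []
text = [load, double, add_two]
probe = ToyKprobe(
    text, 1,
    pre_handler=lambda p, r: log.append(("pre", r["ax"])),
    post_handler=lambda p, r: log.append(("post", r["ax"])),
)
probe.register()
result = run(text, {1: probe}, {})
probe.unregister()
print(result, log)
```

The probed function still computes the same result; the handlers only observe the register state before and after the probed instruction.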
Kprobe example: stack trace
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>

static const char function_name[] = "schedule_timeout";

/* Pre-handler: runs right before the probed instruction is executed */
static int my_handler(struct kprobe *p, struct pt_regs *regs)
{
	dump_stack();
	/* x86_64 only: the first function argument is passed in %rdi */
	printk(KERN_INFO "%s called %s(%d)\n",
	       current->comm, function_name, (int)regs->di);
	return 0;
}

static struct kprobe my_kp = {
	.pre_handler = my_handler,
	.symbol_name = function_name,
};

static int __init my_kprobe_init(void)
{
	return register_kprobe(&my_kp);
}

static void __exit my_kprobe_exit(void)
{
	unregister_kprobe(&my_kp);
}

module_init(my_kprobe_init);
module_exit(my_kprobe_exit);
MODULE_LICENSE("GPL");
Example: kprobe / uprobe
● Example (kprobe)
$ sudo ./bin/kprobe 'p:do_sys_open filename=+0(%si):string'
$ sudo ./bin/kprobe 'p:SyS_execve filename=+0(%di):string'
● Example (uprobe)
$ sudo ./bin/uprobe 'r:bash:readline +0($retval):string'
$ sudo ./bin/uprobe 'p:/lib/x86_64-linux-gnu/libc-2.23.so:system +0(%di):string'
$ sudo ./bin/uprobe 'p:/lib/x86_64-linux-gnu/libc-2.23.so:malloc size=%di'
● Tracing format
$ sudo cat /sys/kernel/debug/tracing/trace
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
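Each record in the trace file follows the column layout of that header, which makes it straightforward to split into fields. A minimal parsing sketch: the sample line is invented to match the layout (it is not real output), and the parser assumes one flag character per column named in the header.

```python
import re

LINE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+"   # TASK-PID
    r"\[(?P<cpu>\d+)\]\s+"                 # CPU#
    r"(?P<flags>\S+)\s+"                   # irqs-off, need-resched, ...
    r"(?P<timestamp>\d+\.\d+):\s+"         # TIMESTAMP
    r"(?P<function>.*)$"                   # FUNCTION (event name and payload)
)

# Column names taken from the header printed at the top of the trace file.
FLAG_COLUMNS = ("irqs-off", "need-resched", "hardirq/softirq", "preempt-depth")


def parse_trace_line(line):
    """Split one trace record into its fields; return None for other lines."""
    if line.lstrip().startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    record["cpu"] = int(record["cpu"])
    record["timestamp"] = float(record["timestamp"])
    record["flags"] = dict(zip(FLAG_COLUMNS, record["flags"]))
    return record


# Made-up record that follows the header layout; not captured from a real run.
SAMPLE_LINE = ('            bash-4567  [002] d...  1234.567890: '
               'do_sys_open: (do_sys_open+0x0/0x220) filename="/etc/passwd"')
print(parse_trace_line(SAMPLE_LINE))
```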
More advanced examples
● Access complex data structures via kprobe and perf probe:
$ sudo -i perf probe --vmlinux=/home/righiandr/linux/vmlinux \
    -nv 'netif_receive_skb skb->dev->name'
...
Writing event: p:probe/netif_receive_skb _text+7991520 name=+0(+16(%di))
...
$ sudo ./bin/kprobe 'p:netif_receive_skb name=+0(+16(%di)):string'
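The nested fetch argument is easier to read once unfolded. In `name=+0(+16(%di)):string`, the inner `+16(%di)` loads the pointer stored 16 bytes past the address in %di (skb->dev in the perf probe output above), and the outer `+0(...)` reads from offset 0 of that pointer (dev->name) as a string. A small sketch that unfolds such expressions; `parse_fetch_arg` and `describe` are hypothetical helpers and cover only the register/offset forms used in these slides.

```python
import re

# Matches one dereference level: +offset(inner) or -offset(inner).
DEREF_RE = re.compile(r"([+-]\d+)\((.*)\)")


def parse_fetch_arg(arg):
    """Split NAME=FETCH:TYPE and unfold nested +offs(...) dereferences."""
    name, expr = arg.split("=", 1) if "=" in arg else (None, arg)
    type_name = None
    if ":" in expr:
        expr, type_name = expr.rsplit(":", 1)
    offsets = []
    match = DEREF_RE.fullmatch(expr)
    while match:
        offsets.append(int(match.group(1)))
        expr = match.group(2)
        match = DEREF_RE.fullmatch(expr)
    # Offsets were collected outermost first; they are applied innermost first.
    return {"name": name, "base": expr,
            "offsets": offsets[::-1], "type": type_name}


def describe(arg):
    """Render the dereference chain as a readable sequence of steps."""
    parsed = parse_fetch_arg(arg)
    steps = [f"take {parsed['base']}"]
    steps += [f"dereference at offset {off}" for off in parsed["offsets"]]
    if parsed["type"]:
        steps.append(f"interpret as {parsed['type']}")
    return " -> ".join(steps)


print(describe("name=+0(+16(%di)):string"))
```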
Tracing overhead
● strace: high overhead
● tracepoints: low overhead
● kprobes/uprobes: very low overhead
Efficient profiling: eBPF
eBPF: definition
● eBPF: a highly efficient virtual machine that lives in the kernel
● Ingo Molnar described eBPF as
  ● “One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively”
eBPF history
● Initially it was BPF: Berkeley Packet Filter
  ● It has its roots in BSD in the very early 1990s
  ● Originally designed as a mechanism for fast filtering of network packets
  ● Initially used in Linux by tcpdump to implement the filtering “engine” behind its complex command-line syntax
● Linux introduced eBPF: extended Berkeley Packet Filter (3.18 – December 2014)
  ● More efficient / more generic than the original BPF
● Kernel 4.9: eBPF programs can be attached to perf_events
  ● Timed samples can now run BPF programs!
eBPF as a VM
● Example assembly of a simple BPF filter (classic BPF syntax)
  ● Load the 16-bit quantity at offset 12 in the packet into the accumulator (Ethernet type)
  ● Compare the value to see if the packet is an IP packet
  ● If the packet is IP, return TRUE (packet is accepted)
  ● Otherwise return 0 (packet is rejected)
● Only 4 VM instructions to filter IP packets!
    ldh [12]
    jeq #ETHERTYPE_IP, l1, l2
l1: ret #TRUE
l2: ret #0
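To make the four instructions concrete, here is a tiny interpreter for just this program. It is a sketch under stated assumptions: only the three opcodes used above are modelled, jump targets are absolute indices into the program list (real BPF encodes relative offsets), and the value chosen for TRUE is arbitrary, since any non-zero return is treated as "accept" here.

```python
import struct

ETHERTYPE_IP = 0x0800   # EtherType value for IPv4
TRUE = 0xFFFF           # assumption: any non-zero return value means "accept"

# The four instructions above, as (opcode, operands...) tuples.
PROGRAM = [
    ("ldh", 12),                    # A = 16-bit value at packet offset 12
    ("jeq", ETHERTYPE_IP, 2, 3),    # if A == ETHERTYPE_IP goto l1 else l2
    ("ret", TRUE),                  # l1: accept
    ("ret", 0),                     # l2: reject
]


def run_filter(program, packet):
    """Interpret the program against one packet and return its verdict."""
    accumulator = 0
    pc = 0
    while True:
        op, *args = program[pc]
        if op == "ldh":
            # Load a big-endian (network order) halfword into the accumulator.
            (accumulator,) = struct.unpack_from("!H", packet, args[0])
            pc += 1
        elif op == "jeq":
            value, if_true, if_false = args
            pc = if_true if accumulator == value else if_false
        elif op == "ret":
            return args[0]
        else:
            raise ValueError(f"unknown opcode {op!r}")


def ethernet_frame(ethertype, payload=b""):
    """Build a minimal frame: 6-byte dst MAC, 6-byte src MAC, EtherType."""
    return bytes(6) + bytes(6) + struct.pack("!H", ethertype) + payload


print(run_filter(PROGRAM, ethernet_frame(0x0800)) != 0)   # IPv4 frame
print(run_filter(PROGRAM, ethernet_frame(0x0806)) != 0)   # ARP frame
```

Offset 12 is where the EtherType sits because it follows the two 6-byte MAC addresses at the start of an Ethernet frame.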
eBPF context
● eBPF is not specific to any particular context
  ● packet filtering: context is a packet
  ● tracing: context is a snapshot of the processor registers when the tracepoint is hit
● JIT:
  ● every BPF instruction is mapped to an x86 instruction sequence
  ● accumulator and index registers are kept directly in the processor’s registers
  ● the program is placed in vmalloc() space and executed directly when a context is processed
How to write an eBPF filter
● A filter can be written in C
  ● GCC backend as well as LLVM backend
● The compiler generates eBPF bytecode, which resides in an ELF file
● Load the program into the kernel by using the bpf() syscall
/*
 * tracing filter example to print events
 * for loopback device only if attached to
 * netif_receive_skb()
 */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/bpf.h>
#include <trace/bpf_trace.h>

void filter(struct bpf_context *ctx)
{
	char devname[4] = "lo";
	struct net_device *dev;
	struct sk_buff *skb = 0;

	/* skb pointer is taken from the saved registers of the probed call */
	skb = (struct sk_buff *)ctx->regs.si;
	dev = bpf_load_pointer(&skb->dev);
	if (bpf_memcmp(dev->name, devname, 2) == 0) {
		char fmt[] = "skb %p dev %p \n";
		bpf_trace_printk(fmt, sizeof(fmt),
				 (long)skb, (long)dev, 0);
	}
}
Source: https://www.goodreads.com/author_blog_posts/14131100-linux-4-9-s-efficient-bpf-profiler
Thread-injection profiling
Parasite thread injection
● Concept of parasite thread injection introduced in Linux 3.4 (via PTRACE_SEIZE)
  ● Attach to the target pid without stopping it, becoming a “parasite” thread of that pid
  ● Original goal: freeze and restore TCP connections during checkpoint/restart
● Example
  ● python-pyrasite: injecting code into running Python programs
References
● Brendan Gregg blog
  ● http://brendangregg.com/blog/
● BCC tools
  ● https://github.com/iovisor/bcc
● Perf-tools
  ● https://github.com/brendangregg/perf-tools
● Perf-labs
  ● https://github.com/brendangregg/perf-labs
● Linux documentation
  ● http://lxr.linux.no/linux/Documentation/trace
  ● http://lxr.linux.no/linux/Documentation/kprobes.txt
● The BSD Packet Filter: A New Architecture for User-level Packet Capture - S. McCanne and V. Jacobson
  ● http://www.tcpdump.org/papers/bpf-usenix93.pdf
● Linux Weekly News
  ● http://lwn.net
Thanks
● @arighi
● andrea@betterservers.com