Linux extended Berkeley Packet Filters
Be kind to the WiFi!
Be kind to others
Thank you!
Slides: https://workshop.bpf.sh/
A machine with Linux Kernel 4.18+ (or the provided Vagrant machine)
Be comfortable with the UNIX command line
navigating directories
editing files
a little bit of bash-fu (environment variables, loops)
Tell me and I forget.
Teach me and I remember.
Involve me and I learn.
Misattributed to Benjamin Franklin
(Probably inspired by Chinese Confucian philosopher Xunzi)
Once in a while, the instructions will say:
“Open a new terminal.”
There are multiple ways to do this:
create a new window or tab on your machine, and SSH into the VM;
use screen or tmux on the VM and open a new window from there;
or, if you are working on a local Linux machine, just open a new terminal there.
You are welcome to use the method that you feel the most comfortable with.
Tmux is a terminal multiplexer like screen.
You don’t have to use it or even know about it to follow along.
But some of us like to use it to switch between terminals.
It comes preinstalled in the Vagrant machine we provided
Vagrantfile.
macOS
brew cask install virtualbox
brew cask install vagrant
Windows
Download Vagrant and Virtualbox
Ubuntu
apt install vagrant virtualbox
Make sure to have:
The other tools we need, bpftrace and bcc, will have their own setup instructions in the respective chapters.
After cloning the workshop repository, enter the environment folder:
git clone https://github.com/bpftools/bpf-workshop.git
cd bpf-workshop/environment
Then there are four main things you can do:
# Start the environment
vagrant up
# Stop the environment
vagrant halt
# Destroy the environment
vagrant destroy
# Obtain a shell
vagrant ssh
The whole workshop is hands-on
We are going to write some eBPF programs
All hands-on sections are clearly identified, like the gray rectangle below
The BPF in-kernel virtual machine
Many different types of maps
BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
man 2 bpf
Map operations
bpf_map_lookup_elem
bpf_map_delete_elem
bpf_map_update_elem
bpf_map_get_next_key
bpf_map_lookup_and_delete_elem
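To get a feel for what these operations do before touching the kernel, here is a toy model in plain Python. It is only a sketch: the class name ToyHashMap and its method names are made up for illustration, and the errors raised merely stand in for the errno values a real hash map would return. The flag values are the ones defined in linux/bpf.h.

```python
# Toy, user-space model of BPF hash map semantics. This is NOT the kernel API.
BPF_ANY, BPF_NOEXIST, BPF_EXIST = 0, 1, 2  # flag values from linux/bpf.h


class ToyHashMap:
    """Hypothetical stand-in for a BPF_MAP_TYPE_HASH map."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = {}

    def lookup_elem(self, key):
        # The kernel fails with ENOENT on a missing key; here we return None.
        return self._data.get(key)

    def update_elem(self, key, value, flags=BPF_ANY):
        exists = key in self._data
        if flags == BPF_NOEXIST and exists:
            raise FileExistsError(key)       # stands in for EEXIST
        if flags == BPF_EXIST and not exists:
            raise KeyError(key)              # stands in for ENOENT
        if not exists and len(self._data) >= self.max_entries:
            raise OverflowError("map full")  # stands in for E2BIG
        self._data[key] = value

    def delete_elem(self, key):
        del self._data[key]                  # KeyError stands in for ENOENT

    def lookup_and_delete_elem(self, key):
        return self._data.pop(key)           # KeyError stands in for ENOENT

    def get_next_key(self, key=None):
        # An unknown key restarts the walk from the first key.
        keys = list(self._data)
        if key not in self._data:
            return keys[0] if keys else None
        idx = keys.index(key) + 1
        return keys[idx] if idx < len(keys) else None


if __name__ == "__main__":
    countmap = ToyHashMap(max_entries=2)
    countmap.update_elem(6, 1)
    countmap.update_elem(17, 5)
    key = countmap.get_next_key()
    while key is not None:                   # the usual iteration pattern
        print(key, countmap.lookup_elem(key))
        key = countmap.get_next_key(key)
```

Iterating with get_next_key until it reports "no more keys" is the same loop shape a user-space program uses to walk a real map.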
bpf_spin_lock, which is essentially a lock protecting a map value;
bpf_trace_printk and bpf_get_current_pid_tgid;
bpf_perf_event_output
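#include <linux/bpf.h>

/* Place the annotated symbol in the named ELF section */
#define SEC(NAME) __attribute__((section(NAME), used))

/* Helper declaration, normally provided by a bpf_helpers.h header */
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
    (void *)BPF_FUNC_trace_printk;

SEC("tracepoint/syscalls/sys_enter_execve")
int bpf_prog(void *ctx) {
  char msg[] = "Hello, BPF World!";
  bpf_trace_printk(msg, sizeof(msg));
  return 0;
}

char _license[] SEC("license") = "GPL";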
clang -O2 -target bpf -c hello_world_kern.c -o hello_world_kern.o
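#include <stdio.h>
#include "bpf_load.h"

int main(int argc, char **argv) {
  /* Load the compiled object; the section name selects the attach point */
  if (load_bpf_file("hello_world_kern.o") != 0) {
    printf("The kernel didn't load the BPF program\n");
    return -1;
  }

  /* Print whatever bpf_trace_printk writes to the trace pipe */
  read_trace_pipe();
  return 0;
}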
bcc/examples folder;
hello_world.py tool;
from bcc import BPF

# Raw string, so the \n reaches the C compiler unchanged
source = r"""
int kprobe__sys_clone(void *ctx) {
    bpf_trace_printk("Hello, World!\n");
    return 0;
}
"""

BPF(text = source).trace_print()
bpf_source = """
#include <uapi/linux/ptrace.h>

BPF_PERF_OUTPUT(events);

struct data_t {
    char comm[16];
};
"""

bpf_source += """
int on_execve(struct pt_regs *ctx,
              const char __user *filename,
              const char __user *const __user *__argv,
              const char __user *const __user *__envp)
{
    struct data_t data = {};
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
"""

from bcc import BPF
from bcc.utils import printb

# Called once for every event submitted through the perf buffer
def dump_data(cpu, data, size):
    event = bpf["events"].event(data)
    printb(b"%-16s" % event.comm)

bpf = BPF(text = bpf_source)
execve_function = bpf.get_syscall_fnname("execve")
bpf.attach_kprobe(event = execve_function, fn_name = "on_execve")
bpf["events"].open_perf_buffer(dump_data)

while 1:
    bpf.perf_buffer_poll()
sudo tools/profile -p PID
git clone https://github.com/brendangregg/FlameGraph
sudo tools/profile -p PID -f > /tmp/profile.out
flamegraph.pl /tmp/profile.out > /tmp/profile-graph.svg \
  && firefox /tmp/profile-graph.svg
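The -f flag makes profile emit "folded" stacks, which is the input flamegraph.pl expects: one line per unique stack, frames joined with semicolons from root to leaf, followed by a sample count. As a rough sketch of that format (the function name fold_stacks and the sample frames are invented for illustration):

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled stacks into folded lines.

    samples: iterable of stacks, each a list of frame names ordered
    from the root frame to the leaf frame.
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


if __name__ == "__main__":
    # Made-up samples, as if the profiler had fired three times
    samples = [
        ["main", "handle_request", "parse"],
        ["main", "handle_request", "parse"],
        ["main", "handle_request", "render"],
    ]
    for line in fold_stacks(samples):
        print(line)
```

The wider a frame appears in the final SVG, the more samples contained it.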
On GitHub https://github.com/iovisor/bpftrace
What it is:
What it is NOT:
We will need to do some exercises with bpftrace. If you are not using the Vagrant environment, you might want to install it now!
Ubuntu snap package
sudo snap install --devmode bpftrace
sudo snap connect bpftrace:system-trace
Fedora (28 or later)
sudo dnf install bpftrace
You can find further instructions here
full | shortcut | Description |
---|---|---|
tracepoint | t | Kernel static tracepoints |
usdt | U | User-level statically defined tracing |
kprobe | k | Kernel function tracing |
kretprobe | kr | Kernel function returns |
uprobe | u | User-level function tracing |
uretprobe | ur | User-level function returns |
profile | p | Timed sampling across all CPUs |
interval | i | Interval output |
software | s | Kernel software events |
hardware | h | Processor hardware events |
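Every probe definition starts with one of these types, and the shortcut can be used in its place. A small sketch of that expansion, using only the table above (the helper name expand_probe is hypothetical, not part of bpftrace):

```python
# Shortcut -> full probe type, copied from the table above
PROBE_SHORTCUTS = {
    "t": "tracepoint",
    "U": "usdt",
    "k": "kprobe",
    "kr": "kretprobe",
    "u": "uprobe",
    "ur": "uretprobe",
    "p": "profile",
    "i": "interval",
    "s": "software",
    "h": "hardware",
}


def expand_probe(spec):
    """Replace a leading shortcut in a probe spec with its full name."""
    kind, sep, rest = spec.partition(":")
    return PROBE_SHORTCUTS.get(kind, kind) + sep + rest


if __name__ == "__main__":
    print(expand_probe("kr:vfs_read"))
    print(expand_probe("t:syscalls:sys_exit_read"))
```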
Per-event output
Map Summaries
function | description |
---|---|
hist(int n) | Produce a log2 histogram of values of n |
lhist(int n, int min, int max, int step) | Produce a linear histogram of values of n |
count() | Count the number of times this function is called |
sum(int n) | Sum this value |
min(int n) | Record the minimum value seen |
max(int n) | Record the maximum value seen |
avg(int n) | Average this value |
stats(int n) | Return the count, average, and total for this value |
delete(@x) | Delete the map element passed in as an argument |
str(char *s [, int length]) | Returns the string pointed to by s |
printf(char *fmt, …) | Print formatted to stdout |
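A log2 histogram groups values into power-of-two buckets, which keeps the output short even when values span many orders of magnitude. The sketch below shows the bucketing idea in plain Python; the function names are invented and the labels are not guaranteed to match bpftrace's exact output formatting.

```python
from collections import Counter


def log2_bucket(n):
    """Return a label for the power-of-two bucket that n falls into."""
    if n < 0:
        return "(..., 0)"
    if n < 2:
        return f"[{n}]"                  # 0 and 1 get their own buckets
    low = 1 << (n.bit_length() - 1)      # largest power of two <= n
    return f"[{low}, {low * 2})"


def log2_hist(values):
    """Count how many values land in each bucket."""
    return Counter(log2_bucket(v) for v in values)


if __name__ == "__main__":
    for bucket, count in log2_hist([0, 1, 3, 5, 7, 1000]).items():
        print(f"{bucket:<14} {count}")
```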
function | description |
---|---|
print(@x[, int top [, int div]]) | Print a map, with optional top entry count and divisor |
clear(@x) | Delete all key/values from a map |
sym(void *p) | Resolve kernel address |
usym(void *p) | Resolve user space address |
ntop([int af, ]int | char[4 |
kaddr(char *name) | Resolve kernel symbol name |
uaddr(char *name) | Resolve user space symbol name |
reg(char *name) | Returns the value stored in the named register |
join(char *arr[] [, char *delim]) | Prints the string array |
time(char *fmt) | Print the current time |
cat(char *filename) | Print file content |
system(char *fmt) | Execute shell command |
exit() | Quit bpftrace |
Basic Variables
Associative Arrays
Builtins
variable | description |
---|---|
tid | Thread ID (kernel pid) |
cgroup | Cgroup ID of the current process |
uid | User ID |
gid | Group ID |
nsecs | Nanosecond timestamp |
elapsed | Nanosecond timestamp since bpftrace initialization |
cpu | Processor ID |
comm | Process name |
variable | description |
---|---|
pid | Process ID (kernel tgid) |
stack | Kernel stack trace |
ustack | User stack trace |
arg0, arg1, … etc. | Arguments to the function being traced |
retval | Return value from function being traced |
func | Name of the function currently being traced |
probe | Full name of the probe |
curtask | Current task_struct as a u64 |
rand | Random number of type u32 |
$1, $2, … etc. | Positional parameters to the bpftrace program |
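The naming of pid and tid can be confusing: what user space calls the process ID is the kernel's tgid, and what user space calls the thread ID is the kernel's pid. The bpf_get_current_pid_tgid helper returns both packed into one 64-bit value, with the tgid in the upper 32 bits. A minimal sketch of the unpacking (the function name and the sample numbers are made up):

```python
def split_pid_tgid(pid_tgid):
    """Split the 64-bit value into (tgid, pid) as the kernel names them.

    In bpftrace terms: tgid is the `pid` builtin, pid is the `tid` builtin.
    """
    tgid = pid_tgid >> 32
    pid = pid_tgid & 0xFFFFFFFF
    return tgid, pid


if __name__ == "__main__":
    # Hypothetical process 1234 with a worker thread 1240
    packed = (1234 << 32) | 1240
    process_id, thread_id = split_pid_tgid(packed)
    print(f"pid={process_id} tid={thread_id}")
```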
tools folder
git clone https://github.com/iovisor/bpftrace.git
cd bpftrace/tools
tcpretrans.bt; tcpretrans.bt does just that
bpftrace/tools folder;
tcpretrans.bt tool;
bpftrace tcpretrans.bt
tcpretrans.bt active!
telnet bpf.sh 9090
vfs_read function in the kernel using a kretprobe;
bytes that will dump a linear histogram where the arguments are: value, min, max, step. The first argument (retval) is the return value of vfs_read(): the number of bytes read;
Ctrl-C to dump the results
bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200); }'
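To make the four lhist arguments concrete, here is a plain-Python sketch of linear bucketing with the same numbers as the exercise (0, 2000, 200). The function name is invented, and the labels only approximate what bpftrace prints. Values below min or at and above max fall into catch-all buckets at either end; a failed read returns a negative error code, so it ends up in the first one.

```python
def lhist_bucket(n, lo, hi, step):
    """Return a label for the linear bucket of width `step` that n falls into."""
    if n < lo:
        return f"(..., {lo})"
    if n >= hi:
        return f"[{hi}, ...)"
    start = lo + ((n - lo) // step) * step
    return f"[{start}, {min(start + step, hi)})"


if __name__ == "__main__":
    # Made-up return values of vfs_read, including one error (-11)
    for retval in [-11, 0, 199, 450, 1999, 4096]:
        print(f"{retval:>5} -> {lhist_bucket(retval, 0, 2000, 200)}")
```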
In Linux, all files are accessed through the Virtual Filesystem Switch, or VFS, a layer of code which implements generic filesystem actions and vectors requests to the correct specific code to handle the request.
kretprobe in the previous exercise
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = lhist(args->ret, 0, 2000, 200); }'
Ctrl-C to dump the results
What’s the difference?
While very powerful (it can trace any kernel function), the kretprobe approach can't be considered "stable", because internal kernel functions can change between kernel versions. Using a tracepoint is a much more stable approach, because kernel developers treat tracepoints as a user-facing feature rather than an internal one.
Whenever possible, use tracepoints instead of kprobe/kretprobe.
We have a Go program that prints a random number every second.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	for {
		time.Sleep(time.Second * 1)
		fmt.Printf("%d\n", giveMeNumber())
	}
}

func giveMeNumber() int {
	return rand.Intn(100) + rand.Intn(900)
}
We want to get the random number out of it using a bpftrace program.
main.go with the code from the previous slide;
go build -o randomnumbers main.go
randomnumbers in the current folder;
./randomnumbers;
uretprobe:
bpftrace -e \
  'uretprobe:./randomnumbers:"main.giveMeNumber" { printf("%d\n", retval) }'
Bonus point! Try to do an objdump -t randomnumbers | grep -i giveMe, what do you notice?
bpftrace;
bpftrace can be used only for eBPF based tracing;
bpftrace/tools folder, saves a lot of time;
apiVersion: v1
kind: Pod
metadata:
  name: happy-borg
spec:
  shareProcessNamespace: true
  containers:
  - name: execsnoop
    image: calavera/execsnoop
    securityContext:
      privileged: true
    volumeMounts:
    - name: sys # mount the debug filesystem
      mountPath: /sys
      readOnly: true
    - name: headers # mount the kernel headers required by bcc
      mountPath: /usr/src
      readOnly: true
    - name: modules # mount the kernel modules required by bcc
      mountPath: /lib/modules
      readOnly: true
  - name: container doing random work
    ...
Main use cases
At different levels
BPF_PROG_TYPE_SOCKET_FILTER
)tcpdump
;Tcpdump
libpcap
In a new terminal, execute tcpdump with a filter and use the -d option to dump the generated BPF assembly.
tcpdump -d 'ip and tcp port 8080'
What do you see? Anything noteworthy?
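If the listing looks cryptic, it helps to see how little is needed to execute it. For the much simpler filter 'ip' on an Ethernet interface, the dump typically has four instructions: load the half-word at offset 12 (the EtherType), compare it with 0x800, and return either the snapshot length (accept) or 0 (reject). The sketch below is a toy interpreter for just those three opcodes; the tuple encoding and function name are invented, and a real classic BPF machine has many more instructions.

```python
import struct

# (opcode, operand, jump_if_true, jump_if_false), jump targets are absolute,
# mirroring what `tcpdump -d ip` typically prints on Ethernet:
#   (000) ldh [12]
#   (001) jeq #0x800  jt 2  jf 3
#   (002) ret #262144
#   (003) ret #0
PROGRAM = [
    ("ldh", 12, None, None),
    ("jeq", 0x0800, 2, 3),
    ("ret", 262144, None, None),
    ("ret", 0, None, None),
]


def run_filter(program, packet):
    """Run the toy program; return bytes to capture (0 rejects the packet)."""
    acc = 0   # the accumulator register
    pc = 0    # program counter
    while pc < len(program):
        op, k, jt, jf = program[pc]
        if op == "ldh":
            if k + 2 > len(packet):
                return 0  # an out-of-bounds load rejects the packet
            (acc,) = struct.unpack_from("!H", packet, k)
            pc += 1
        elif op == "jeq":
            pc = jt if acc == k else jf
        elif op == "ret":
            return k
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return 0


if __name__ == "__main__":
    ipv4_frame = bytes(12) + b"\x08\x00" + bytes(20)
    arp_frame = bytes(12) + b"\x08\x06" + bytes(28)
    print(run_filter(PROGRAM, ipv4_frame))
    print(run_filter(PROGRAM, arp_frame))
```

Adding 'tcp port 8080' to the filter simply produces more loads and jumps of the same kind.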
sk_buff struct
Here's an example; the program type is given by the SEC("socket") definition, which gets translated to BPF_PROG_TYPE_SOCKET_FILTER.
SEC("socket")
int socket_prog(struct __sk_buff *skb) {
  /* Read the protocol field of the IP header */
  int proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
  int one = 1;
  int *el = bpf_map_lookup_elem(&countmap, &proto);

  if (el) {
    (*el)++;
  } else {
    el = &one;
  }

  bpf_map_update_elem(&countmap, &proto, el, BPF_ANY);
  return 0;
}
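The same counting logic can be tried in user space on raw frame bytes. This is only a sketch of what the program computes, not of how it is loaded: countmap is an ordinary dict here, and the helper names and test frames are made up. The protocol field sits 9 bytes into the IPv4 header, so with a 14-byte Ethernet header it is byte 23 of the frame.

```python
ETH_HLEN = 14                # length of an Ethernet header
IPHDR_PROTOCOL_OFFSET = 9    # offsetof(struct iphdr, protocol)


def count_protocols(frames):
    """Count frames per IP protocol number, like the socket program does."""
    countmap = {}
    offset = ETH_HLEN + IPHDR_PROTOCOL_OFFSET
    for frame in frames:
        if len(frame) <= offset:
            continue  # too short to contain the protocol byte
        proto = frame[offset]
        # lookup + increment + update, collapsed into one dict operation
        countmap[proto] = countmap.get(proto, 0) + 1
    return countmap


def make_frame(proto):
    """Build a dummy frame whose only meaningful byte is the protocol."""
    frame = bytearray(ETH_HLEN + 20)
    frame[ETH_HLEN + IPHDR_PROTOCOL_OFFSET] = proto
    return bytes(frame)


if __name__ == "__main__":
    frames = [make_frame(6), make_frame(17), make_frame(6)]  # TCP, UDP, TCP
    print(count_protocols(frames))
```

Like the C version, this does not check the EtherType first, so non-IP frames are counted under whatever byte happens to be at that offset.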
cls_bpf;
Among tc use cases there are:
sk_buff
Here’s a diagram showing the interactions:
SEC("classifier")
static inline int classification(struct __sk_buff *skb) {
  void *data_end = (void *)(long)skb->data_end;
  void *data = (void *)(long)skb->data;
  struct ethhdr *eth = data;
  __u16 h_proto;
  __u64 nh_off = 0;

  /* The verifier requires a bounds check before every packet access */
  nh_off = sizeof(*eth);
  if (data + nh_off > data_end) {
    return TC_ACT_OK;
  }

  /* Let non-IPv4 traffic through */
  h_proto = eth->h_proto;
  if (h_proto != bpf_htons(ETH_P_IP)) {
    return TC_ACT_OK;
  }

  struct iphdr *iph = data + nh_off;
  if ((void *)(iph + 1) > data_end) {
    return TC_ACT_OK;
  }

  /* Drop TCP packets (the slide had "-=" here; "==" is assumed) */
  if (iph->protocol == IPPROTO_TCP) {
    return TC_ACT_SHOT;
  }

  return TC_ACT_OK;
}
The classifier program is added to the qdisc using tc:
tc filter add dev eth0 ingress bpf obj classifier.o flowid 0:
BPF_PROG_TYPE_XDP
All this comes with advantages and disadvantages:
Because they run before the kernel networking stack processes the packet, XDP programs can drop packets very efficiently. Unlike tc programs, XDP programs can only be attached to ingress traffic.
SEC("mysection")
int myprogram(struct xdp_md *ctx) {
  int ipsize = 0;
  void *data = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;
  struct ethhdr *eth = data;
  struct iphdr *ip;

  ipsize = sizeof(*eth);
  ip = data + ipsize;
  ipsize += sizeof(struct iphdr);

  /* Bounds check: the packet must hold Ethernet + IP headers */
  if (data + ipsize > data_end) {
    return XDP_DROP;
  }

  /* Drop TCP, let everything else through */
  if (ip->protocol == IPPROTO_TCP) {
    return XDP_DROP;
  }

  return XDP_PASS;
}
It can be loaded on any interface using:
ip link set dev enp0s8 xdp obj udp.o sec mysection
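Before attaching the program to a real interface, the decision it makes can be checked on paper. The following user-space sketch mirrors the two checks of the XDP program above on raw bytes; the function name and test packets are invented, and the verdict numbers are the values of XDP_DROP and XDP_PASS in linux/bpf.h.

```python
XDP_DROP, XDP_PASS = 1, 2    # values of enum xdp_action in linux/bpf.h
ETH_HLEN = 14                # sizeof(struct ethhdr)
IP_HLEN = 20                 # sizeof(struct iphdr)
IPPROTO_TCP = 6


def xdp_verdict(packet):
    """Mirror the XDP program: drop short packets and TCP, pass the rest."""
    # Same bounds check as `data + ipsize > data_end`
    if ETH_HLEN + IP_HLEN > len(packet):
        return XDP_DROP
    protocol = packet[ETH_HLEN + 9]  # offsetof(struct iphdr, protocol) == 9
    if protocol == IPPROTO_TCP:
        return XDP_DROP
    return XDP_PASS


if __name__ == "__main__":
    tcp = bytearray(ETH_HLEN + IP_HLEN)
    tcp[ETH_HLEN + 9] = IPPROTO_TCP
    udp = bytearray(ETH_HLEN + IP_HLEN)
    udp[ETH_HLEN + 9] = 17
    print(xdp_verdict(bytes(tcp)), xdp_verdict(bytes(udp)), xdp_verdict(b"short"))
```

In the kernel the bounds check is not optional: the verifier refuses to load a program that reads packet bytes without it.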
Here’s the seccomp data structure for filters as from linux/seccomp.h
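struct seccomp_data {
  int nr;                    /* system call number */
  __u32 arch;                /* AUDIT_ARCH_* value */
  __u64 instruction_pointer; /* CPU instruction pointer */
  __u64 args[6];             /* system call arguments */
};
This allows filtering based on the system call, its arguments, or a combination of them.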
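A seccomp filter reads this structure by byte offset, so its layout matters more than its field names. The sketch below rebuilds it with ctypes to show the offsets, then applies a toy allow/deny decision. The decide function and the deny list are made up for illustration (59 is execve on x86-64); a real filter is a BPF program installed with the seccomp machinery, not a Python function.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Python mirror of struct seccomp_data from linux/seccomp.h."""
    _fields_ = [
        ("nr", ctypes.c_int),
        ("arch", ctypes.c_uint32),
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),
    ]


def decide(data, denied_syscalls):
    """Toy filter: deny by system call number, allow everything else."""
    return "deny" if data.nr in denied_syscalls else "allow"


if __name__ == "__main__":
    for name in ("nr", "arch", "instruction_pointer", "args"):
        print(name, getattr(SeccompData, name).offset)
    print("size", ctypes.sizeof(SeccompData))

    # Hypothetical deny list: 59 is execve on x86-64
    call = SeccompData(nr=59)
    print(decide(call, denied_syscalls={59}))
```

A real filter should also check arch before trusting nr, since system call numbers differ between architectures.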
git clone https://gist.github.com/fntlnz/08ae20befb91befd9a53cd91cdc6d507 seccomp-exercise
cd seccomp-exercise
README.md, what do you notice?