Linux extended Berkeley Packet Filters

Be kind to the WiFi!
Be kind to others
Thank you!

Slides: https://workshop.bpf.sh/


Hello! We are:
- 🐕 David Calavera (@calavera, Netlify)
- 🐕 Lorenzo Fontana (@fntlnz, Sysdig)
The workshop will run from 9:00am to 12:30pm
Feel free to interrupt for questions at any time


BPF book cover


Prerequisites

A machine with Linux Kernel 4.18+ (or the provided Vagrant machine)
Be comfortable with the UNIX command line
- navigating directories
- editing files
- a little bit of bash-fu (environment variables, loops)


Tell me and I forget.
Teach me and I remember.
Involve me and I learn.

Misattributed to Benjamin Franklin

(Probably inspired by Chinese Confucian philosopher Xunzi)


Terminals

Once in a while, the instructions will say:
“Open a new terminal.”

There are multiple ways to do this:

create a new window or tab on your machine, and SSH into the VM;
use screen or tmux on the VM and open a new window from there;
or, if you are working on a local Linux machine, just open a new terminal there.

You are welcome to use the method that you feel the most comfortable with.


Tmux cheatsheet

Tmux is a terminal multiplexer like screen.

You don’t have to use it or even know about it to follow along.
But some of us like to use it to switch between terminals.
It comes preinstalled in the Vagrant machine we provided

Ctrl-b c → creates a new window
Ctrl-b n → go to next window
Ctrl-b p → go to previous window
Ctrl-b " → split window top/bottom
Ctrl-b % → split window left/right
Ctrl-b Alt-1 → rearrange windows in columns
Ctrl-b Alt-2 → rearrange windows in rows
Ctrl-b arrows → navigate to other windows
Ctrl-b d → detach session
tmux attach → reattach to session


Vagrant

If you don’t know what Vagrant is, don’t worry.
It’s just a tool for creating virtual machines; we use it to give everyone the same VM with all the eBPF tools preinstalled!
This workshop comes with a reference environment expressed in a Vagrantfile.
You don’t have to use this one, but be prepared to install stuff!

OS X

brew install --cask virtualbox
brew install --cask vagrant

Windows

Download Vagrant and Virtualbox

Ubuntu

apt install vagrant virtualbox


Non-Vagrant (a.k.a. all the other environments)

Make sure to have:

git
an editor of your choice
gcc
clang
go

The other tools we need, bpftrace and bcc, have their own setup instructions in their respective chapters.


Vagrant

After cloning the workshop repository, enter the environment folder:

git clone https://github.com/bpftools/bpf-workshop.git
cd bpf-workshop/environment

Then there are four main things you can do:

# Start the environment
vagrant up
# Stop the environment
vagrant halt
# Destroy the environment
vagrant destroy
# Obtain a shell
vagrant ssh


Hands-on sections

The whole workshop is hands-on
We are going to write some eBPF programs
All hands-on sections are clearly identified, like the gray rectangle below

This is the stuff you’re supposed to do!


Table of contents

Introduction
The BPF in-kernel Virtual Machine
BCC
bpftrace
eBPF and Kubernetes
eBPF and Linux Networking
Linux Kernel security and eBPF


Introduction


Introduction

Classic BPF was introduced in the paper “The BSD Packet Filter: A New Architecture for User-level Packet Capture”
A virtual machine designed to work efficiently with register-based CPUs
Packet filtering without copying data

The extended BPF implementation (eBPF)

Introduced in 2014 by Alexei Starovoitov
Went from two 32-bit registers to ten 64-bit registers
Initially designed to optimize network filters

The BPF in-kernel Virtual Machine


The BPF in-kernel Virtual Machine

Implements a general-purpose, low-level, RISC-like instruction set
Runs the instructions in response to events triggered by the kernel
Implements a verifier, so that your programs can’t break the kernel
Has different interfaces for different types of programs
Widely supported in the kernel
Has an upstream LLVM backend, so you can compile eBPF code with clang

The BPF in-kernel Virtual Machine

eBPF Virtual Machine Diagram


BPF … eBPF … 🤔

BPF is the classic implementation, suitable only for basic filtering; it is also referred to as cBPF;
The eBPF instruction set is wider than the BPF instruction set;
BPF does not support maps, eBPF does;
eBPF has general-purpose registers and a stack, BPF has only an accumulator, an index register, and a scratch memory store;

Maps

BPF maps are data stores that live in the kernel;
They can be accessed by any BPF program that knows about them;
Programs that run in user space can also access these maps by using file descriptors;
You can store any kind of data in a map, as long as you specify the key and value sizes correctly beforehand;
The kernel treats keys and values as binary blobs and doesn’t care about what you keep in a map;
This is what we use to let user-space programs extract information from, or feed information into, BPF programs running in the kernel!
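
To make the “binary blobs” point concrete, here is a minimal user-space sketch in plain Python. It does not talk to the kernel: it only shows how a key and a value are serialized to fixed sizes, which is what a map declared with a 4-byte key and an 8-byte value would expect. The names KEY_FORMAT, VALUE_FORMAT, encode_entry and decode_entry are made up for this example.

```python
import struct

# Hypothetical layout: a 4-byte unsigned key (e.g. a PID) and an
# 8-byte unsigned value (e.g. a counter), both in native byte order.
KEY_FORMAT = "=I"
VALUE_FORMAT = "=Q"
KEY_SIZE = struct.calcsize(KEY_FORMAT)      # 4 bytes
VALUE_SIZE = struct.calcsize(VALUE_FORMAT)  # 8 bytes


def encode_entry(pid, counter):
    """Serialize a (pid, counter) pair into the raw bytes a map would store."""
    return struct.pack(KEY_FORMAT, pid), struct.pack(VALUE_FORMAT, counter)


def decode_entry(key_blob, value_blob):
    """Turn raw key/value bytes back into Python integers."""
    if len(key_blob) != KEY_SIZE or len(value_blob) != VALUE_SIZE:
        raise ValueError("blob size does not match the declared map sizes")
    (pid,) = struct.unpack(KEY_FORMAT, key_blob)
    (counter,) = struct.unpack(VALUE_FORMAT, value_blob)
    return pid, counter


key, value = encode_entry(181, 42)
```

The size check in decode_entry mirrors why the sizes must be declared up front: a blob of the wrong length cannot be interpreted.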

Many different types of maps

Hash table: BPF_MAP_TYPE_HASH
Array: BPF_MAP_TYPE_ARRAY
Program array maps: BPF_MAP_TYPE_PROG_ARRAY, this one is magic: it stores references to BPF programs so that you can jump from one BPF program to another (tail calls);
Perf events array maps: BPF_MAP_TYPE_PERF_EVENT_ARRAY
Per-CPU hash maps: BPF_MAP_TYPE_PERCPU_HASH
Per-CPU array maps: BPF_MAP_TYPE_PERCPU_ARRAY
Stack trace maps: BPF_MAP_TYPE_STACK_TRACE
Cgroup array maps: BPF_MAP_TYPE_CGROUP_ARRAY
Hash and per-CPU hash with LRU eviction: BPF_MAP_TYPE_LRU_HASH, BPF_MAP_TYPE_LRU_PERCPU_HASH
Longest Prefix Match (LPM) trie: BPF_MAP_TYPE_LPM_TRIE
Array of maps and hash of maps: BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
And many more! Find all of them in man 2 bpf


Maps operations

Look up a single element value, bpf_map_lookup_elem
Remove an element, bpf_map_delete_elem
Update an element, bpf_map_update_elem
Iterate over elements by getting the next key in the map, bpf_map_get_next_key
Look up, get the value, and delete in a single atomic operation, bpf_map_lookup_and_delete_elem
Concurrent access to a map value can be serialized with bpf_spin_lock, which is a spin lock (not a semaphore) stored inside the value;
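
The operations above can be mimicked with a toy, user-space model in plain Python. This is only a sketch of the semantics, not how the kernel implements maps: ToyHashMap and iterate are hypothetical names, errors are raised as Python exceptions where the real calls return error codes, and only the flag values BPF_ANY, BPF_NOEXIST and BPF_EXIST (0, 1, 2) are taken from the kernel headers.

```python
BPF_ANY = 0      # create a new element or update an existing one
BPF_NOEXIST = 1  # create a new element only if it does not exist
BPF_EXIST = 2    # update an existing element only


class ToyHashMap:
    """User-space stand-in for a hash map; not the kernel implementation."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = {}

    def update_elem(self, key, value, flags=BPF_ANY):
        exists = key in self._data
        if flags == BPF_NOEXIST and exists:
            raise KeyError("element already exists")
        if flags == BPF_EXIST and not exists:
            raise KeyError("element does not exist")
        if not exists and len(self._data) >= self.max_entries:
            raise MemoryError("map is full")
        self._data[key] = value

    def lookup_elem(self, key):
        return self._data.get(key)  # None plays the role of a NULL pointer

    def delete_elem(self, key):
        del self._data[key]

    def get_next_key(self, key=None):
        """Return the key after `key`, or the first key when `key` is unknown."""
        keys = list(self._data)
        if key not in self._data:
            return keys[0] if keys else None
        index = keys.index(key) + 1
        return keys[index] if index < len(keys) else None

    def lookup_and_delete_elem(self, key):
        return self._data.pop(key, None)


def iterate(toy_map):
    """Walk the whole map one get_next_key call at a time."""
    key = toy_map.get_next_key()
    while key is not None:
        yield key, toy_map.lookup_elem(key)
        key = toy_map.get_next_key(key)
```

Iteration is built from get_next_key plus a lookup because the map API has no dedicated "list everything" call in this model; each step asks for the key that follows the previous one.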


BPF programs

Code that’s triggered by events in the kernel
Context arguments that depend on the event that triggered the program
Must always terminate
Cannot include unbounded loops
Limited in the number of instructions to execute (changing soon)
Can trigger other BPF programs

BPF program helpers

General helpers available to any program, like bpf_trace_printk and bpf_get_current_pid_tgid
Specialized helpers available only to specific types of programs, like bpf_perf_event_output
https://github.com/iovisor/bpf-docs/blob/master/bpf_helpers.rst
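
As an illustration of what a general helper hands back: bpf_get_current_pid_tgid returns a single 64-bit number, with the kernel tgid (the process ID as seen from user space) in the upper 32 bits and the kernel pid (the thread ID) in the lower 32 bits. The plain-Python sketch below only does the bit arithmetic; split_pid_tgid and join_pid_tgid are names invented for this example.

```python
def split_pid_tgid(pid_tgid):
    """Split the 64-bit helper result into (tgid, pid) as the kernel names them."""
    tgid = pid_tgid >> 32        # what user space calls the process ID
    pid = pid_tgid & 0xFFFFFFFF  # what user space calls the thread ID
    return tgid, pid


def join_pid_tgid(tgid, pid):
    """Inverse of split_pid_tgid, handy for building test values."""
    return (tgid << 32) | pid
```

In a BPF program written in C the same split is usually done with a shift and a cast to a 32-bit type.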


BPF program types

Socket filtering: BPF_PROG_TYPE_SOCKET_FILTER, BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_SK_MSG, BPF_PROG_TYPE_SK_REUSEPORT
Tracing: BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_RAW_TRACEPOINT
XDP: BPF_PROG_TYPE_XDP
Perf events: BPF_PROG_TYPE_PERF_EVENT
Cgroups: BPF_PROG_TYPE_CGROUP_SKB, BPF_PROG_TYPE_CGROUP_SOCK, BPF_PROG_TYPE_CGROUP_DEVICE, BPF_PROG_TYPE_CGROUP_SOCK_ADDR
Infrared devices: BPF_PROG_TYPE_LIRC_MODE2

BPF program example
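#include <linux/bpf.h>
#define SEC(NAME) __attribute__((section(NAME), used))
/* Helper declaration: bind the name to the kernel helper ID. */
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
    (void *)BPF_FUNC_trace_printk;
/* Runs every time a process enters the execve system call. */
SEC("tracepoint/syscalls/sys_enter_execve")
int bpf_prog(void *ctx) {
  char msg[] = "Hello, BPF World!";
  bpf_trace_printk(msg, sizeof(msg));
  return 0;
}
/* bpf_trace_printk is a GPL-only helper, so the license must be declared. */
char _license[] SEC("license") = "GPL";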



BPF program example (part 2)
clang -O2 -target bpf -c hello_world_kern.c -o hello_world_kern.o



BPF program example (part 3)
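#include <stdio.h>
#include "bpf_load.h"
int main(int argc, char **argv) {
  /* Load the compiled object file and attach its programs. */
  if (load_bpf_file("hello_world_kern.o") != 0) {
    printf("The kernel didn't load the BPF program\n");
    return -1;
  }
  /* Print what bpf_trace_printk writes to the kernel trace pipe. */
  read_trace_pipe();
  return 0;
}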



Other resources


BCC


The BPF Compiler Collection

Toolkit to create and manipulate BPF programs
Connects BPF programs with high-level programming languages
C++, Python, Lua, and Go frontends
Dynamic loading and unloading of BPF programs

BCC included tools

Tracing and monitoring
Networking
Introspection

BCC hello world

In the bcc/examples folder;
With root permissions;
Execute the hello_world.py tool;


BCC hello world distilled

from bcc import BPF
# Raw string: keeps \n as an escape sequence for the C compiler
source = r"""
int kprobe__sys_clone(void *ctx) {
  bpf_trace_printk("Hello, World!\n");
  return 0;
}
"""
BPF(text = source).trace_print()


BCC perf events

Real-time event service between BPF and the frontend.
Active buffer polling

BCC perf events exercise (part 1)
bpf_source = """
#include <uapi/linux/ptrace.h>
BPF_PERF_OUTPUT(events);
struct data_t {
    char comm[16];
};
"""



BCC perf events exercise (part 2)
bpf_source += """
int on_execve(struct pt_regs *ctx,
    const char __user *filename,
    const char __user *const __user *__argv,
    const char __user *const __user *__envp)
{
  struct data_t data = {};
  bpf_get_current_comm(&data.comm, sizeof(data.comm));
  events.perf_submit(ctx, &data, sizeof(data));
  return 0;
}
"""



BCC perf events exercise (part 3)
from bcc import BPF
from bcc.utils import printb
def dump_data(cpu, data, size):
    event = bpf["events"].event(data)
    printb(b"%-16s" % event.comm)
bpf = BPF(text = bpf_source)
execve_function = bpf.get_syscall_fnname("execve")
bpf.attach_kprobe(event = execve_function, fn_name = "on_execve")
bpf["events"].open_perf_buffer(dump_data)
while 1:
    bpf.perf_buffer_poll()



BCC perf events source

https://workshop.bpf.sh/exercises/mini_exec_snoop.py


BCC Profile

Sample stack traces to profile CPU usage
Observe where a running application is spending CPU time

BCC Profile exercise
sudo tools/profile -p PID



BCC Profile exercise (Part 2)

Download the FlameGraph scripts:

git clone https://github.com/brendangregg/FlameGraph

Generate a flame graph from your profiled data:

sudo tools/profile -p PID -f > /tmp/profile.out
./FlameGraph/flamegraph.pl /tmp/profile.out > /tmp/profile-graph.svg \
  && firefox /tmp/profile-graph.svg
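
The -f flag makes profile print “folded” stacks: one line per unique stack, with frames separated by semicolons and followed by a sample count, which is the input flamegraph.pl expects. As a rough illustration of what that file contains, the plain-Python sketch below sums the samples per leaf function. The sample lines and the names parse_folded and samples_per_leaf are invented for this example.

```python
from collections import Counter


def parse_folded(lines):
    """Parse 'frame1;frame2;... count' lines into (frames, count) pairs."""
    stacks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        stacks.append((stack.split(";"), int(count)))
    return stacks


def samples_per_leaf(stacks):
    """Sum the samples by the last frame, i.e. the function that was on CPU."""
    totals = Counter()
    for frames, count in stacks:
        totals[frames[-1]] += count
    return totals


# Invented sample data in the folded format.
FOLDED = """
randomnumbers;main.main;main.giveMeNumber 7
randomnumbers;main.main;fmt.Printf 12
randomnumbers;main.main;main.giveMeNumber 3
"""

totals = samples_per_leaf(parse_folded(FOLDED.splitlines()))
```

A flame graph shows the same totals visually: the width of each box is proportional to the number of samples in which that frame appears.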


Takeaways

Convenient interop with other languages
Write both one-off tools and long-running background processes

bpftrace


bpftrace: BPF observability front-end

On GitHub https://github.com/iovisor/bpftrace

What it is:

Higher level language to write eBPF programs;
Built from the ground-up for BPF and Linux;
Used in production at Netflix, Facebook, etc;
Custom one-liners;
Comes with tools;
It is just for tracing;

What it is NOT:

A framework to build your loaders;
You can’t do classic bpf with it (like seccomp programs or socket probe types);
It does not support traffic control and XDP;


bpftrace: Installation

We will need to do some exercises with bpftrace. If you are not using the Vagrant environment, you might want to install it now!

Ubuntu snap package

sudo snap install --devmode bpftrace
sudo snap connect bpftrace:system-trace

Fedora (28 or later)

sudo dnf install bpftrace

You can find further instructions here


bpftrace: Syntax

bpftrace-syntax


bpftrace: Probes

supported bpf probe types


bpftrace: Probe type shortcuts

full        shortcut  Description
tracepoint  t         Kernel static tracepoints
usdt        U         User-level statically defined tracing
kprobe      k         Kernel function tracing
kretprobe   kr        Kernel function returns
uprobe      u         User-level function tracing
uretprobe   ur        User-level function returns
profile     p         Timed sampling across all CPUs
interval    i         Interval output
software    s         Kernel software events
hardware    h         Processor hardware events


bpftrace: Filters

/pid == 181/
/comm != "sshd"/
/@ts[tid]/

bpftrace: Actions

Per-event output

printf()
system()
join()
time()

Map Summaries

@ = count() or @++
@ = hist()


bpftrace: Functions

function                                  description
hist(int n)                               Produce a log2 histogram of values of n
lhist(int n, int min, int max, int step)  Produce a linear histogram of values of n
count()                                   Count the number of times this function is called
sum(int n)                                Sum this value
min(int n)                                Record the minimum value seen
max(int n)                                Record the maximum value seen
avg(int n)                                Average this value
stats(int n)                              Return the count, average, and total for this value
delete(@x)                                Delete the map element passed in as an argument
str(char *s [, int length])               Returns the string pointed to by s
printf(char *fmt, …)                      Print formatted to stdout


bpftrace: Functions (cont’d)

function                             description
print(@x[, int top [, int div]])     Print a map, with optional top entry count and divisor
clear(@x)                            Delete all key/values from a map
sym(void *p)                         Resolve kernel address
usym(void *p)                        Resolve user space address
ntop([int af, ]int|char[4|16] addr)  Convert IP address data to text
kaddr(char *name)                    Resolve kernel symbol name
uaddr(char *name)                    Resolve user space symbol name
reg(char *name)                      Returns the value stored in the named register
join(char *arr[] [, char *delim])    Prints the string array
time(char *fmt)                      Print the current time
cat(char *filename)                  Print file content
system(char *fmt)                    Execute shell command
exit()                               Quit bpftrace


bpftrace: Variable types

Basic Variables

@global
@thread_local[tid]
$scratch

Associative Arrays

@array[key] = value

Builtins


bpftrace: Builtin Variables

variable  description
tid       Thread ID (kernel pid)
cgroup    Cgroup ID of the current process
uid       User ID
gid       Group ID
nsecs     Nanosecond timestamp
elapsed   Nanosecond timestamp since bpftrace initialization
cpu       Processor ID
comm      Process name


bpftrace: Builtin Variables (cont’d)

variable            description
pid                 Process ID (kernel tgid)
stack               Kernel stack trace
ustack              User stack trace
arg0, arg1, … etc.  Arguments to the function being traced
retval              Return value from function being traced
func                Name of the function currently being traced
probe               Full name of the probe
curtask             Current task_struct as a u64
rand                Random number of type u32
$1, $2, … etc.      Positional parameters to the bpftrace program


bpftrace hands on: Tools!

We will clone the bpftrace repository on our Linux machine;
We are not cloning it to install bpftrace itself, but to get all the tools under the tools folder
Clone the bpftrace repo

  git clone https://github.com/iovisor/bpftrace.git
  cd bpftrace/tools



bpftrace hands on: Trace or count TCP retransmits

In the bpftrace tools folder, there’s a tool called tcpretrans.bt;
TCP guarantees that all the received bytes are identical to, and in the same order as, those sent. To do that, the sender retransmits any segment that is not acknowledged in time;
this technique is called positive acknowledgement with re-transmission;
When there are many retransmits your system can suffer significant overhead, so you want to know when a retransmit occurs; tcpretrans.bt does just that
Retransmits are usually a sign of poor network health, and this tool is
useful for their investigation. Unlike using tcpdump, this tool has very
low overhead, as it only traces the retransmit function. It also prints
additional kernel details: the state of the TCP session at the time of the
retransmit.

bpftrace hands on: Trace or count TCP retransmits (cont’d)

In the bpftrace/tools folder;
With root permissions;
Execute the tcpretrans.bt tool;

  bpftrace tcpretrans.bt

Once it’s started, an easy way to trigger some retransmits is to try to connect to a port that never answers (for example, one behind a firewall that silently drops packets);
Try it on a new terminal while leaving tcpretrans.bt active!

  telnet bpf.sh 9090


bpftrace hands on: tracing read bytes using a kretprobe

We will use bpftrace to instrument the vfs_read function in the kernel using a kretprobe;
We will create a map called @bytes that holds a linear histogram; the arguments of lhist are: value, min, max, step. The value we pass, retval, is the return value of vfs_read(): the number of bytes read;

Execute this one-liner using bpftrace, let it run for a while, then use Ctrl-C to dump the results

bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200); }'

In Linux, all files are accessed through the Virtual Filesystem Switch, or VFS, a layer of code which implements generic filesystem actions and vectors requests to the correct specific code to handle the request.
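
To see what lhist(retval, 0, 2000, 200) does with each value, here is a user-space replay of the bucketing in plain Python. It is a sketch under the assumption that values below min and values at or above max land in two open-ended buckets, which matches the labels bpftrace prints; lhist_bucket, lhist and the sample sizes are invented for this example.

```python
from collections import Counter


def lhist_bucket(value, low, high, step):
    """Return the label of the linear bucket that `value` falls into."""
    if value < low:
        return "(..., %d)" % low
    if value >= high:
        return "[%d, ...)" % high
    start = low + ((value - low) // step) * step
    return "[%d, %d)" % (start, min(start + step, high))


def lhist(values, low, high, step):
    """Count values per bucket, like a user-space replay of lhist()."""
    return Counter(lhist_bucket(v, low, high, step) for v in values)


# Invented return values of vfs_read(): negative numbers are errors.
sizes = [-11, 0, 1, 199, 200, 832, 1999, 2000, 65536]
buckets = lhist(sizes, 0, 2000, 200)
```

Negative return values are error codes rather than byte counts, which is why a bucket below zero shows up when reads fail.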


bpftrace hands on: tracing read bytes using a tracepoint

We want to do the same thing we did with the kretprobe in the previous exercise

Execute this one liner using bpftrace

bpftrace -e 'tracepoint:syscalls:sys_exit_read  { @bytes = lhist(args->ret, 0, 2000, 200); }'

Let it run for a while then use Ctrl-C to dump the results

What’s the difference? While very powerful (it can trace almost any kernel function), the kretprobe approach can’t be considered “stable”, because internal kernel functions can change between kernel versions. A tracepoint is a much more stable approach, because kernel developers treat tracepoints as a user-facing feature rather than an internal one. Whenever possible, use tracepoints instead of kprobe/kretprobe.


bpftrace hands on: reading userspace returns

We have a Go program that prints a random number every second.

package main
import(
  "time"
  "fmt"
  "math/rand"
)
func main() {
  for {
    time.Sleep(time.Second * 1)
    fmt.Printf("%d\n", giveMeNumber())
  }
}
func giveMeNumber() int {
  return rand.Intn(100) + rand.Intn(900)
}

We want to get the random number out of it using a bpftrace program.


bpftrace hands on: reading userspace returns

Create a file named main.go with the code from previous slide;
Then, compile it with:

  go build -o randomnumbers main.go

This will create a binary named randomnumbers in the current folder;
Once that is done, we just start the program ./randomnumbers;
Now, in a new terminal, we instrument the program using bpftrace and a uretprobe:

bpftrace -e \
  'uretprobe:./randomnumbers:"main.giveMeNumber"
  { printf("%d\n", retval) }'

Bonus point! Try to do an objdump -t randomnumbers | grep -i giveMe, what do you notice?


bpftrace: Internals

supported bpf probe types


Takeaways

There’s a higher-level language for using eBPF, called bpftrace;
bpftrace can be used only for eBPF-based tracing;
It’s pretty magic;
There are a LOT of premade tools you can use in the bpftrace/tools folder, which saves a lot of time;

Credits and References


eBPF and Kubernetes


Approach #1: Just use a container

A sidecar container sharing the process namespace;
You just provide an image with an eBPF loader as its entrypoint;
The loader will just load the program and execute it;
Not extremely generic but does the job!
A very flexible approach!

apiVersion: v1
kind: Pod
metadata:
  name: happy-borg
spec:
  shareProcessNamespace: true
  containers:
  - name: execsnoop
    image: calavera/execsnoop
    securityContext:
      privileged: true
    volumeMounts:
    - name: sys # mount the debug filesystem
      mountPath: /sys
      readOnly: true
    - name: headers # mount the kernel headers required by bcc
      mountPath: /usr/src
      readOnly: true
    - name: modules # mount the kernel modules required by bcc
      mountPath: /lib/modules
      readOnly: true
  - name: container doing random work
  ...



Approach #2: kubectl-trace

It’s basically bpftrace, but for kubectl!
It’s on GitHub: iovisor/kubectl-trace

img/kubernetes-kubectl-trace.png


eBPF and Linux Networking


eBPF and Linux Networking

Main use cases

Retrospective analysis of network traffic captured on a live system, using the pcap format for example;
Live packet filtering, e.g: allow only UDP traffic and discard anything else;
Live observation of a filtered set of packets flowing into a live system;

At different levels

cBPF packet filtering
Raw packet filtering (BPF_PROG_TYPE_SOCKET_FILTER)
Traffic control
XDP


cBPF and packet filtering

Packet filtering is done using an accumulator on which filters are applied, the classic BPF way;
One of the most popular use cases for it is tcpdump;
It doesn’t support the use of maps;

Tcpdump

Probably the best-known tool for live packet observation;
It is implemented as a frontend for libpcap;
Lets you define high-level filtering expressions that are then converted to a BPF program;
Tcpdump can dump the generated BPF for user inspection;
Can read from an existing pcap file and filter on it


hands on: Tcpdump packet filtering

In a new terminal, execute tcpdump with a filter and use the -d option to dump the generated BPF assembly.

tcpdump  -d  'ip and tcp port 8080'

What do you see? Anything noteworthy?


tcpdump hands on: What is that stuff?

tcpdump explanation
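
For readers without the picture, the logic that tcpdump -d prints for 'ip and tcp port 8080' on an Ethernet interface can be paraphrased in plain Python. This is a sketch of what such a generated filter typically checks, not the exact instruction list, which can vary with the libpcap version and the link type. The cBPF mnemonics in the comments are the kind of instructions you should expect to see in the dump; matches_filter and build_frame are names invented for this example.

```python
import struct

ETH_P_IP = 0x0800
IPPROTO_TCP = 6


def matches_filter(frame, port=8080):
    """Return True when an Ethernet frame matches 'ip and tcp port <port>'."""
    if len(frame) < 14 + 20:
        return False
    # ldh [12]: the EtherType must be IPv4.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return False
    # ldb [23]: the IP protocol must be TCP.
    if frame[23] != IPPROTO_TCP:
        return False
    # ldh [20]: fragments other than the first carry no TCP header.
    (flags_frag,) = struct.unpack_from("!H", frame, 20)
    if flags_frag & 0x1FFF:
        return False
    # ldxb 4*([14]&0xf): the IP header length locates the TCP header.
    ip_header_len = 4 * (frame[14] & 0x0F)
    tcp_offset = 14 + ip_header_len
    if len(frame) < tcp_offset + 4:
        return False
    src_port, dst_port = struct.unpack_from("!HH", frame, tcp_offset)
    return port in (src_port, dst_port)


def build_frame(protocol, src_port, dst_port):
    """Build a minimal, invented Ethernet + IPv4 + L4 header for testing."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = bytes([0x45, 0]) + bytes(7) + bytes([protocol]) + bytes(10)
    l4 = struct.pack("!HH", src_port, dst_port) + bytes(16)
    return eth + ip + l4
```

Note how everything is expressed as "load a value at a fixed offset, compare, jump": that is the accumulator style of classic BPF.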


Raw packets filtering

Attach a BPF program to a socket
All the packets received by it will be passed to the program as an __sk_buff struct
The program can decide whether to discard or allow them based on its logic

Here’s an example, the program type is given by the SEC("socket") definition that gets translated to BPF_PROG_TYPE_SOCKET_FILTER.

SEC("socket")
int socket_prog(struct __sk_buff *skb) {
  int proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
  int one = 1;
  int *el = bpf_map_lookup_elem(&countmap, &proto);
  if (el) {
    (*el)++;
  } else {
    el = &one;
  }
  bpf_map_update_elem(&countmap, &proto, el, BPF_ANY);
  return 0;
}
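
To follow what the program above computes, here is a user-space replay in plain Python: it reads the same byte (ETH_HLEN plus the offset of the protocol field in the IP header, which is 9) from each frame and counts frames per protocol number, with a Counter standing in for countmap. The frames are invented, and count_protocols and fake_frame are hypothetical names used only in this sketch.

```python
from collections import Counter

ETH_HLEN = 14              # length of an Ethernet header
IPHDR_PROTOCOL_OFFSET = 9  # offsetof(struct iphdr, protocol)


def count_protocols(frames):
    """Count frames per IP protocol number, like countmap in the C program."""
    countmap = Counter()
    for frame in frames:
        offset = ETH_HLEN + IPHDR_PROTOCOL_OFFSET
        if len(frame) <= offset:
            continue  # too short to hold the protocol byte
        countmap[frame[offset]] += 1
    return countmap


def fake_frame(protocol):
    """Invented frame: zeroed headers with only the protocol byte set."""
    frame = bytearray(ETH_HLEN + 20)
    frame[ETH_HLEN + IPHDR_PROTOCOL_OFFSET] = protocol
    return bytes(frame)


# Protocol 6 is TCP and protocol 17 is UDP.
counts = count_protocols([fake_frame(6), fake_frame(17), fake_frame(6), b"short"])
```

In the real setup a user-space loader would read the same counters out of the map through its file descriptor.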


Traffic Control (tc) and eBPF

tc is the kernel packet scheduling subsystem;
It’s made of mechanisms and queuing systems that decide how packets flow and whether they are accepted into the system;
It has a classifier, called cls_bpf, that can use a BPF program to make the decisions;

Among tc use cases there are:

Prioritize certain kinds of packets
Drop specific kinds of packets
Bandwidth distribution


Traffic Control cls_bpf hook points

cls_bpf can hook in ingress and egress
that means that you can manipulate both packets your machine receives and packets it sends!
programs receive an sk_buff

Here’s a diagram showing the interactions:

cls_bpf interactions


Example: TC program to drop all TCP packets

SEC("classifier")
static inline int classification(struct __sk_buff *skb) {
  void *data_end = (void *)(long)skb->data_end;
  void *data = (void *)(long)skb->data;
  struct ethhdr *eth = data;
  __u16 h_proto;
  __u64 nh_off = 0;
  nh_off = sizeof(*eth);
  if (data + nh_off > data_end) {
    return TC_ACT_OK;
  }
  h_proto = eth->h_proto;
  if (h_proto != bpf_htons(ETH_P_IP)) {
    return TC_ACT_OK;
  }
  struct iphdr *iph = data + nh_off;
  /* Bounds check required by the verifier before reading the IP header. */
  if ((void *)(iph + 1) > data_end) {
    return TC_ACT_OK;
  }
  if (iph->protocol == IPPROTO_TCP) {
    return TC_ACT_SHOT;
  }
  return TC_ACT_OK;
}

The classifier program is attached to a qdisc using tc; the qdisc has to exist first:

tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf obj classifier.o flowid 0:
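
The decisions the classifier is meant to take can be replayed in user space on raw bytes. The plain-Python sketch below mirrors the intended control flow (bounds check, EtherType check, bounds check, protocol check); it is a model for experimenting with inputs, not something tc can load. TC_ACT_OK and TC_ACT_SHOT use the values from the kernel headers (0 and 2), while classify and make_packet are names invented for this example.

```python
import struct

TC_ACT_OK = 0    # let the packet continue
TC_ACT_SHOT = 2  # drop the packet
ETH_P_IP = 0x0800
IPPROTO_TCP = 6
ETH_HLEN = 14
IP_HLEN = 20     # sizeof(struct iphdr), without options


def classify(packet):
    """Replay the classifier's decisions on a raw Ethernet frame."""
    # Same role as the check of the Ethernet header against data_end.
    if len(packet) < ETH_HLEN:
        return TC_ACT_OK
    (h_proto,) = struct.unpack_from("!H", packet, 12)
    if h_proto != ETH_P_IP:
        return TC_ACT_OK
    # Same role as the check of the IP header against data_end.
    if len(packet) < ETH_HLEN + IP_HLEN:
        return TC_ACT_OK
    protocol = packet[ETH_HLEN + 9]
    return TC_ACT_SHOT if protocol == IPPROTO_TCP else TC_ACT_OK


def make_packet(ethertype, protocol):
    """Invented test frame: zeroed headers plus EtherType and IP protocol."""
    packet = bytearray(ETH_HLEN + IP_HLEN)
    struct.pack_into("!H", packet, 12, ethertype)
    packet[ETH_HLEN + 9] = protocol
    return bytes(packet)
```

The length checks correspond to the comparisons against data_end in the C code: the verifier rejects a program that reads packet bytes without proving they are in bounds.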


eXpress Data Path (XDP)

Programs are of type BPF_PROG_TYPE_XDP
There are three operation modes:

Native: the network card driver supports XDP, code runs on the driver receive path;
Offloaded: the network card hardware supports XDP, the NIC processor executes the logic;
Generic: provided as a test mode for developers, for testing XDP programs without having the proper hardware;

Once packets are processed, XDP will return one of its possible codes:

XDP_DROP: drop the packet;
XDP_TX: send the packet back out of the same NIC it arrived on;
XDP_REDIRECT: similar to TX, but forward to another NIC or to a map of type CPU map;
XDP_PASS: allow the packet into the kernel network stack

Differences between TC and XDP

XDP programs are executed earlier in the ingress data path, before packets enter the main kernel network stack;
The program does not have access to a socket buffer struct (sk_buff) like with tc;
XDP programs instead take a different structure called xdp_buff that is an eager representation of the packet without metadata;

All this comes with advantages and disadvantages:

Being executed even before the kernel code, XDP programs can drop packets in a very efficient way. Compared to tc programs, XDP programs can only be attached to traffic in ingress to the system.


XDP packets processor

It executes BPF programs for XDP packets
It coordinates the interaction between them and the network stack
It ensures that packets are readable and writable, and allows attaching post-processing verdicts in the form of packet processor actions
The return codes illustrated earlier are its actions!

xdp packets processor


Example: XDP program to drop all TCP packets

SEC("mysection")
int myprogram(struct xdp_md *ctx) {
  int ipsize = 0;
  void *data = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;
  struct ethhdr *eth = data;
  struct iphdr *ip;
  ipsize = sizeof(*eth);
  ip = data + ipsize;
  ipsize += sizeof(struct iphdr);
  if (data + ipsize > data_end) {
    return XDP_DROP;
  }
  if (ip->protocol == IPPROTO_TCP) {
    return XDP_DROP;
  }
  return XDP_PASS;
}

It can be loaded on any interface using:

ip link set dev enp0s8 xdp obj udp.o sec mysection


Linux Kernel security and eBPF


Seccomp

Stands for Secure Computing;
Implements a filtering backend based on cBPF;
You can write a BPF program that filters the execution of any syscall by allowing/disallowing the ones you want based on your logic;

Here’s the seccomp data structure for filters as from linux/seccomp.h

struct seccomp_data {
    int nr;
    __u32 arch;
    __u64 instruction_pointer;
    __u64 args[6];
};

This allows filtering based on the system call, its arguments, or a combination of them.
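
To get a feel for the structure, the sketch below packs a seccomp_data value in plain Python and runs a toy policy over it. Real seccomp filters are cBPF programs that load 32-bit words from this structure at fixed offsets; this model only imitates the decision. The policy itself, toy_filter and make_seccomp_data are invented for this example, while the constants (SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO, AUDIT_ARCH_X86_64, and syscall number 1 for write on x86_64) come from the kernel headers.

```python
import struct

# int nr; __u32 arch; __u64 instruction_pointer; __u64 args[6];
SECCOMP_DATA_FORMAT = "=iIQ6Q"

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
AUDIT_ARCH_X86_64 = 0xC000003E
NR_WRITE_X86_64 = 1  # syscall number of write(2) on x86_64
EPERM = 1


def toy_filter(blob):
    """Hypothetical policy: deny write(2) to any descriptor other than 1 or 2."""
    nr, arch, _ip, *args = struct.unpack(SECCOMP_DATA_FORMAT, blob)
    # Syscall numbers differ per architecture, so check it first.
    if arch != AUDIT_ARCH_X86_64:
        return SECCOMP_RET_ERRNO | EPERM
    if nr == NR_WRITE_X86_64 and args[0] not in (1, 2):
        return SECCOMP_RET_ERRNO | EPERM
    return SECCOMP_RET_ALLOW


def make_seccomp_data(nr, args, arch=AUDIT_ARCH_X86_64):
    """Pack an invented seccomp_data value for testing."""
    padded = list(args) + [0] * (6 - len(args))
    return struct.pack(SECCOMP_DATA_FORMAT, nr, arch, 0, *padded)
```

Filters can only compare the raw argument values: a pointer argument is just a number to the filter, which cannot follow it into user memory.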


LSM Hooks

The Linux Security Modules (LSM) framework has a set of hooks to control the execution of (e)BPF programs;
It allows creating a fine-grained set of privileges around them when using a module that implements BPF hook support;
Currently implemented by Landlock and SELinux;
The only in-kernel-tree implementation is SELinux;
It’s based on the concept of hook calls instead of syscalls

hands on: Seccomp filters using bpf programs

Clone the exercise repository and cd into it

  git clone https://gist.github.com/fntlnz/08ae20befb91befd9a53cd91cdc6d507 seccomp-exercise
  cd seccomp-exercise


After following the instructions in README.md, what do you notice?

Credits

Thanks to all the eBPF authors and tools makers for their awesome work on it;
Many thanks to Jérôme Petazzoni, we adapted Jérôme’s template from container.training to Hugo, we also used the terminals setup instructions and the tmux cheatsheet from that deck!
Thanks to the remark authors for their work on it, it’s the tool we use to generate the slides;
Thanks to the hugo authors, for the awesome static site generator;
