eBPF: Up and Running

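This is a simple introduction to writing your first eBPF program using C and Golang. The first part covers the actual eBPF program, and the second part covers the user-space application.

Note that low-level technology like this depends heavily on the infrastructure it runs on. So, for transparency, this is what I am running: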

OS: Ubuntu 22.04  
Linux Header Version: 6.5.0-14-generic

I also installed a few dependencies via APT:

sudo apt-get -y install libbpf-dev bpfcc-tools

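This article assumes that anyone reading it has a basic understanding of C programming.

Many blogs and websites dive deep into what eBPF is (see the resources section), but for simplicity, let's assume eBPF is a way to extend the Linux kernel with add-on programs without changing the kernel source code.

For now, I think of eBPF as a hook into the kernel that allows logic to run in kernel space.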

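User Space vs. Kernel Space

When we talk about kernel space, we are generally talking about the operating system itself. It is a privileged area with full access to hardware and software resources. User space, on the other hand, is where you typically run everyday programs such as Google Chrome. User space is restricted in what it can access.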

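Choosing an Event to Hook Into

My current use case for learning eBPF is understanding how Kepler works. One thing Kepler does is calculate how long each process (identified by its PID) uses the CPU, by way of what is called a CPU scheduler switch.

CPU scheduling means switching between running processes to make better use of processing power (when one process is blocked, the CPU pauses it and switches to another).

So, if we wanted to replicate this functionality (naively), we would do the following: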

Know when a process starts using the CPU
Know when a process stops using the CPU
Calculate the time between those two moments

This should give us a rough estimate of how much CPU time each process takes, keeping in mind that a process is scheduled many times.
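The three steps above can be sketched in plain user-space C before any eBPF is involved. This is only an illustration under stated assumptions: `on_switch`, `cpu_time_ns`, `MAX_PIDS`, and the `proc_time` struct are hypothetical names, and the timestamps are passed in by hand instead of coming from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PIDS 8 /* hypothetical limit, just for this sketch */

/* Per-process bookkeeping: when it last got the CPU and its total time on it. */
struct proc_time {
    uint64_t start_ns; /* time of the most recent switch-in */
    uint64_t total_ns; /* accumulated time on the CPU */
    int running;       /* 1 while the process holds the CPU */
};

struct proc_time procs[MAX_PIDS];

/* Called once per simulated context switch at time now_ns. */
void on_switch(int prev_pid, int next_pid, uint64_t now_ns) {
    assert(prev_pid >= 0 && prev_pid < MAX_PIDS);
    assert(next_pid >= 0 && next_pid < MAX_PIDS);

    /* Steps 2 and 3: prev stops using the CPU, so add the slice it just used.
     * If we never saw prev start, we have nothing to measure and skip it. */
    if (procs[prev_pid].running) {
        procs[prev_pid].total_ns += now_ns - procs[prev_pid].start_ns;
        procs[prev_pid].running = 0;
    }

    /* Step 1: next starts using the CPU. */
    procs[next_pid].start_ns = now_ns;
    procs[next_pid].running = 1;
}

/* Total CPU time observed so far for a PID. */
uint64_t cpu_time_ns(int pid) {
    return procs[pid].total_ns;
}
```

For example, switching from PID 1 to PID 2 at t=100 and back to PID 1 at t=250 credits PID 2 with 150 ns, while PID 1 gets nothing for the first switch because its start was never observed.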

So, with the above information, we will need:

tracepoint/sched/sched_switch

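This is the event that notifies us whenever a process is scheduled.

Getting the Format of the Event

In BPF, each event calls the attached function with something called a "context". The context is essentially the information emitted by the event. We need to define a C struct to hold this information, but first we need to get the format of that struct. We can do that by running the following command: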

$ sudo cat /sys/kernel/debug/tracing/events/sched/sched_switch/format  
name: sched_switch  
ID: 327  
format:  
field:unsigned short common_type; offset:0; size:2; signed:0;  
field:unsigned char common_flags; offset:2; size:1; signed:0;  
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;  
field:int common_pid; offset:4; size:4; signed:1;  
field:char prev_comm[16]; offset:8; size:16; signed:0;  
field:pid_t prev_pid; offset:24; size:4; signed:1;  
field:int prev_prio; offset:28; size:4; signed:1;  
field:long prev_state; offset:32; size:8; signed:1;  
field:char next_comm[16]; offset:40; size:16; signed:0;  
field:pid_t next_pid; offset:56; size:4; signed:1;  
field:int next_prio; offset:60; size:4; signed:1;  
print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1), "|", { 0x00000001, "S" }, { 0x00000002, "D" }, { 0x00000004, "T" }, { 0x00000008, "t" }, { 0x00000010, "X" }, { 0x00000020, "Z" }, { 0x00000040, "P" }, { 0x00000080, "I" }) : "R", REC->prev_state & (((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio

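That is a lot of information to process, but for this use case we do not need to read any of the fields prefixed with common_. They still occupy the first 8 bytes of the event, though (prev_comm starts at offset 8), so our struct has to account for them. That leaves us with the following fields:

char prev_comm[16];  
pid_t prev_pid;  
int prev_prio;  
long prev_state;  
char next_comm[16];  
pid_t next_pid;  
int next_prio;

We can then use this information to create the following C struct, with 8 bytes of padding standing in for the common_ fields:

struct sched_switch_args {  
    // the common_* fields take up the first 8 bytes of the event  
    unsigned long long pad;  
    char prev_comm[16];  
    int prev_pid;  
    int prev_prio;  
    long prev_state;  
    char next_comm[16];  
    int next_pid;  
    int next_prio;  
};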

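The first thing to note is that I changed the type pid_t to int. This is simply because the underlying type of pid_t is int. We could use pid_t, but then we would need to include sys/types.h, which we do not otherwise need in this example. You can read more about this here.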
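One way to sanity-check a struct like this is to compare its member offsets with the offset values printed in the format output. The sketch below does that on the host with compile-time assertions. It rests on two assumptions that are not from the original text: a leading 8-byte `pad` member covering the common_ fields (the format output puts prev_comm at offset 8), and a 64-bit Linux host where `long` is 8 bytes, matching the size:8 reported for prev_state.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_COMM_LEN 16

/* Mirror of the sched_switch event layout; pad is an assumption that
 * stands in for the four common_* fields (2 + 1 + 1 + 4 = 8 bytes). */
struct sched_switch_args {
    unsigned long long pad;
    char prev_comm[TASK_COMM_LEN];
    int prev_pid;
    int prev_prio;
    long prev_state;
    char next_comm[TASK_COMM_LEN];
    int next_pid;
    int next_prio;
};

/* Each expected value is copied from the "offset:" column of the format file. */
static_assert(offsetof(struct sched_switch_args, prev_comm) == 8, "prev_comm");
static_assert(offsetof(struct sched_switch_args, prev_pid) == 24, "prev_pid");
static_assert(offsetof(struct sched_switch_args, prev_prio) == 28, "prev_prio");
static_assert(offsetof(struct sched_switch_args, prev_state) == 32, "prev_state");
static_assert(offsetof(struct sched_switch_args, next_comm) == 40, "next_comm");
static_assert(offsetof(struct sched_switch_args, next_pid) == 56, "next_pid");
static_assert(offsetof(struct sched_switch_args, next_prio) == 60, "next_prio");
```

If any offset drifts, for example because the padding is missing, compilation fails instead of the program silently reading the wrong bytes.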

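Creating the BPF Map

To collect data in kernel space and access it from user space, we need something called a BPF map. BPF maps are data structures that the kernel-side program fills in and that user space can read. In our case, we will use a hash map keyed by PID. This requires us to create three structures: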

A key struct that identifies the data

struct key_t {  
    // This is the process ID  
    // which we will use to identify  
    // in the hash map  
    __u32 pid;  
};

A value struct, which is how we store the data

struct val_t {  
    // used to understand the start time of the process  
    __u64 start_time;  
    // used to store the elapsed time of the process  
    __u64 elapsed_time;  
};

A BPF hash map that ties them together

struct {  
    // The type of BPF map we are creating  
    __uint(type, BPF_MAP_TYPE_HASH);  
    // specifying the type to be used for the key  
    __type(key, struct key_t);  
    // specifying the type to be used as the value  
    __type(value, struct val_t);  
    // max amount of entries to store in the map  
    __uint(max_entries, 10240);  
    // name of the map as well as a section macro  
    // from the bpf lib to designate this type  
    // as a BPF map  
} process_time_map SEC(".maps");

I have added comments to explain what each line of these structs is for.

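Creating Our eBPF Function

The final piece of the puzzle is creating the actual functionality. For this, we need an eBPF program.

This is a C function with some macros that identify it, so that it can interact with the types we defined earlier, for example: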

SEC("tracepoint/sched/sched_switch")  
int cpu_processing_time(struct sched_switch_args *ctx) {  
    // get the current time in ns  
    __u64 ts = bpf_ktime_get_ns();  
    // we need to check if the previous process is in our map  
    struct key_t prev_key = {  
        .pid = ctx->prev_pid,  
    };  
    struct val_t *val = bpf_map_lookup_elem(&process_time_map, &prev_key);  
    // if the previous PID does not exist it means that we just started  
    // watching or we missed the start somehow  
    // so we ignore it for now  
    if (val) {  
        // the previous process is leaving the CPU: calculate and store  
        // how long it ran since it was last switched in  
        __u64 elapsed_time = ts - val->start_time;  
        struct val_t new_val = {.start_time = ts, .elapsed_time = elapsed_time};  
        bpf_map_update_elem(&process_time_map, &prev_key, &new_val, BPF_ANY);  
    }  
    // the next process is about to get the CPU, so we record its start  
    // time and keep any elapsed time we already stored for it  
    struct key_t next_key = {  
        .pid = ctx->next_pid,  
    };  
    struct val_t *next_val = bpf_map_lookup_elem(&process_time_map, &next_key);  
    struct val_t next_new_val = {.start_time = ts};  
    if (next_val) {  
        next_new_val.elapsed_time = next_val->elapsed_time;  
    }  
    bpf_map_update_elem(&process_time_map, &next_key, &next_new_val, BPF_ANY);  
    return 0;  
}

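You will recognize the general coding logic, but I want to draw your attention to a few very important lines.

SEC("tracepoint/sched/sched_switch")

This macro specifies which event the function will be attached to.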

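struct val_t *val = bpf_map_lookup_elem(&process_time_map, &prev_key);

This line is how we look up data in the BPF map. We take a unique key and pass it to bpf_map_lookup_elem, which returns a pointer to a value of type val_t (which we defined earlier). If there is no value under that key, the function returns NULL. Note how we pass the BPF map itself as &process_time_map.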

bpf_map_update_elem(&process_time_map, &prev_key, &new_val, BPF_ANY);

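This line is how we add data to the BPF map. We pass our key (in this case &prev_key) and the value for that key (&new_val), and it stores the value in our BPF map. Again, note that we pass the map itself. BPF_ANY is used to either update the key to the new value or create it if it does not exist (see the documentation).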
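To make the lookup and update semantics concrete, here is a small user-space mock of a PID-keyed map. It is not the kernel implementation: `mock_lookup`, `mock_update`, `mock_map`, and the `MOCK_*` names are hypothetical, and the mock returns -1 on failure where the real helper returns a negative error code. The flag values 0, 1, and 2 match BPF_ANY, BPF_NOEXIST, and BPF_EXIST in linux/bpf.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same numeric values as BPF_ANY, BPF_NOEXIST, BPF_EXIST in <linux/bpf.h>. */
enum { MOCK_ANY = 0, MOCK_NOEXIST = 1, MOCK_EXIST = 2 };

#define MOCK_MAX_ENTRIES 4 /* hypothetical, stands in for max_entries */

struct val_t {
    uint64_t start_time;
    uint64_t elapsed_time;
};

struct mock_entry {
    int used;
    uint32_t pid;
    struct val_t val;
};

struct mock_entry mock_map[MOCK_MAX_ENTRIES];

/* Like bpf_map_lookup_elem: pointer to the stored value, or NULL if absent. */
struct val_t *mock_lookup(uint32_t pid) {
    for (int i = 0; i < MOCK_MAX_ENTRIES; i++) {
        if (mock_map[i].used && mock_map[i].pid == pid)
            return &mock_map[i].val;
    }
    return NULL;
}

/* Like bpf_map_update_elem: 0 on success, -1 on failure in this mock. */
int mock_update(uint32_t pid, const struct val_t *val, int flags) {
    assert(val != NULL);
    struct val_t *cur = mock_lookup(pid);
    if (cur) {
        if (flags == MOCK_NOEXIST)
            return -1; /* caller asked for create-only, but the key exists */
        *cur = *val;
        return 0;
    }
    if (flags == MOCK_EXIST)
        return -1; /* caller asked for update-only, but the key is missing */
    for (int i = 0; i < MOCK_MAX_ENTRIES; i++) {
        if (!mock_map[i].used) {
            mock_map[i].used = 1;
            mock_map[i].pid = pid;
            mock_map[i].val = *val;
            return 0;
        }
    }
    return -1; /* map is full */
}
```

With the "any" flag the call behaves as create-or-overwrite, which is why the eBPF function can use it both for the first sighting of a PID and for later updates.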

With that, the function is complete; however, we still need to add one last line to our code:

char _license[] SEC("license") = "Dual MIT/GPL";

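The kernel checks this license string because many eBPF helper functions are only available to GPL-compatible programs, so any software built on them needs to be GPL compatible too. Without this line, you will likely be unable to load the code into the kernel.

So our final code looks like this (I have added the C header files that need to be included):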

#include <linux/sched.h>  
#include <linux/bpf.h>  
#include <bpf/bpf_helpers.h>  
#include <bpf/bpf_tracing.h>  
#include <stddef.h>  
  
#ifndef TASK_COMM_LEN  
#define TASK_COMM_LEN 16  
#endif  
  
struct key_t {  
    __u32 pid;  
};  
  
struct val_t {  
    __u64 start_time;  
    __u64 elapsed_time;  
};  
  
struct {  
    __uint(type, BPF_MAP_TYPE_HASH);  
    __type(key, struct key_t);  
    __type(value, struct val_t);  
    __uint(max_entries, 10240);  
} process_time_map SEC(".maps");  
  
// this is the structure of the sched_switch event  
struct sched_switch_args {  
    // the common_* fields take up the first 8 bytes of the event  
    unsigned long long pad;  
    char prev_comm[TASK_COMM_LEN];  
    int prev_pid;  
    int prev_prio;  
    long prev_state;  
    char next_comm[TASK_COMM_LEN];  
    int next_pid;  
    int next_prio;  
};  
  
SEC("tracepoint/sched/sched_switch")  
int cpu_processing_time(struct sched_switch_args *ctx) {  
    // get the current time in ns  
    __u64 ts = bpf_ktime_get_ns();  
    // we need to check if the previous process is in our map  
    struct key_t prev_key = {  
        .pid = ctx->prev_pid,  
    };  
    struct val_t *val = bpf_map_lookup_elem(&process_time_map, &prev_key);  
    // if the previous PID does not exist it means that we just started  
    // watching or we missed the start somehow  
    // so we ignore it for now  
    if (val) {  
        // the previous process is leaving the CPU: calculate and store  
        // how long it ran since it was last switched in  
        __u64 elapsed_time = ts - val->start_time;  
        struct val_t new_val = {.start_time = ts, .elapsed_time = elapsed_time};  
        bpf_map_update_elem(&process_time_map, &prev_key, &new_val, BPF_ANY);  
    }  
    // the next process is about to get the CPU, so we record its start  
    // time and keep any elapsed time we already stored for it  
    struct key_t next_key = {  
        .pid = ctx->next_pid,  
    };  
    struct val_t *next_val = bpf_map_lookup_elem(&process_time_map, &next_key);  
    struct val_t next_new_val = {.start_time = ts};  
    if (next_val) {  
        next_new_val.elapsed_time = next_val->elapsed_time;  
    }  
    bpf_map_update_elem(&process_time_map, &next_key, &next_new_val, BPF_ANY);  
    return 0;  
}  
  
char _license[] SEC("license") = "Dual MIT/GPL";

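And with that, our BPF program is done. In the next article in this series, I will cover how to write the user-space program in Go, using a great tool called bpf2go to help us with the bindings.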