Linux Kernel ICMPv6 & CVE-2023-6200

1. Overview

CVE-2023-6200 is a race condition vulnerability in the Linux kernel’s ICMPv6 subsystem. Little information about it is publicly available, so I analyzed it as an exercise in learning how to identify this kind of bug.

The patch commit can be found here, but it is merely a revert. This commit is the actual patch that further enhances the previous one.

P.S. I would like to thank the reporter (@_wmliang_) for answering some of my questions, helping me understand what I missed 🙂.

2. ICMPv6

2.1 Overview

ICMPv6 is the IPv6 version of ICMP. During boot, the kernel calls the init function icmpv6_init() to register a protocol object [1] and a sender function [2]. The receive handler is icmpv6_rcv() [3], which is called whenever an ICMPv6 packet is received.

static const struct inet6_protocol icmpv6_protocol = {
    .handler    =    icmpv6_rcv, // [3]
    // [...]
};

int __init icmpv6_init(void)
{
    // [...]
    inet6_add_protocol(&icmpv6_protocol, IPPROTO_ICMPV6); // [1]
    inet6_register_icmp_sender(icmp6_send); // [2]
    // [...]
}

inet6_add_protocol() registers a protocol object (a struct inet6_protocol, here icmpv6_protocol) in the protocol table inet6_protos[].

int inet6_add_protocol(const struct inet6_protocol *prot, unsigned char protocol)
{
    return !cmpxchg((const struct inet6_protocol **)&inet6_protos[protocol],
            NULL, prot) ? 0 : -1;
}
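The cmpxchg() call means a slot can be claimed only while it is still NULL, so a second registration for the same protocol number fails. Below is a minimal userspace sketch of that idea using C11 atomics. It is not kernel code: struct proto_model, proto_table, proto_register() and proto_lookup() are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PROTOS 256  /* one slot per 8-bit next-header value */

/* Hypothetical stand-in for struct inet6_protocol. */
struct proto_model {
    int (*handler)(const void *pkt);
};

/* Static storage, so every slot starts out as NULL. */
static _Atomic(const struct proto_model *) proto_table[MAX_PROTOS];

/* Claim a slot only if it is still empty; 0 on success, -1 if already taken. */
int proto_register(const struct proto_model *prot, unsigned char protocol)
{
    const struct proto_model *expected = NULL;

    assert(prot != NULL);
    return atomic_compare_exchange_strong(&proto_table[protocol],
                                          &expected, prot) ? 0 : -1;
}

/* Index the table by protocol number, as the deliver path does. */
const struct proto_model *proto_lookup(unsigned char protocol)
{
    return atomic_load(&proto_table[protocol]);
}
```

Registering twice for protocol 58 (IPPROTO_ICMPV6) succeeds the first time and returns -1 the second time, which mirrors the `? 0 : -1` in the kernel function.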

This table is used by the function ip6_protocol_deliver_rcu(), invoked internally by the network interface callback __netif_receive_skb_core(), to determine which receive handler should be called.

void ip6_protocol_deliver_rcu(struct sk_buff *skb, int nexthdr, /*...*/)
{
    // [...]
    ipprot = rcu_dereference(inet6_protos[nexthdr]);
    ret = INDIRECT_CALL_2(ipprot->handler, tcp_v6_rcv, udpv6_rcv,
                      skb);
    // [...]
}

__netif_receive_skb_core() hashes the packet’s protocol type into an index for the packet handler array ptype_base[] and delivers the packet to the handlers in that bucket.

static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
                    struct packet_type **ppt_prev)
{
    // [...]
    type = skb->protocol;
    deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
                    &ptype_base[ntohs(type) &
                        PTYPE_HASH_MASK]);
    // [...]
}

If the packet’s type is IPv6, the corresponding packet handler will be ipv6_packet_type.

static struct packet_type ipv6_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IPV6),
    .func = ipv6_rcv,
    .list_func = ipv6_list_rcv,
};
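To make the bucket selection concrete, here is a small userspace sketch of the index computation. It assumes PTYPE_HASH_SIZE is 16, as defined in include/linux/netdevice.h in the kernel versions I am aware of; the MODEL_ prefixed macros and ptype_bucket() are hypothetical names for this sketch.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

#define MODEL_PTYPE_HASH_SIZE 16   /* assumed value of PTYPE_HASH_SIZE */
#define MODEL_PTYPE_HASH_MASK (MODEL_PTYPE_HASH_SIZE - 1)
#define MODEL_ETH_P_IPV6      0x86DD
#define MODEL_ETH_P_IP        0x0800

/* Masking only works as a modulo when the size is a power of two. */
static_assert((MODEL_PTYPE_HASH_SIZE & MODEL_PTYPE_HASH_MASK) == 0,
              "hash size must be a power of two");

/* proto_be is in network byte order, like skb->protocol. */
unsigned int ptype_bucket(uint16_t proto_be)
{
    return ntohs(proto_be) & MODEL_PTYPE_HASH_MASK;
}
```

With these assumptions, IPv6 (0x86DD) lands in bucket 13 and IPv4 (0x0800) in bucket 0, so ipv6_packet_type is found by walking one short list rather than every registered handler.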

In a nutshell, once a network device receives ICMPv6 packets, the simplified execution flow is as follows:

__netif_receive_skb_core() - netif callback
ipv6_rcv() - IPv6 packet receiver
ip6_protocol_deliver_rcu() - IPv6 protocol deliver
icmpv6_rcv() - ICMPv6 protocol receiver

Sending an ICMPv6 packet is quite easy: create a raw socket (SOCK_RAW) with family AF_INET6 and protocol IPPROTO_ICMPV6. Below is the call trace from sys_sendmsg down to transmitting an ICMPv6 packet to the network device.

rawv6_sendmsg()
ip6_append_data()
rawv6_push_pending_frames()
ip6_push_pending_frames()
ip6_send_skb()
ip6_local_out()
ip6_output()
ip6_finish_output()
__ip6_finish_output()
ip6_finish_output2()
rt6_nexthop()
neigh_output()
n->output()
neigh_resolve_output()
neigh_hh_init()
__dev_queue_xmit()

2.2 ICMPv6 Receive Handler & Allocation

The ICMPv6 receive handler, icmpv6_rcv(), handles data verification and parsing. The handler expects to receive a struct icmp6hdr object. After that, it determines which function should be called based on the ICMPv6 packet type.

static int icmpv6_rcv(struct sk_buff *skb)
{
    struct icmp6hdr *hdr;
    
    // [...]
    hdr = icmp6_hdr(skb);
    type = hdr->icmp6_type;
    
    switch (type) {
    case NDISC_ROUTER_ADVERTISEMENT:
        reason = ndisc_rcv(skb);
        break;
    // [...]
    }
}

If the type is NDISC_ROUTER_ADVERTISEMENT, a router is announcing itself on the link. icmpv6_rcv() passes the packet to ndisc_rcv(), which in turn calls ndisc_router_discovery().

enum skb_drop_reason ndisc_rcv(struct sk_buff *skb)
{
    struct nd_msg *msg;
    
    // [...]
    msg = (struct nd_msg *)skb_transport_header(skb);
    switch (msg->icmph.icmp6_type) {
    // [...]
    case NDISC_ROUTER_ADVERTISEMENT:
        reason = ndisc_router_discovery(skb);
        break;
    // [...]
    }
}

ndisc_router_discovery() first tries to find the route object (struct fib6_info) for the advertising router using the packet’s source address [1]. If no such route exists, the function calls rt6_add_dflt_router() to create a new one [2].

static enum skb_drop_reason ndisc_router_discovery(struct sk_buff *skb)
{
    struct fib6_info *rt = NULL;

    // [...]
    rt = rt6_get_dflt_router(net, &ipv6_hdr(skb)->saddr, skb->dev); // [1]

    // [...]
    if (!rt && lifetime) {
        rt = rt6_add_dflt_router(net, &ipv6_hdr(skb)->saddr, // [2]
                         skb->dev, pref, defrtr_usr_metric);
        // [...]
    }
}

When creating a route object, the function rt6_add_dflt_router() first calls ip6_route_add() [3] to create a route object and store it in the default routing table. Then, the kernel calls rt6_get_dflt_router() [4] to retrieve the newly created route.

struct fib6_info *rt6_add_dflt_router(struct net *net,
                     const struct in6_addr *gwaddr,
                     struct net_device *dev,
                     unsigned int pref,
                     u32 defrtr_usr_metric)
{
    struct fib6_config cfg = {
        .fc_table    = /*...*/ ? : RT6_TABLE_DFLT,
        .fc_flags    = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT |
                  RTF_UP | RTF_EXPIRES | RTF_PREF(pref),
        // [...]
    };

    // [...]
    ip6_route_add(&cfg, GFP_ATOMIC, NULL); // [3]
    
    // [...]
    return rt6_get_dflt_router(net, gwaddr, dev); // [4]
}

The function ip6_route_add() allocates a struct fib6_info object and initializes it using the configuration provided by the caller [5]. It then transfers ownership of the object to the routing table [6] while holding the table lock [7]. The fib6_add() function implements the table data structure and is complex; you can think of it as inserting a route object into the table [8].

int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
          struct netlink_ext_ack *extack)
{
    struct fib6_info *rt;
    int err;

    rt = ip6_route_info_create(cfg, gfp_flags, extack); // [5]
    err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, extack); // [6]

    fib6_info_release(rt);
    return err;
}

static int __ip6_ins_rt(struct fib6_info *rt, struct nl_info *info,
            struct netlink_ext_ack *extack)
{
    struct fib6_table *table;
    int err;

    table = rt->fib6_table;
    spin_lock_bh(&table->tb6_lock); // [7]
    err = fib6_add(&table->tb6_root, rt, info, extack); // [8]
    spin_unlock_bh(&table->tb6_lock);

    return err;
}

The allocation and initialization function ip6_route_info_create() adds the new object to the table’s GC list with an expiration time of 0 [9]. In other words, the kernel postpones setting the real expiration time until later.

static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
                          gfp_t gfp_flags,
                          struct netlink_ext_ack *extack)
{
    // [...]
    rt = fib6_info_alloc(gfp_flags, !nh);
    
    // [...]
    if (cfg->fc_flags & RTF_EXPIRES)
        fib6_set_expires_locked(rt, jiffies + // [9]
                    clock_t_to_jiffies(cfg->fc_expires));
}

static inline void fib6_set_expires_locked(struct fib6_info *f6i,
                       unsigned long expires)
{
    struct fib6_table *tb6;

    tb6 = f6i->fib6_table;
    f6i->expires = expires;
    if (tb6 && !fib6_has_expires(f6i))
        hlist_add_head(&f6i->gc_link, &tb6->tb6_gc_hlist);
    f6i->fib6_flags |= RTF_EXPIRES;
}

static inline bool fib6_has_expires(const struct fib6_info *f6i)
{
    return f6i->fib6_flags & RTF_EXPIRES;
}

Returning to ndisc_router_discovery(), the function fib6_set_expires() is called to configure the expiration time if the packet contains lifetime information.

static enum skb_drop_reason ndisc_router_discovery(struct sk_buff *skb)
{
    // [...]
    lifetime = ntohs(ra_msg->icmph.icmp6_rt_lifetime);

    // [...]
    if (rt)
        fib6_set_expires(rt, jiffies + (HZ * lifetime));
}


static inline void fib6_set_expires(struct fib6_info *f6i,
                    unsigned long expires)
{
    spin_lock_bh(&f6i->fib6_table->tb6_lock);
    fib6_set_expires_locked(f6i, expires);
    spin_unlock_bh(&f6i->fib6_table->tb6_lock);
}

If the kernel is compiled with CONFIG_IPV6_ROUTE_INFO and CONFIG_IPV6_ROUTER_PREF enabled, the function ndisc_router_discovery() iterates through the Route Information options in the packet and calls rt6_route_rcv() [10] for each one before it finishes.

static enum skb_drop_reason ndisc_router_discovery(struct sk_buff *skb)
{
    // [...]
    if (in6_dev->cnf.accept_ra_rtr_pref && ndopts.nd_opts_ri) {
        struct nd_opt_hdr *p;
        for (p = ndopts.nd_opts_ri;
             p;
             p = ndisc_next_option(p, ndopts.nd_opts_ri_end)) {
            struct route_info *ri = (struct route_info *)p;
            // [...]
            rt6_route_rcv(skb->dev, (u8 *)p, (p->nd_opt_len) << 3, // [10]
                      &ipv6_hdr(skb)->saddr);
        }
    }
}

The function rt6_route_rcv() is used to set extra information in the route object, including lifetime. We will review its source code in a later section.
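For reference, this is roughly what such a packet looks like on the wire. The sketch below builds a Router Advertisement followed by one Route Information option with a zero-length prefix, which is the case where rt6_route_rcv() looks up the default router. The layouts follow my reading of RFC 4861 and RFC 4191; struct ra_hdr, struct route_info_opt and build_ra() are hypothetical names for this sketch, not the kernel’s or the reporter’s code.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RA_TYPE        134   /* ICMPv6 Router Advertisement */
#define OPT_ROUTE_INFO 24    /* Route Information option */

/* Fixed part of a Router Advertisement. */
struct ra_hdr {
    uint8_t  type, code;
    uint16_t cksum;            /* left 0: filled in by the kernel for ICMPv6 raw sockets */
    uint8_t  cur_hop_limit;
    uint8_t  flags;
    uint16_t router_lifetime;  /* seconds, network byte order */
    uint32_t reachable_time;
    uint32_t retrans_timer;
};

/* Route Information option carrying a zero-length prefix. */
struct route_info_opt {
    uint8_t  type;
    uint8_t  length;           /* in units of 8 octets */
    uint8_t  prefix_len;
    uint8_t  flags;            /* route preference bits, 0 = medium */
    uint32_t lifetime;         /* seconds, network byte order */
};

static_assert(sizeof(struct ra_hdr) == 16, "unexpected padding in ra_hdr");
static_assert(sizeof(struct route_info_opt) == 8, "unexpected padding in option");

/* Writes the packet into buf; returns its length, or 0 if buf is too small. */
size_t build_ra(uint8_t *buf, size_t cap, uint16_t router_lifetime,
                uint32_t route_lifetime)
{
    struct ra_hdr ra = { .type = RA_TYPE,
                         .router_lifetime = htons(router_lifetime) };
    struct route_info_opt ri = { .type = OPT_ROUTE_INFO, .length = 1,
                                 .prefix_len = 0,
                                 .lifetime = htonl(route_lifetime) };

    assert(buf != NULL);
    if (cap < sizeof(ra) + sizeof(ri))
        return 0;
    memcpy(buf, &ra, sizeof(ra));
    memcpy(buf + sizeof(ra), &ri, sizeof(ri));
    return sizeof(ra) + sizeof(ri);
}
```

The option’s length byte is 1, so the `(p->nd_opt_len) << 3` in the loop above evaluates to 8 bytes, exactly the size of struct route_info_opt here.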

2.3 Release

The refcount of a route object is incremented by fib6_info_hold_safe() and decremented by fib6_info_release():

static inline bool fib6_info_hold_safe(struct fib6_info *f6i)
{
    return refcount_inc_not_zero(&f6i->fib6_ref);
}

static inline void fib6_info_release(struct fib6_info *f6i)
{
    if (f6i && refcount_dec_and_test(&f6i->fib6_ref))
        call_rcu(&f6i->rcu, fib6_info_destroy_rcu);
}

When the refcount of a route object reaches zero, the RCU free callback fib6_info_destroy_rcu() is invoked to release its resources.

void fib6_info_destroy_rcu(struct rcu_head *head)
{
    struct fib6_info *f6i = container_of(head, struct fib6_info, rcu);

    // release its fields
    
    // [...]
    kfree(f6i);
}
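The two helpers can be modelled in userspace with C11 atomics. This is only a sketch of the semantics as I understand them: struct route_model and both function names are hypothetical, and a flag is set where the kernel would schedule the RCU callback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct route_model {
    atomic_uint ref;
    bool destroyed;   /* set instead of freeing, so the state can be inspected */
};

/* Like refcount_inc_not_zero(): take a reference unless the count already hit 0. */
bool route_hold_safe(struct route_model *rt)
{
    unsigned int old;

    assert(rt != NULL);
    old = atomic_load(&rt->ref);
    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&rt->ref, &old, old + 1))
            return true;
    }
    return false;
}

/* Like fib6_info_release(): whoever drops the last reference triggers destruction. */
void route_release(struct route_model *rt)
{
    if (rt && atomic_fetch_sub(&rt->ref, 1) == 1)
        rt->destroyed = true;  /* the kernel defers the real free via call_rcu() */
}
```

The point of the "not zero" check is that an object whose count has already reached 0 cannot be revived by a late lookup.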

It is important to note that fib6_info_destroy_rcu() does not check whether the object is still present in the GC table. Therefore, if the kernel forgets to remove it from the table, it could lead to a UAF vulnerability.

Let’s examine how and when the kernel removes a route object from the tables. For the GC table, the removal handler is fib6_clean_expires_locked(). This function checks whether the route object has RTF_EXPIRES set, meaning it is linked in the GC table. If it is, the function unlinks it from the GC table; in either case it clears the RTF_EXPIRES flag and resets the expiration time.

static inline void fib6_clean_expires_locked(struct fib6_info *f6i)
{
    if (fib6_has_expires(f6i))
        hlist_del_init(&f6i->gc_link);
    f6i->fib6_flags &= ~RTF_EXPIRES;
    f6i->expires = 0;
}

static inline bool fib6_has_expires(const struct fib6_info *f6i)
{
    return f6i->fib6_flags & RTF_EXPIRES;
}

For the routing table, the removal handler is ip6_del_rt(), which is a wrapper around __ip6_del_rt(). The __ip6_del_rt() function first removes the target route object from the routing table and then decrements its refcount.

int ip6_del_rt(struct net *net, struct fib6_info *rt, bool skip_notify)
{
    // [...]
    return __ip6_del_rt(rt, &info);
}

static int __ip6_del_rt(struct fib6_info *rt, struct nl_info *info)
{
    struct net *net = info->nl_net;
    struct fib6_table *table;
    int err;

    table = rt->fib6_table;
    spin_lock_bh(&table->tb6_lock);
    err = fib6_del(rt, info);
    spin_unlock_bh(&table->tb6_lock);

    fib6_info_release(rt);
    return err;
}

To avoid leaving the route object in the GC table, the kernel not only unlinks it from the routing table [1] but also removes it from the GC table [2].

int fib6_del(struct fib6_info *rt, struct nl_info *info)
{
    // [...]
    for (rtp = &fn->leaf; *rtp; rtp = rtp_next) {
        struct fib6_info *cur = rcu_dereference_protected(*rtp,
                    lockdep_is_held(&table->tb6_lock));
        if (rt == cur) {
            // [...]
            fib6_del_route(table, fn, rtp, info);
            return 0;
        }
        rtp_next = &cur->fib6_next;
    }
}

static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
               struct fib6_info __rcu **rtp, struct nl_info *info)
{
    // [...]
    *rtp = rt->fib6_next; // [1]
    rt->fib6_node = NULL;

    // [...]
    fib6_purge_rt(rt, fn, net);
}

static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
              struct net *net)
{
    // [...]
    fib6_clean_expires_locked(rt); // [2]
}

2.4 Garbage collection (GC)

In fact, this vulnerability isn’t related to the GC callback, but it is still worth introducing how it works.

The GC timer is set up during the creation of a network namespace.

static int __net_init fib6_net_init(struct net *net)
{
    // [...]
    timer_setup(&net->ipv6.ip6_fib_timer, fib6_gc_timer_cb, 0);
    // [...]
}

As the timer ticks, the GC callback fib6_gc_timer_cb() is called to release the expired route objects. It iterates through the GC table, checks the age of each route object [1], and deletes [2] it from the routing table if it has expired.

static void fib6_gc_table(struct net *net,
              struct fib6_table *tb6,
              struct fib6_gc_args *gc_args)
{
    struct fib6_info *rt;
    struct hlist_node *n;
    
    // [...]
    hlist_for_each_entry_safe(rt, n, &tb6->tb6_gc_hlist, gc_link)
        if (fib6_age(rt, gc_args) == -1) // [1]
            fib6_del(rt, &info); // [2]
}

The function fib6_age() compares the current time with the expiration time of the route object. If the route object has expired, the function returns -1, which triggers fib6_del().
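The comparison has to survive the jiffies counter wrapping around, which the kernel handles with time_after(). Here is a simplified userspace model of that check. It ignores everything else fib6_age() does, and MODEL_RTF_EXPIRES, struct aged_route, tick_after() and route_age() are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RTF_EXPIRES 0x1u  /* hypothetical flag bit for this sketch */

struct aged_route {
    unsigned int  flags;
    unsigned long expires;      /* tick at which the route expires */
};

/*
 * Same idea as the kernel's time_after(a, b): true when a is later than b,
 * even if the counter wrapped between the two. Relies on the usual
 * two's-complement conversion, as the kernel does.
 */
bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Returns -1 when the route has expired and should be deleted, 0 otherwise. */
int route_age(const struct aged_route *rt, unsigned long now)
{
    assert(rt != NULL);
    if ((rt->flags & MODEL_RTF_EXPIRES) && rt->expires &&
        tick_after(now, rt->expires))
        return -1;
    return 0;
}
```

A route without the expiry flag is never aged out by this check, no matter how large `now` becomes.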

2.5 Restriction

Although you can simply send an ICMPv6 packet to trigger the receive handler icmpv6_rcv(), the packet must meet certain constraints to reach deeper functions. ndisc_rcv() checks whether the packet’s hop limit is equal to 255 and drops the packet otherwise.

enum skb_drop_reason ndisc_rcv(struct sk_buff *skb)
{
    // [...]    
    if (ipv6_hdr(skb)->hop_limit != 255) {
        return SKB_DROP_REASON_IPV6_NDISC_HOP_LIMIT;
    }
}

The hop limit is set in the function __ip6_make_skb(), and the v6_cork->hop_limit is set in the function ip6_setup_cork().

struct sk_buff *__ip6_make_skb(struct sk_buff_head *queue,
                   struct inet6_cork *v6_cork, /* ... */)
{
    struct ipv6hdr *hdr;

    skb = __skb_dequeue(queue);
    // [...]
    hdr = ipv6_hdr(skb);
    hdr->hop_limit = v6_cork->hop_limit;
}

static int ip6_setup_cork(struct inet6_cork *v6_cork, struct ipcm6_cookie *ipc6, /*...*/)
{
    // [...]
    v6_cork->hop_limit = ipc6->hlimit;
}

The function ip6_setup_cork() is called by ip6_append_data() [1], and the cookie object ipc6 is passed in by the caller, rawv6_sendmsg() [2]. The socket’s hop limit has a default value, but we can override it per message via control messages [3].

int ip6_append_data(struct sock *sk, struct ipcm6_cookie *ipc6, /*...*/)
{
    struct inet_sock *inet = inet_sk(sk);
    struct ipv6_pinfo *np = inet6_sk(sk);
    // [...]
    err = ip6_setup_cork(sk, &inet->cork, &np->cork, // [1]
                     ipc6, rt);
}

static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
    struct ipcm6_cookie ipc6;
    struct ipv6_pinfo *np = inet6_sk(sk);

    ipcm6_init(&ipc6); // hlimit = -1

    if (msg->msg_controllen) {
        // [...]
        err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, &ipc6); // [3]
    }

    if (ipc6.hlimit < 0)
        ipc6.hlimit = ip6_sk_dst_hoplimit(np, &fl6, dst);

    // [...]
    err = ip6_append_data(sk, raw6_getfrag, &rfv,
            len, 0, &ipc6, &fl6, (struct rt6_info *)dst, // [2]
            msg->msg_flags);
}

The function ip6_datagram_send_ctl() is used to handle control messages. By providing a control message with the level SOL_IPV6 and type IPV6_HOPLIMIT, we can set the hlimit of the cookie object to 255 [4].

int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
              struct msghdr *msg, struct flowi6 *fl6,
              struct ipcm6_cookie *ipc6)
{
    // [...]
    for_each_cmsghdr(cmsg, msg) {
        // [...]
        if (cmsg->cmsg_level != SOL_IPV6)
            continue;
        switch (cmsg->cmsg_type) {
        case IPV6_HOPLIMIT:
            // [...]
            ipc6->hlimit = *(int *)CMSG_DATA(cmsg); // [4]
        }
    }

}

Now we can satisfy the hop limit requirement of ndisc_rcv().
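On the userspace side, this means attaching an IPV6_HOPLIMIT control message to the msghdr passed to sendmsg(). The helper below is a minimal sketch of that step under the assumption of a Linux/glibc environment; set_hop_limit_cmsg() is a hypothetical name, and this is not the author’s or the reporter’s PoC.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Attach an IPV6_HOPLIMIT control message to msg, using cbuf as storage.
 * cbuf must be suitably aligned for struct cmsghdr and hold at least
 * CMSG_SPACE(sizeof(int)) bytes. Returns 0 on success, -1 if it is too small.
 */
int set_hop_limit_cmsg(struct msghdr *msg, void *cbuf, size_t cbuf_len,
                       int hop_limit)
{
    struct cmsghdr *cmsg;

    assert(msg != NULL && cbuf != NULL);
    if (cbuf_len < CMSG_SPACE(sizeof(int)))
        return -1;

    memset(cbuf, 0, cbuf_len);
    msg->msg_control = cbuf;
    msg->msg_controllen = CMSG_SPACE(sizeof(int));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = IPPROTO_IPV6;   /* same value as SOL_IPV6 on Linux */
    cmsg->cmsg_type = IPV6_HOPLIMIT;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &hop_limit, sizeof(int));
    return 0;
}
```

In a full sender, msg would also carry the destination address and the ICMPv6 payload, and would be passed to sendmsg() on a socket created with socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6). Calling the helper with hop_limit set to 255 produces the value that ip6_datagram_send_ctl() reads at [4].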

For ndisc_router_discovery(), the packet’s options must parse successfully [5], and the packet’s source address must not belong to the receiving device [6] unless accept_ra_from_local is enabled.

static enum skb_drop_reason ndisc_router_discovery(struct sk_buff *skb)
{
    // [...]
    in6_dev = __in6_dev_get(skb->dev);

    // [...]
    if (!ndisc_parse_options(skb->dev, opt, optlen, &ndopts)) // [5]
        return SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS;

    // [...]
    if (!in6_dev->cnf.accept_ra_from_local &&
        ipv6_chk_addr(net, &ipv6_hdr(skb)->saddr, in6_dev->dev, 0)) { // [6]
        goto skip_defrtr;
    }

    // [...]
}

This can be easily satisfied by creating two or more devices on the same network. I used the following script to set up the environment.

#!/bin/sh
ip link add veth1 type veth peer name veth2
ip link set veth1 up
ip link set veth2 up
ip -6 addr add 2001:db8::1/64 dev veth1
ip -6 addr add 2001:db8::2/64 dev veth2

Creating network devices requires privileges, so you may first need to enter new user and network namespaces.

unshare -r -n -m /bin/bash

3. CVE-2023-6200

At first glance, I thought it was just a missing-lock race condition: either the GC callback iterates through the table without a lock, or the receive handler does not hold the lock before updating the GC table.

However, both the GC callback and the receive handler take the lock properly. I couldn’t see where the race was, so I spent a lot of time trying to understand how the kernel handles ICMPv6 packets and updates the refcount of a route object.

Unfortunately, I got stuck trying to determine when ip6_del_rt() is called and by whom, even after reviewing the diagram provided in the Red Hat report.

Finally, I asked the reporter Lucas (@_wmliang_) for some hints, and he provided me with a complete race diagram. The flow below is based on his diagram, with additional information I’ve added:

The function rt6_route_rcv() can be used to set the expiration time of a route object. It first attempts to retrieve the target route object by calling rt6_get_dflt_router() [1]. If the route object doesn’t exist, it then calls rt6_add_route_info() [2] to create a new one. Due to lazy expiration updates, there is a race window between when the route object becomes accessible and when it is added to the GC table.

int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
          const struct in6_addr *gwaddr)
{
    struct route_info *rinfo = (struct route_info *) opt;
    struct fib6_info *rt;

    lifetime = addrconf_timeout_fixup(ntohl(rinfo->lifetime), HZ);

    // [...]
    if (rinfo->prefix_len == 0)
        rt = rt6_get_dflt_router(net, gwaddr, dev); // [1]

    // [...]
    if (!rt && lifetime)
        rt = rt6_add_route_info(net, prefix, rinfo->prefix_len, gwaddr, // [2]
                    dev, pref);

    // -------------- the race window --------------

    if (rt) {
        if (!addrconf_finite_timeout(lifetime))
            fib6_clean_expires(rt);
        else
            fib6_set_expires(rt, jiffies + HZ * lifetime);

        fib6_info_release(rt);
    }
}

There is no difference between the default router table [3] and the route info table [4]: with the definitions below, both are aliases of the main table.

#define RT6_TABLE_MAIN     RT_TABLE_MAIN
#define RT6_TABLE_DFLT     RT6_TABLE_MAIN // [3]
#define RT6_TABLE_INFO     RT6_TABLE_MAIN // [4]

If another thread calls ip6_del_rt() during this race window, the route object is removed from both the routing table and the GC table. The later fib6_set_expires() call then adds the route object back to the GC table, leaving it linked there even though it is no longer in the routing table.

The route object destructor, fib6_info_destroy_rcu(), assumes the object is not in the GC table, so it only releases the object’s fields and the object itself. As a result, the freed route object may still be linked in the GC table.
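The interleaving can be replayed deterministically in a single-threaded userspace model. The sketch below is my own simplification, not kernel code and not the reporter’s diagram: all names are hypothetical, the GC table is a plain singly linked list, locking is omitted because each step stands for a fully locked critical section, and "freeing" just sets a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RTF_EXPIRES 0x1u   /* hypothetical flag bit for this sketch */

struct route_model {
    unsigned int flags;
    unsigned long expires;
    int ref;
    bool in_table;
    bool freed;                  /* set instead of calling free() */
    struct route_model *gc_next;
};

struct table_model {
    struct route_model *gc_head; /* models tb6_gc_hlist */
};

/* Models fib6_set_expires_locked(): link into the GC list if not flagged yet. */
void set_expires(struct table_model *tb, struct route_model *rt,
                 unsigned long expires)
{
    rt->expires = expires;
    if (!(rt->flags & MODEL_RTF_EXPIRES)) {
        rt->gc_next = tb->gc_head;
        tb->gc_head = rt;
    }
    rt->flags |= MODEL_RTF_EXPIRES;
}

/* Models fib6_clean_expires_locked(): unlink from the GC list if flagged. */
void clean_expires(struct table_model *tb, struct route_model *rt)
{
    if (rt->flags & MODEL_RTF_EXPIRES) {
        struct route_model **pp = &tb->gc_head;

        while (*pp && *pp != rt)
            pp = &(*pp)->gc_next;
        if (*pp)
            *pp = rt->gc_next;
        rt->gc_next = NULL;
    }
    rt->flags &= ~MODEL_RTF_EXPIRES;
    rt->expires = 0;
}

/* Models fib6_info_release(): the destructor never looks at the GC list. */
void release(struct route_model *rt)
{
    assert(rt->ref > 0);
    if (--rt->ref == 0)
        rt->freed = true;
}

/* Models ip6_del_rt(): remove from both tables, drop the table's reference. */
void del_route(struct table_model *tb, struct route_model *rt)
{
    if (!rt->in_table)
        return;
    rt->in_table = false;
    clean_expires(tb, rt);
    release(rt);
}

/* Replays the race; returns true if a freed route is still on the GC list. */
bool replay_race(void)
{
    struct table_model tb = { NULL };
    struct route_model rt = { .ref = 2 };  /* refs: routing table + receive path */

    /* Thread A: route created with RTF_EXPIRES and inserted; expiry still 0. */
    set_expires(&tb, &rt, 0);
    rt.in_table = true;

    /* Thread B: deletes the route inside the race window. */
    del_route(&tb, &rt);           /* unlinked from both tables, ref 2 -> 1 */

    /* Thread A: resumes, sets the real expiry, then drops its reference. */
    set_expires(&tb, &rt, 1000);   /* flag was cleared, so rt is linked again */
    release(&rt);                  /* ref 1 -> 0: "freed" */

    return rt.freed && tb.gc_head == &rt;
}
```

In this model replay_race() returns true: the GC list head still points at an object whose refcount has dropped to zero, which corresponds to the dangling GC entry described above.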

4. Conclusion

I didn’t fully reproduce the bug because I had already spent too much time on it, but I learned a lot along the way and found the attempt very interesting. Even if locking and unlocking are handled properly when accessing shared objects, race condition vulnerabilities can still arise at the design level.

Here is the partial POC code to trigger the icmpv6_rcv() functionality. Thank you for reading.