So far, two vulnerabilities exploited in kernelCTF have root causes in the kernel's internal memory subsystem implementation. I have reviewed both of them and will share my analysis in this post.

Note: These two CVEs have already had public reports from the authors, and their analyses might be clearer than mine.

1. CVE-2024-50066

This CVE is a race condition between remapping and memory advising. The patch commit is here. Because the Project Zero issue already describes this vulnerability in detail, I only record my code tracing in this post.

1.1. sys_mremap

The mremap system call is used to remap a virtual memory address. The syscall entry __do_sys_mremap() initially acquires a write lock (mm->mmap_lock) at the start [1].

SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
        unsigned long, new_len, unsigned long, flags,
        unsigned long, new_addr)
{
    // [...]
    if (mmap_write_lock_killable(current->mm)) // [1]
        return -EINTR;

    // [...]
    if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP)) {
        ret = mremap_to(addr, old_len, new_addr, new_len, // <-----------
                &locked, flags, &uf, &uf_unmap_early,
                &uf_unmap);
        goto out;
    }
    // [...]
}

static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
        unsigned long new_addr, unsigned long new_len, bool *locked,
        unsigned long flags, struct vm_userfaultfd_ctx *uf,
        struct list_head *uf_unmap_early,
        struct list_head *uf_unmap)
{
    // [...]
    vma = vma_to_resize(addr, old_len, new_len, flags);
    
    // [...]
    ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, flags, uf, // <-----------
               uf_unmap);
    // [...]
}

static unsigned long move_vma(struct vm_area_struct *vma,
        unsigned long old_addr, unsigned long old_len,
        unsigned long new_len, unsigned long new_addr,
        bool *locked, unsigned long flags,
        struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
{
    // [...]
    moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len, // <-----------
                     need_rmap_locks, false);
    // [...]
}

To speed up remapping, the move_page_tables() function attempts to move higher-level page table entries instead of individual PTEs. The hierarchy of page tables is as follows:

PGD (--> P4D) --> PUD --> PMD --> PTE
(39)             (30)    (21)    (12)

Note: The kernelCTF environment is using 4-level page tables, as CONFIG_X86_5LEVEL config is disabled.
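
As a small illustration of the shifts listed above, the following user-space sketch splits a virtual address into its per-level indices. It assumes 4-level paging with 9 index bits per level and a 12-bit page offset. The struct and function names are made up for this post and are not kernel APIs.

```c
#include <stdint.h>

/* Shifts for x86-64 4-level paging, matching the diagram above. */
#define PGD_SHIFT 39
#define PUD_SHIFT 30
#define PMD_SHIFT 21
#define PTE_SHIFT 12
#define IDX_MASK  0x1ffULL /* 9 bits, i.e. 512 entries per table */

/* Hypothetical helper type for illustration, not a kernel structure. */
struct pt_index {
    unsigned int pgd, pud, pmd, pte;
    unsigned int offset; /* byte offset inside the 4 KiB page */
};

/* Split a virtual address into the index used at each page table level. */
static struct pt_index split_vaddr(uint64_t va)
{
    struct pt_index idx;

    idx.pgd = (unsigned int)((va >> PGD_SHIFT) & IDX_MASK);
    idx.pud = (unsigned int)((va >> PUD_SHIFT) & IDX_MASK);
    idx.pmd = (unsigned int)((va >> PMD_SHIFT) & IDX_MASK);
    idx.pte = (unsigned int)((va >> PTE_SHIFT) & IDX_MASK);
    idx.offset = (unsigned int)(va & ((1ULL << PTE_SHIFT) - 1));
    return idx;
}
```

One PMD entry therefore covers 1 << 21 bytes (2 MiB) and one PUD entry covers 1 << 30 bytes (1 GiB), which are the two sizes checked by move_page_tables().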

First, it checks whether the remapping corresponds to a PUD-sized region [2]. If not, it proceeds to check whether it is a PMD-sized remapping [3].

unsigned long move_page_tables(struct vm_area_struct *vma,
        unsigned long old_addr, struct vm_area_struct *new_vma,
        unsigned long new_addr, unsigned long len,
        bool need_rmap_locks, bool for_stack)
{
    // [...]
    for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
        // handle PUD case
        // [...]
        else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE /* 1 << 30 */) {  // [2]
            if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
                       old_pud, new_pud, true))
                continue;
        }
        
        // handle PMD case
        extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
        old_pmd = get_old_pmd(vma->vm_mm, old_addr);
        new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);

        // [...]
        else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
               extent == PMD_SIZE /* 1 << 21 */) { // [3]
            // [...]
            if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
                       old_pmd, new_pmd, true))
                continue;
        }
        // [...]
    }
}
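
To see when the extent == PMD_SIZE condition can hold, here is a simplified user-space model of the extent calculation for the PMD case. It follows my reading of get_extent() and is only a sketch: the function name pmd_extent() is my own, and the model ignores the other entry types.

```c
#include <stdint.h>

#define PMD_SIZE (1ULL << 21)
#define PMD_MASK (~(PMD_SIZE - 1))

/*
 * Model of the per-iteration extent: the distance to the next PMD boundary
 * on the old side, capped by the remaining length and by the distance to
 * the next PMD boundary on the new side.
 */
static uint64_t pmd_extent(uint64_t old_addr, uint64_t old_end,
                           uint64_t new_addr)
{
    uint64_t next, extent;

    next = (old_addr + PMD_SIZE) & PMD_MASK;
    extent = next - old_addr;
    if (extent > old_end - old_addr)
        extent = old_end - old_addr;

    next = (new_addr + PMD_SIZE) & PMD_MASK;
    if (extent > next - new_addr)
        extent = next - new_addr;

    return extent;
}
```

In this model the result equals PMD_SIZE only when both addresses are 2 MiB aligned and at least 2 MiB remain, which is the situation in which the PMD entry itself can be moved.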

The move_pgt_entry() function is called if the kernel decides to move the PUD or PMD page table entry. Within this function, move_normal_pmd() is called [4] while holding the reverse mapping (rmap) lock [5].

static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
            unsigned long old_addr, unsigned long new_addr,
            void *old_entry, void *new_entry, bool need_rmap_locks)
{
    if (need_rmap_locks)
        take_rmap_locks(vma); // [5]

    switch (entry) {
    case NORMAL_PMD:
        moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, // [4]
                    new_entry);
        break;
    // [...]
    }
}

If the memory is mapped from a file, the rmap lock used will be mapping->i_mmap_rwsem.

static void take_rmap_locks(struct vm_area_struct *vma)
{
    if (vma->vm_file)
        i_mmap_lock_write(vma->vm_file->f_mapping); // <------------
    // [...]
}

static inline void i_mmap_lock_write(struct address_space *mapping)
{
    down_write(&mapping->i_mmap_rwsem); // <------------
}

The move_normal_pmd() function performs several operations in sequence:

  1. It takes the spin locks for both the old and new PMDs [6].
  2. Then, it copies the old PMD into the new one and clears the old one [7].
  3. Finally, it releases the locks before returning [8].

static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
          unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
{
    spinlock_t *old_ptl, *new_ptl;
    struct mm_struct *mm = vma->vm_mm;
    pmd_t pmd;

    // [...]
    // [6]
    old_ptl = pmd_lock(vma->vm_mm, old_pmd);
    new_ptl = pmd_lockptr(mm, new_pmd);
    if (new_ptl != old_ptl)
        spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

    // [7]
    pmd = *old_pmd;
    pmd_clear(old_pmd);
    pmd_populate(mm, new_pmd, pmd_pgtable(pmd));

    // [8]
    if (new_ptl != old_ptl)
        spin_unlock(new_ptl);
    spin_unlock(old_ptl);

    return true;
}

static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
    spinlock_t *ptl = pmd_lockptr(mm, pmd);
    spin_lock(ptl);
    return ptl;
}

1.2. madvise(MADV_COLLAPSE)

The madvise system call gives the kernel advice about how a memory range will be used. Among the available advice values, MADV_COLLAPSE is a recent addition, introduced in Linux 6.1. The do_madvise() function acquires either the read or the write lock of mm depending on the given advice. For instance, the read lock is taken [1] when the advice is MADV_COLLAPSE [2]. Subsequently, madvise_walk_vmas() is called with the iteration callback madvise_vma_behavior() [3].

SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
    return do_madvise(current->mm, start, len_in, behavior); // <------------
}

int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
    // [...]
    write = madvise_need_mmap_write(behavior);
    if (write) {
        if (mmap_write_lock_killable(mm))
            return -EINTR;
    } else {
        mmap_read_lock(mm); // [1]
    }
    
    // [...]
    error = madvise_walk_vmas(mm, start, end, behavior, // [3]
            madvise_vma_behavior);
    // [...]
}

static int madvise_need_mmap_write(int behavior)
{
    switch (behavior) {
    // [...]
    case MADV_COLLAPSE: // [2]
        return 0;
    // [...]
    }
}

The madvise_walk_vmas() function iterates all virtual memory areas (VMAs) until the end address is reached [4]. For each vma object, it invokes the iteration callback provided as the visit parameter [5].

static
int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
              unsigned long end, unsigned long arg,
              int (*visit)(struct vm_area_struct *vma,
                   struct vm_area_struct **prev, unsigned long start,
                   unsigned long end, unsigned long arg))
{
    vma = find_vma_prev(mm, start, &prev);
    if (vma && start > vma->vm_start)
        prev = vma;

    for (;;) {
        // [...]
        error = visit(vma, &prev, start, tmp, arg); // [5]
        // [...]
        if (start >= end) // [4]
            break;
        if (prev)
            vma = find_vma(mm, prev->vm_end);
    }
}
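
The walking logic can be modelled in user space with a sorted array of non-overlapping ranges. This is only a sketch of the idea: struct range, walk_ranges() and record_visit() are names I invented, and the real function also tracks prev and reports unmapped holes, which the model simply skips.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: covers [start, end). */
struct range {
    unsigned long start, end;
};

typedef int (*visit_fn)(const struct range *vma, unsigned long start,
                        unsigned long end, void *arg);

/*
 * Call visit() once per range that intersects [start, end), clamping the
 * visited sub-range to the requested window. Returns the number of visits,
 * or the first non-zero value returned by visit().
 */
static int walk_ranges(const struct range *vmas, size_t n,
                       unsigned long start, unsigned long end,
                       visit_fn visit, void *arg)
{
    size_t i = 0;
    int visited = 0;

    /* Like find_vma(): first range whose end lies above start. */
    while (i < n && vmas[i].end <= start)
        i++;

    for (; i < n && start < end; i++) {
        unsigned long tmp;
        int err;

        if (start < vmas[i].start)
            start = vmas[i].start; /* skip a hole between ranges */
        if (start >= end)
            break;

        tmp = vmas[i].end < end ? vmas[i].end : end;
        err = visit(&vmas[i], start, tmp, arg);
        if (err)
            return err;
        visited++;
        start = tmp;
    }
    return visited;
}

/* Example callback that records every visited sub-range. */
struct visit_log {
    unsigned long start[8], end[8];
    int n;
};

static int record_visit(const struct range *vma, unsigned long start,
                        unsigned long end, void *arg)
{
    struct visit_log *log = arg;

    (void)vma;
    if (log->n >= 8)
        return -1;
    log->start[log->n] = start;
    log->end[log->n] = end;
    log->n++;
    return 0;
}
```

Walking 0x2000 to 0x5000 over the ranges [0x1000, 0x3000) and [0x3000, 0x6000) visits two sub-ranges, one per range, each clamped to the requested window.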

The madvise_vma_behavior() function contains a large switch-case that dispatches the advice (referred to as the behavior variable in the source code) to the corresponding handler. If the advice is MADV_COLLAPSE, the madvise_collapse() function [6] is invoked to handle it.

static int madvise_vma_behavior(struct vm_area_struct *vma,
                struct vm_area_struct **prev,
                unsigned long start, unsigned long end,
                unsigned long behavior)
{
    switch (behavior) {
    case MADV_REMOVE:
        return madvise_remove(vma, prev, start, end);
    // [...]
    case MADV_COLLAPSE:
        return madvise_collapse(vma, prev, start, end); // [6]
    // [...]
    }
}

Before diving into the madvise_collapse() function, it’s important to understand Transparent Huge Pages (THP). The official documentation provides a detailed explanation of it. In brief, THP is a feature that allows processes to dynamically use huge pages for memory regions without requiring prior reservation of 2MB pages. If allocating a 2MB page is not currently possible, the kernel simply falls back to regular pages, leaving the memory region unaffected.

The madvise_collapse() function is responsible for collapsing a given virtual memory range to THPs. If the target memory is mapped from a file, it releases the read lock and invokes the hpage_collapse_scan_file() function [7]. The hpage_collapse_scan_file() will return SCAN_PTE_MAPPED_HUGEPAGE, and collapse_pte_mapped_thp() is called [8] to collapse a pte-mapped THP.

int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
             unsigned long start, unsigned long end)
{
    // [...]
    hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
    hend = end & HPAGE_PMD_MASK;

    // [...]
    for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE /* 1 << PMD_SHIFT (21) */) {
        // [...]
        if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
            struct file *file = get_file(vma->vm_file);
            pgoff_t pgoff = linear_page_index(vma, addr);

            mmap_read_unlock(mm); // read lock is held in `do_madvise()`
            mmap_locked = false;
            result = hpage_collapse_scan_file(mm, addr, file, pgoff, // [7]
                              cc);
            fput(file);
        } 

        switch (result) {
        // [...]
        case SCAN_PTE_MAPPED_HUGEPAGE:
            // [...]
            mmap_read_lock(mm);
            result = collapse_pte_mapped_thp(mm, addr, true); // [8]
            mmap_read_unlock(mm);
            goto handle_result;
        }
        // [...]
    }
    // [...]
}
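
The hstart and hend computation above rounds the requested range inward to 2MB boundaries. The following user-space sketch reproduces that arithmetic, assuming HPAGE_PMD_SHIFT is 21 as in the loop comment. The helper names are mine.

```c
#include <stdint.h>

#define HPAGE_PMD_SHIFT 21
#define HPAGE_PMD_SIZE  (1ULL << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK  (~(HPAGE_PMD_SIZE - 1))

/* Hypothetical helper type holding the rounded range. */
struct hrange {
    uint64_t hstart, hend;
};

/* Round start up and end down to a huge page boundary. */
static struct hrange collapse_range(uint64_t start, uint64_t end)
{
    struct hrange r;

    r.hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
    r.hend = end & HPAGE_PMD_MASK;
    return r;
}

/* Number of 2MB chunks the for loop would iterate over. */
static unsigned int collapse_iterations(uint64_t start, uint64_t end)
{
    struct hrange r = collapse_range(start, end);
    unsigned int n = 0;
    uint64_t addr;

    for (addr = r.hstart; addr < r.hend; addr += HPAGE_PMD_SIZE)
        n++;
    return n;
}
```

A range that does not fully contain an aligned 2MB chunk yields hstart >= hend in this model, so the loop body never runs for it.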

I am not sure about the full functionality of the hpage_collapse_scan_file() function. It performs some checks within a foreach block, and if all of them pass, it finally calls the collapse_file() function [9].

static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
                    struct file *file, pgoff_t start,
                    struct collapse_control *cc)
{
    struct page *page = NULL;
    struct address_space *mapping = file->f_mapping;
    XA_STATE(xas, &mapping->i_pages, start);
    int result = SCAN_SUCCEED;
    // [...]
    
    xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
        // [...]
        node = page_to_nid(page);
        // [...]
        cc->node_load[node]++;
        // [...]
        present++;
    }

    if (result == SCAN_SUCCEED) {
        // [...]
        else {
            result = collapse_file(mm, addr, file, start, cc); // [9]
        }
    }
}

The collapse_file() function collapses filemap/tmpfs/shmem pages into a huge page. Since it is a highly complex function, we will omit most of the details here. In short, it allocates a new folio object (new_folio), which represents the 512 (1 << HPAGE_PMD_ORDER) contiguous pages of a THP [10]. Next, it raises the folio’s refcount to 512 [11] and stores the new folio into the mapping->i_pages xarray [12]. The retract_page_tables() function [13] is then called to remove or unmap the page table entries. Finally, collapse_file() iterates through all the old pages and releases them [14].

static int collapse_file(struct mm_struct *mm, unsigned long addr,
             struct file *file, pgoff_t start,
             struct collapse_control *cc)
{
    struct address_space *mapping = file->f_mapping;
    struct folio *folio, *new_folio;
    XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);

    // [...]
    result = alloc_charge_folio(&new_folio, mm, cc); // [10]
    new_folio->index = start;
    new_folio->mapping = mapping;

    // [...]
    folio_ref_add(new_folio, HPAGE_PMD_NR - 1); // [11], HPAGE_PMD_NR == 512
    
    // [...]

    // [12] set entries start ~ start + 511 (2 ^ HPAGE_PMD_ORDER entries) to point to new_folio
    xas_set_order(&xas, start, HPAGE_PMD_ORDER);
    xas_store(&xas, new_folio);

    // [...]
    retract_page_tables(mapping, start); // [13]
    result = SCAN_PTE_MAPPED_HUGEPAGE;

    // [...]
    list_for_each_entry_safe(page, tmp, &pagelist, lru) {
        list_del(&page->lru);
        page->mapping = NULL;
        // [...]
        folio_put_refs(page_folio(page), 3); // [14]
    }

    goto out;
    // [...]
}

The retract_page_tables() function iterates over the VMAs that map the file and handles the removal of their PTEs. It first acquires the read lock of the mapping object [15] and then calls pmdp_collapse_flush() to clear the PMD entry and flush it from the TLB [16].

static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
    struct vm_area_struct *vma;

    i_mmap_lock_read(mapping); // [15]
    vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
        // [...]
        if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
            continue;
        // [...]
        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd); // [16]
        // [...]
    }
    i_mmap_unlock_read(mapping);
}

static inline void i_mmap_lock_read(struct address_space *mapping)
{
    down_read(&mapping->i_mmap_rwsem);
}

The functions down_write() and down_read() are APIs for the reader-writer semaphore, which operate based on the following rules:

  1. Readers and writers cannot hold the lock simultaneously.
  2. Multiple readers can hold the (read) lock concurrently.
  3. Only one writer can hold the (write) lock at any given time.
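
These rules can be observed in user space with a POSIX read-write lock. This is only an analogy: pthread_rwlock_t is not the kernel's rw_semaphore, and rwlock_rules_hold() is a name I made up. The sketch uses the try variants so that a refused acquisition returns EBUSY instead of blocking.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Returns 1 if the lock behaves according to the three rules, else 0. */
static int rwlock_rules_hold(void)
{
    pthread_rwlock_t lock;
    int ok = 1;

    if (pthread_rwlock_init(&lock, NULL) != 0)
        return 0;

    /* Rule 2: a second read lock succeeds while one is already held. */
    ok &= (pthread_rwlock_rdlock(&lock) == 0);
    ok &= (pthread_rwlock_tryrdlock(&lock) == 0);

    /* Rule 1: a writer is refused while readers hold the lock. */
    ok &= (pthread_rwlock_trywrlock(&lock) == EBUSY);

    pthread_rwlock_unlock(&lock);
    pthread_rwlock_unlock(&lock);

    /* Rules 1 and 3: with a writer inside, nobody else gets in. */
    ok &= (pthread_rwlock_wrlock(&lock) == 0);
    ok &= (pthread_rwlock_trywrlock(&lock) == EBUSY);
    ok &= (pthread_rwlock_tryrdlock(&lock) == EBUSY);
    pthread_rwlock_unlock(&lock);

    pthread_rwlock_destroy(&lock);
    return ok;
}
```

Note that the lock only orders the two critical sections. It says nothing about whether the state observed before taking the lock is still valid afterwards, which is what matters in the next section.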

1.3. The Problem

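Both retract_page_tables() and move_pgt_entry() acquire the i_mmap_rwsem lock of the same mapping object, so they cannot run at the same time. The lock only serializes them, however: move_pgt_entry() does not check whether the target PMD is still present once it holds the lock, so it can operate on a PMD that retract_page_tables() has already cleared.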

The following diagram is adapted from the original report; I have simplified it and added some comments.

process A                           process B
=========                           =========
                                    retract_page_tables
                                      i_mmap_lock_read(mapping) <---- reader lock
                                      pmdp_collapse_flush
                                        clear PMD
                                      i_mmap_unlock_read(mapping)
move_pgt_entry(NORMAL_PMD, ...)
  take_rmap_locks
    i_mmap_lock_write(vma->vm_file->f_mapping) <---- writer lock
  move_normal_pmd
    get PMD
  drop_rmap_locks

1.4. Patch

The fix makes move_normal_pmd(), which is called from move_pgt_entry(), verify that the PMD is still present. If it is not, the function takes no action and reports failure.

diff --git a/mm/mremap.c b/mm/mremap.c
index 24712f8dbb6b5c..dda09e957a5d4c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -238,6 +238,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
// [...]
     pmd = *old_pmd;
+    if (unlikely(!pmd_present(pmd) || pmd_leaf(pmd)))
+        goto out_unlock;
// [...]
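
The effect of the added check can be shown with a toy model in which a PMD is just an integer. Everything here is hypothetical: pmd_model_t, the flag values, and the function names exist only for this sketch, and locking is left out because the race is modelled as a fixed order of calls.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pmd_model_t;

#define PMD_MODEL_NONE    0ULL
#define PMD_MODEL_PRESENT 0x1ULL  /* model flag: entry points to a PTE table */
#define PMD_MODEL_LEAF    0x80ULL /* model flag: entry maps a huge page */

/* What retract_page_tables() does to the entry in this model. */
static void retract_model(pmd_model_t *pmd)
{
    *pmd = PMD_MODEL_NONE;
}

/* Before the patch: move whatever is in old_pmd, without looking at it. */
static bool move_pmd_unpatched(pmd_model_t *old_pmd, pmd_model_t *new_pmd)
{
    pmd_model_t pmd = *old_pmd;

    *old_pmd = PMD_MODEL_NONE;
    *new_pmd = pmd;
    return true;
}

/* After the patch: bail out if the entry is gone or is a huge mapping. */
static bool move_pmd_patched(pmd_model_t *old_pmd, pmd_model_t *new_pmd)
{
    pmd_model_t pmd = *old_pmd;

    if (!(pmd & PMD_MODEL_PRESENT) || (pmd & PMD_MODEL_LEAF))
        return false;

    *old_pmd = PMD_MODEL_NONE;
    *new_pmd = pmd;
    return true;
}
```

If retract_model() runs first, the unpatched variant still reports success after copying an entry that no longer refers to a page table, while the patched variant returns false and leaves both entries untouched.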

2. CVE-2023-3269

This CVE is a use-after-free (UAF) during stack expansion, and the patch commit is here. The vulnerability was named StackRot by its author, who has also published very detailed information and a PoC in the GitHub repo. I highly recommend reading the author's report for more details.

2.1. Stack

If a sys_mmap call includes the flag MAP_GROWSDOWN, it indicates the memory region is being used as a stack. A stack has a special property: it automatically extends when its bottom is reached.
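
A minimal user-space sketch of creating such a region is shown below. The helper name map_growsdown() is my own. The flags are the standard mmap(2) ones, and _GNU_SOURCE is defined so that glibc exposes MAP_GROWSDOWN.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory that the kernel treats as a
 * grows-down (stack-like) region. Returns MAP_FAILED on error.
 */
static void *map_growsdown(size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
}
```

Touching the page just below the returned address would normally fault and enter the expansion path described next. I left that access out of the sketch because whether it succeeds depends on the surrounding address space layout and the stack limits of the process.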

The do_user_addr_fault() function handles page faults occurring in userspace. It first acquires the read lock mm->mmap_lock [1] and then attempts to expand the memory region [2] if the VM_GROWSDOWN flag of the vma object is set [3].

static inline
void do_user_addr_fault(struct pt_regs *regs,
            unsigned long error_code,
            unsigned long address)
{
    // [...]
    mmap_read_trylock(mm); // [1]

    // [...]
    vma = find_vma(mm, address);
    
    // [...]
    if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { // [3]
        return;
    }

    if (unlikely(expand_stack(vma, address))) { // [2]
        return;
    }

    // [...]
}

int expand_stack(struct vm_area_struct *vma, unsigned long address)
{
    return expand_downwards(vma, address);
}

The expand_downwards() function performs the expanding operation. It retrieves the previous vma object from the maple tree [3] and checks whether it is also a stack (i.e., the VM_GROWSDOWN flag is set) [4]. If it is, the stack guard gap check is skipped. Afterward, the kernel takes the vma->anon_vma write lock [5] and then the mm->page_table_lock spin lock [6]. Finally, it updates the memory range of the vma and stores the change into the Maple Tree [7].

int expand_downwards(struct vm_area_struct *vma, unsigned long address)
{
    struct mm_struct *mm = vma->vm_mm;
    struct vm_area_struct *prev;

    // [...]
    MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_start);
    prev = mas_prev(&mas, 0); // [3]
    if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&  // [4]
            vma_is_accessible(prev)) {
        if (address - prev->vm_end < stack_guard_gap)
            return -ENOMEM;
    }

    // hold write lock &anon_vma->root->rwsem
    anon_vma_lock_write(vma->anon_vma); // [5]

    if (address < vma->vm_start) {
        size = vma->vm_end - address;
        grow = (vma->vm_start - address) >> PAGE_SHIFT;
        if (grow <= vma->vm_pgoff) {
            // [...]
            spin_lock(&mm->page_table_lock); // [6]
            vma->vm_start = address;
            vma->vm_pgoff -= grow;
            vma_mas_store(vma, &mas); // [7]
            spin_unlock(&mm->page_table_lock);
        }
    }

    anon_vma_unlock_write(vma->anon_vma);
    mas_destroy(&mas);
    // [...]
}
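
The arithmetic of the expansion can be replayed in user space. The sketch below models only the guard gap check and the vm_start / vm_pgoff update. struct vma_model and expand_down_model() are hypothetical names, locking and accounting are omitted, address is assumed to be page aligned, and the guard gap value of 256 pages is the default I am assuming here.

```c
#include <errno.h>
#include <stddef.h>

#define PAGE_SHIFT      12
#define STACK_GUARD_GAP (256UL << PAGE_SHIFT) /* assumed default gap */

/* Hypothetical stand-in for the fields of vm_area_struct used here. */
struct vma_model {
    unsigned long vm_start, vm_end, vm_pgoff;
};

/*
 * Grow `vma` downwards so that it starts at `address`.
 * `prev` may be NULL. `prev_growsdown` says whether prev is a stack too.
 */
static int expand_down_model(struct vma_model *vma,
                             const struct vma_model *prev,
                             int prev_growsdown, unsigned long address)
{
    unsigned long grow;

    /* The guard gap is only enforced against a non-stack neighbour. */
    if (prev && !prev_growsdown &&
        address - prev->vm_end < STACK_GUARD_GAP)
        return -ENOMEM;

    if (address >= vma->vm_start)
        return 0; /* nothing to do */

    grow = (vma->vm_start - address) >> PAGE_SHIFT;
    if (grow > vma->vm_pgoff)
        return -ENOMEM;

    vma->vm_start = address;
    vma->vm_pgoff -= grow;
    return 0;
}
```

For example, growing a region that starts at 0x7000 with vm_pgoff 7 down to 0x5000 moves vm_start by two pages and leaves vm_pgoff at 5 in this model.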

2.2. Maple Tree

According to the official documentation, the Maple Tree is a B-Tree optimized for storing non-overlapping ranges, which matches the characteristic that virtual memory areas do not overlap. The original writeup provides an introduction to the Maple Tree. Additionally, an article on the Oracle Blog discusses the pros and cons of using the Maple Tree.

There are some advantages to using the Maple Tree for storing vma objects. Most importantly, unlike an RB-tree that is tightly coupled with the vma, the Maple Tree is RCU-safe because its nodes are separate from the vma.


Let’s continue our analysis. The vma_mas_store() function first sets the memory range of the node [1] and then stores it into the tree [2].

void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
{
    mas_set_range(mas, vma->vm_start, vma->vm_end - 1); // [1]
    mas_store_prealloc(mas, vma); // [2]
}

static inline
void mas_set_range(struct ma_state *mas, unsigned long start, unsigned long last)
{
    mas->index = start;
    mas->last = last;
    mas->node = MAS_START;
}

After the new node is set up, the mas_store_prealloc() function replaces the old node with the new one [3].

void mas_store_prealloc(struct ma_state *mas, void *entry)
{
    MA_WR_STATE(wr_mas, mas, entry);
    mas_wr_store_setup(&wr_mas);
    mas_wr_store_entry(&wr_mas); // [3]
    mas_destroy(mas);
}

I skip a lot of code here since I am not familiar with the Maple Tree implementation 🥲. The execution flow looks like this:

  • mas_wr_store_entry()
  • mas_wr_modify()
  • mas_wr_node_store()
  • mas_replace() (if tree is in RCU-safe mode)
  • mas_free()

Eventually, the mas_free() function is called to free the old node using RCU [4].

static inline void mas_free(struct ma_state *mas, struct maple_enode *used)
{
    struct maple_node *tmp = mte_to_node(used);

    if (mt_in_rcu(mas->tree))
        ma_free_rcu(tmp);
    // [...]
}

static void ma_free_rcu(struct maple_node *node)
{
    // [...]
    call_rcu(&node->rcu, mt_free_rcu); // [4]
}

static void mt_free_rcu(struct rcu_head *head)
{
    struct maple_node *node = container_of(head, struct maple_node, rcu);

    kmem_cache_free(maple_node_cache, node);
}

2.3. Walk a Tree

The find_vma_prev() function is called to retrieve the vma object for a given address. It invokes mas_walk() [1] to walk the tree and returns the vma object.

struct vm_area_struct *
find_vma_prev(struct mm_struct *mm, unsigned long addr,
            struct vm_area_struct **pprev)
{
    struct vm_area_struct *vma;
    MA_STATE(mas, &mm->mm_mt, addr, addr);

    vma = mas_walk(&mas); // [1]
    *pprev = mas_prev(&mas, 0);
    // [...]
    return vma;
}

When mas_walk() is called, the kernel may not hold the RCU read lock, which can lead to issues if a maple node is accessed while it is concurrently freed by mas_free(). Even though the Maple Tree is designed to be RCU-safe, a UAF can still occur if the reader doesn’t hold the RCU read lock.
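
The role of the read lock can be illustrated with a toy, single-threaded model of deferred freeing. It is not how RCU is implemented: the grace period is reduced to a reader counter, a freed flag replaces the real free, and all names are invented for this sketch.

```c
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for a maple node; `freed` replaces a real free. */
struct node_model {
    int freed;
};

struct rcu_model {
    int active_readers;                      /* readers inside a read section */
    struct node_model *pending[MAX_PENDING]; /* nodes queued for freeing */
    size_t npending;
};

static void model_read_lock(struct rcu_model *r)   { r->active_readers++; }
static void model_read_unlock(struct rcu_model *r) { r->active_readers--; }

/* Queue a node for deferred freeing, in the spirit of call_rcu(). */
static int model_call_rcu(struct rcu_model *r, struct node_model *n)
{
    if (r->npending == MAX_PENDING)
        return -1;
    r->pending[r->npending++] = n;
    return 0;
}

/*
 * Try to end a grace period. Queued nodes are only "freed" once no
 * registered reader is left. Returns 1 if the callbacks ran, else 0.
 */
static int model_grace_period(struct rcu_model *r)
{
    size_t i;

    if (r->active_readers > 0)
        return 0;
    for (i = 0; i < r->npending; i++)
        r->pending[i]->freed = 1;
    r->npending = 0;
    return 1;
}
```

A reader that calls model_read_lock() keeps the node alive until it unlocks. A reader that skips the lock is invisible to model_grace_period(), so the node can be marked freed while the reader still holds a pointer to it, which mirrors the situation described above.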

2.4. Exploit

Primitives:

  • Walk the Maple Tree and invoke function pointers by reading /proc/[pid]/maps.
  • Trigger an RCU free operation on a Maple Node by causing a page fault on the stack.
  • Ensure the RCU grace period has elapsed using sys_membarrier(MEMBARRIER_CMD_GLOBAL).

Steps:

  1. (CPU #0) Traverse the Maple Tree.
  2. (CPU #1) Trigger RCU Free on the Maple Node and induce stack expansion.
  3. Perform a cross-cache operation moving objects from the maple_node_cache to the buddy system.
  4. Reclaim freed Maple Node by spraying struct msg_msg.
  5. Use UAF to leak ktext by setting vma of the Maple Node to last IDT entry located at CEA.
  6. Spray task_struct objects by forking some processes.
  7. Use UAF to leak kheap by setting vma of the Maple Node to init_task.tasks.prev.
  8. Terminate the sprayed processes.
  9. Reclaim the freed memory of task_struct by spraying struct msg_msg containing stack pivoting gadget.
  10. Exploit UAF by assigning vma->vm_ops to the kheap, and vma->vm_ops->name() will be called to execute ROP.

2.5. Patch

If you have reviewed the fix commit, you may have noticed that the patch is quite big: it has a long description and touches many files. I think some of these changes are refactoring and are not directly related to this vulnerability.

Rather than acquiring the RCU read lock before walking the Maple Tree, the developers resolved this vulnerability by acquiring the write lock of the mm object [1] before expanding the stack [2].

struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma, *prev;

    mmap_read_unlock(mm);
    if (mmap_write_lock_killable(mm)) // [1]
        return NULL;

    vma = find_vma_prev(mm, addr, &prev);
    // [...]
    if (vma && !vma_expand_down(vma, addr)) // [2]
        goto success;
    // [...]
    mmap_write_unlock(mm);
    return NULL;
}