Seccomp

The seccomp system is used to filter the syscalls that sandboxed processes can use. The form of seccomp used by crosvm (SECCOMP_SET_MODE_FILTER) allows for a BPF program to be used. To generate the BPF programs, crosvm uses minijail's policy file format. A policy file is written for each device per architecture. Each device requires a unique set of syscalls to accomplish their function and each architecture has slightly different naming for similar syscalls. The ChromeOS docs have a useful listing of syscalls.

The seccomp policies are compiled from .policy source files into BPF bytecode by jail/build.rs and embedded in the crosvm executable, so it is not necessary to install the seccomp policy files, only the crosvm binary itself. Be sure to remember to rebuild crosvm after changing a policy file to observe the updated behavior.

Writing a Policy for crosvm

The detailed rules for naming policy files can be found in jail/seccomp/README.md

Most policy files will include the common_device.policy from a given architecture using this directive near the top:

@include /usr/share/policy/crosvm/common_device.policy

The common device policy for x86_64 is:

# This is an allow list of syscalls for most of crosvm devices.
#
# Note that some device policy files don't depend on this policy file
# because of some conflicts such as gpu_common.policy.
# If you want to modify policies for all the devices, please modify
# not only this file but also other *_common.policy files.

@frequency ./common_device.frequency
brk: 1
clock_gettime: 1
clone: arg0 & CLONE_THREAD
clone3: 1
close: 1
dup2: 1
dup: 1
epoll_create1: 1
epoll_ctl: 1
epoll_pwait: 1
epoll_wait: 1
eventfd2: 1
exit: 1
exit_group: 1
ftruncate: 1
futex: 1
getcwd: 1
getpid: 1
gettid: 1
gettimeofday: 1
io_uring_setup: 1
io_uring_register: 1
io_uring_enter: 1
kill: 1
lseek: 1
madvise: arg2 == MADV_DONTNEED || arg2 == MADV_DONTDUMP || arg2 == MADV_REMOVE || arg2 == MADV_MERGEABLE || arg2 == MADV_FREE
membarrier: 1
memfd_create: 1
mmap: arg2 in ~PROT_EXEC
mprotect: arg2 in ~PROT_EXEC
mremap: 1
munmap: 1
nanosleep: 1
clock_nanosleep: 1
pipe2: 1
poll: 1
ppoll: 1
read: 1
readlink: 1
readlinkat: 1
readv: 1
recvfrom: 1
recvmsg: 1
restart_syscall: 1
rseq: 1
rt_sigaction: 1
rt_sigprocmask: 1
rt_sigreturn: 1
sched_getaffinity: 1
sched_yield: 1
sendmsg: 1
sendto: 1
set_robust_list: 1
sigaltstack: 1
tgkill: arg2 == SIGABRT
write: 1
writev: 1
fcntl: 1
uname: 1

## Rules for vmm-swap
userfaultfd: 1
# 0xc018aa3f == UFFDIO_API, 0xaa00 == USERFAULTFD_IOC_NEW
ioctl: arg1 == 0xc018aa3f || arg1 == 0xaa00

The syntax is simple: one syscall per line, followed by a colon :, followed by a boolean expression used to constrain the arguments of the syscall. The simplest expression is 1 which unconditionally allows the syscall. Only simple expressions work, often to allow or deny specific flags. A major limitation is that checking the contents of pointers isn't possible using minijail's policy format. If a syscall is not listed in a policy file, it is not allowed.