daemon: add seccomp filter for slirp4netns.

The container that slirp4netns runs in should already be quite difficult to
do anything malicious in beyond basic denial of service or sending of
network traffic.  There is, however, one hole remaining in the case in which
there is an adversary able to run code locally: abstract unix sockets.
Because these are governed by network namespaces, not IPC namespaces, and
slirp4netns is in the root network namespace, any process in the root
network namespace can cooperate with the slirp4netns process to take over
its user.

To close this, we use seccomp to block the creation of unix-domain sockets
by slirp4netns.  This requires some finesse, since slirp4netns absolutely
needs to be able to create other types of sockets - at minimum AF_INET and
AF_INET6.

Seccomp has many, many pitfalls.  To name a few:

1. Seccomp provides you with an "arch" field, but this does not uniquely
   determine the ABI being used; the actual meaning of a system call number
   depends on both the number (which is often the result of ORing a related
   system call with a flag for an alternate ABI) and the architecture.

2. Seccomp provides no direct way of knowing what the native value for the
   arch field should be; the user must do configure/compile-time testing for
   every architecture+ABI combination they want to support.  Amusingly
   enough, the linux-internal header files have this exact information
   (SECCOMP_ARCH_NATIVE), but they aren't sharing it.

3. The only system call numbers we naturally have are the native ones in
   asm/unistd.h.  __NR_socket will always refer to the system call number
   for the target system's ABI.

4. Seccomp can only manipulate 32-bit words, but represents every system
   call argument as a uint64.

5. New system call numbers with as-yet-unknown semantics can be added to the
   kernel at any time.

6. Based on this comment in arch/x86/entry/syscalls/syscall_32.tbl:

     # 251 is available for reuse (was briefly sys_set_zone_reclaim)

   previously-invalid system call numbers may later be reused for new system
   calls.

7. Most architecture+ABI combinations have system call tables with many gaps
   in them.  arm-eabi, for example, has 35 such gaps (note: this is just the
   number of distinct gaps, not the number of system call numbers contained
   in those gaps).

8. Seccomp's BPF filters require a fully-acyclic control flow graph.  Any
   operation on a data structure must therefore first be fully unrolled
   before it can be run.

9. Seccomp cannot dereference pointers.  Only the raw bits provided to the
   system calls can be inspected.

10. Some architecture+ABI combos have multiplexer system calls.  For example,
    socketcall can perform any socket-related system call.  The arguments to
    the multiplexed system call are passed indirectly, via a pointer to user
    memory.  They therefore cannot be inspected by seccomp.

11. Some valid system calls are not listed in any table in the kernel source.
    For example, __ARM_NR_cacheflush is an "ARM private" system call.  It
    does not appear in any *.tbl file.

12. Conditional branches are limited to relative jumps of at most 256
    instructions forward.

13. Prior to Linux 4.8, any process able to spawn another process and call
    ptrace could bypass seccomp restrictions.

To address (1), (2), and (3), we include preprocessor checks to identify the
native architecture value, and reject all system calls that don't use the
native architecture.

To address (4), we use the AC_C_BIGENDIAN autoconf check to conditionally
define WORDS_BIGENDIAN, and match up the proper portions of any uint64 we
test for with the value in the accumulator being tested against.

To address (5) and (6), we use system call pinning.  That is, we hardcode a
snapshot of all the valid system call numbers at the time of writing, and
reject any system call numbers not in the recorded set.  A set is recorded
for every architecture+ABI combo, and the native one is chosen at
compile-time.  This ensures that not only are non-native architectures
rejected, but so are non-native ABIs.  For the sake of conciseness, we
represent these sets as sets of disjoint ranges.  Due to (7), checking each
range in turn could add a lot of overhead to each system call, so we instead
binary search through the ranges.  Due to (8), this binary search has to be
fully unrolled, so we do that too.

It can be tedious and error-prone to manually produce the syscall ranges by
looking at linux's *.tbl files, since the gaps are often small and
uncommented.  To address this, a script,
build-aux/extract-syscall-ranges.sh, is added that will produce them given a
*.tbl filename and an ABI regex (some tables seem to abuse the ABI field
with strange values like "memfd_secret").  Note that producing the final
values still requires looking at the proper asm/unistd.h file to find any
private numbers and to identify any offsets and ABI variants used.

(10) used to have no good solution, but in the past decade most
architectures have gained dedicated system call alternatives to at least
socketcall, so we can (hopefully) just block it entirely.

To address (13), we block ptrace also.

* build-aux/extract-syscall-ranges.sh: new script.
* Makefile.am (EXTRA_DIST): register it.
* config-daemon.ac: use AC_C_BIGENDIAN.
* nix/libutil/spawn.cc (setNoNewPrivsAction, addSeccompFilterAction): new
  functions.
* nix/libutil/spawn.hh (setNoNewPrivsAction, addSeccompFilterAction): new
  declarations.
  (SpawnContext)[setNoNewPrivs, addSeccompFilter]: new fields.
* nix/libutil/seccomp.hh: new header file.
* nix/libutil/seccomp.cc: new file.
* nix/local.mk (libutil_a_SOURCES, libutil_headers): register them.
* nix/libstore/build.cc (slirpSeccompFilter, writeSeccompFilterDot): new
  functions.
  (spawnSlirp4netns): use them, set seccomp filter for slirp4netns.

Change-Id: Ic92c7f564ab12596b87ed0801b22f88fbb543b95
Signed-off-by: John Kehayias <john.kehayias@protonmail.com>
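
The commit message describes searching the pinned syscall ranges with a binary search that has to be fully unrolled because seccomp filters must be acyclic. The following is a minimal sketch of that idea only, not the code in nix/libutil/seccomp.cc. The names Insn, Op, Range, unrollSearch and runFilter are hypothetical, and Insn is a simplified stand-in for struct sock_filter that keeps one property of classic BPF: jumps are relative, forward-only, and counted from the next instruction. The sketch assumes the ranges are sorted and disjoint, and it does not chain jumps, so it only handles tables small enough for 8-bit offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for struct sock_filter.
enum class Op { JumpIfGE, JumpIfGT, Allow, Deny };
struct Insn {
    Op op;
    uint32_t k;   // constant compared against the syscall number
    uint8_t jt;   // instructions to skip when the comparison is true
    uint8_t jf;   // instructions to skip when it is false
};

struct Range { uint32_t low, high; };  // inclusive bounds

// Emit a decision tree for ranges[lo, hi).  Each recursion level is one
// step of a binary search, written out as straight-line code.
static std::vector<Insn> unrollSearch(const std::vector<Range> & ranges,
                                      size_t lo, size_t hi)
{
    if (lo == hi)
        return std::vector<Insn>(1, Insn{Op::Deny, 0, 0, 0});
    size_t mid = lo + (hi - lo) / 2;
    std::vector<Insn> left = unrollSearch(ranges, lo, mid);
    std::vector<Insn> right = unrollSearch(ranges, mid + 1, hi);
    // Limitation of this sketch: no jump chaining for long offsets.
    assert(left.size() + 1 <= 255);
    std::vector<Insn> out;
    // nr < low: skip the next two instructions and search the left half.
    out.push_back({Op::JumpIfGE, ranges[mid].low, 0, 2});
    // nr > high: skip "allow" and the left half to reach the right half.
    out.push_back({Op::JumpIfGT, ranges[mid].high,
                   static_cast<uint8_t>(left.size() + 1), 0});
    out.push_back({Op::Allow, 0, 0, 0});
    out.insert(out.end(), left.begin(), left.end());
    out.insert(out.end(), right.begin(), right.end());
    return out;
}

// Tiny interpreter, used only to check the generated program.
static bool runFilter(const std::vector<Insn> & prog, uint32_t nr)
{
    size_t pc = 0;
    while (true) {
        const Insn & i = prog.at(pc);
        switch (i.op) {
        case Op::Allow: return true;
        case Op::Deny: return false;
        case Op::JumpIfGE: pc += 1 + (nr >= i.k ? i.jt : i.jf); break;
        case Op::JumpIfGT: pc += 1 + (nr > i.k ? i.jt : i.jf); break;
        }
    }
}
```

Calling unrollSearch(ranges, 0, ranges.size()) yields a program in which every syscall number passes through at most about 2 * log2(n) comparisons, instead of up to 2 * n when each range is checked in turn. The price is program size: each range costs three instructions, plus one deny instruction per gap.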
author: Reepca Russelstein <reepca@russelstein.xyz> 2025-04-29 08:17:38 -0500
committer: John Kehayias <john.kehayias@protonmail.com> 2025-06-24 10:07:58 -0400
commit: c659f977bb09de6d5615e6aa9efddedc1d9ff458 (patch)
tree: 16aa5273c64981b6abfe25e851ad2e61cbcd4df2 /nix/libstore/build.cc
parent: fb42611b8f27960304db5a1c0d33b8371dcde2a8 (diff)
1 files changed, 219 insertions, 0 deletions
diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc
index 1a688f3b56..eee3a33a58 100644
--- a/nix/libstore/build.cc
+++ b/nix/libstore/build.cc
@@ -85,6 +85,13 @@
 /* This header isn't documented in 'man netdevice', but there doesn't seem to
    be any other way to get 'struct in6_ifreq'... */
 #include <linux/ipv6.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <seccomp.hh>
+
+/* Set to 1 to debug the seccomp filter.  */
+#define DEBUG_SECCOMP_FILTER 0
+
 #endif
 #endif
 
@@ -1815,6 +1822,7 @@ static void setupTap(int send_fd_socket, bool ipv6Enabled)
     sendFD(send_fd_socket, tapfd);
 }
 
+
 struct ChrootBuildSpawnContext : CloneSpawnContext {
     bool ipv6Enabled = false;
 };
@@ -1933,6 +1941,212 @@ static void remapIdsTo0Action(SpawnContext & sctx)
 }
 
 
+static std::vector<struct sock_filter> slirpSeccompFilter()
+{
+    std::vector<struct sock_filter> out;
+    struct sock_filter allow = BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
+    struct sock_filter deny = BPF_STMT(BPF_RET | BPF_K,
+                                       /* Could also use
+                                        * SECCOMP_RET_KILL_THREAD, but this
+                                        * gives nicer error messages. */
+                                       SECCOMP_RET_ERRNO | ENOSYS);
+    struct sock_filter silentDeny = BPF_STMT(BPF_RET | BPF_K,
+                                             SECCOMP_RET_ERRNO | 0);
+
+    /* instructions to check for AF_INET or AF_INET6 in the first argument */
+    std::vector<struct sock_filter> allowInet;
+    seccompMatchu64(allowInet,
+                    AF_INET,
+                    {allow},
+                    offsetof(struct seccomp_data, args[0]));
+    seccompMatchu64(allowInet,
+                    AF_INET6,
+                    {allow},
+                    offsetof(struct seccomp_data, args[0]));
+    /* ... and deny otherwise */
+    std::vector<struct sock_filter> denyNonInet;
+    denyNonInet.insert(denyNonInet.begin(), allowInet.begin(), allowInet.end());
+    denyNonInet.push_back(deny);
+
+    /* ... and silent variant. */
+    std::vector<struct sock_filter> silentDenyNonInet;
+
+    silentDenyNonInet.insert(silentDenyNonInet.begin(), allowInet.begin(), allowInet.end());
+    silentDenyNonInet.push_back(silentDeny);
+
+    /* accumulator <-- data.arch */
+    out.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))));
+    /* Deny if non-native arch.  This simplifies checks as we can now just use
+     * the __NR_* syscall numbers. */
+    out.push_back(BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
+                           AUDIT_ARCH_NATIVE,
+                           1,
+                           0));
+    out.push_back(deny);
+
+    std::vector<Uint32RangeAction> specialCaseActions;
+
+#ifdef __NR_socket
+    Uint32RangeAction socketAction;
+    socketAction.low = __NR_socket;
+    socketAction.high = __NR_socket;
+    socketAction.instructions = denyNonInet;
+    specialCaseActions.push_back(socketAction);
+#endif
+
+#ifdef __NR_socketpair
+    /* socketpair can be used to create unix sockets.  Presumably they can't
+     * be re-bound or reconnected to use the abstract unix socket namespace,
+     * since they're already connected, but let's not risk it - slirp4netns
+     * shouldn't have a reason to use any IPC anyway. */
+    Uint32RangeAction socketpairAction;
+    socketpairAction.low = __NR_socketpair;
+    socketpairAction.high = __NR_socketpair;
+    /* The silent variant is necessary for socketpair because slirp4netns
+       unconditionally creates a unix socket using socketpair for using setns
+       to exfiltrate a tapfd, despite not actually needing to do that at all
+       since we pass it the tapfd directly.  It will refuse to start if
+       socketpair returns anything but 0, so we have no choice but to do that.
+       The would-be-returned socket fds are never used. */
+    socketpairAction.instructions = silentDenyNonInet;
+    specialCaseActions.push_back(socketpairAction);
+#endif
+
+#ifdef __NR_socketcall
+    /* Some architectures include a system call "socketcall" for multiplexing
+     * all the socket-related calls.  This system call only accepts two
+     * arguments: a number to indicate which socket-related system call to
+     * invoke, and a pointer to an array holding the arguments for it.
+     * Seccomp can't inspect the contents of memory, only the raw bits passed
+     * to the kernel, so there's no way to only disallow certain invocations
+     * of a socket-related system call.  In the past decade, most linux
+     * architectures which relied on "socketcall" have since added dedicated
+     * system calls (socket, socketpair, connect, etc) that can be used
+     * instead of socketcall, and it was mostly uncommon architectures that
+     * relied on it in the first place, so we should be fine to just block it
+     * outright. */
+    Uint32RangeAction socketcallAction;
+    socketcallAction.low = __NR_socketcall;
+    socketcallAction.high = __NR_socketcall;
+    socketcallAction.instructions = {deny};
+    specialCaseActions.push_back(socketcallAction);
+#endif
+
+    /* Kernels before 4.8 allow a process to bypass seccomp restrictions by
+     * spawning another process to ptrace it and modify a system call after
+     * the seccomp check. */
+    Uint32RangeAction ptraceAction;
+    ptraceAction.low = __NR_ptrace;
+    ptraceAction.high = __NR_ptrace;
+    ptraceAction.instructions = { deny };
+    specialCaseActions.push_back(ptraceAction);
+
+    std::vector<struct sock_filter> specialCases =
+        rangeActionsToFilter(specialCaseActions);
+
+    /* accumulator <-- data.nr */
+    out.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))));
+
+    out.insert(out.end(), specialCases.begin(), specialCases.end());
+
+    /* accumulator <-- data.nr again */
+    out.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))));
+
+    std::vector<Uint32RangeAction> pinnedSyscallRanges = NATIVE_SYSCALL_RANGES;
+    if(pinnedSyscallRanges.size() != 0) {
+        for(auto & i : pinnedSyscallRanges) {
+            i.instructions.push_back(allow);
+        }
+        std::vector<struct sock_filter> pinnedWhitelist = rangeActionsToFilter(pinnedSyscallRanges);
+        out.insert(out.end(), pinnedWhitelist.begin(), pinnedWhitelist.end());
+        out.push_back(deny);
+    }
+    else {
+        /* Couldn't determine pinned system calls, resort to allowing by
+         * default. */
+        out.push_back(allow);
+    }
+    return out;
+}
+
+
+#if DEBUG_SECCOMP_FILTER
+
+/* Note: limited to only the subset we actually use, makes various
+ * assumptions, not general-purpose. */
+static void writeSeccompFilterDot(std::vector<struct sock_filter> filter, FILE *f)
+{
+    fprintf(f, "digraph filter { \n");
+    for(size_t j = 0; j < filter.size(); j++) {
+        switch(BPF_CLASS(filter[j].code)) {
+        case BPF_LD:
+            fprintf(f, "\"%zu\" [label=\"load into accumulator from offset %u\"];\n",
+                    j, filter[j].k);
+            fprintf(f, "\"%zu\" -> \"%zu\";\n", j, j + 1);
+            break;
+        case BPF_JMP:
+            switch(BPF_OP(filter[j].code)) {
+            case BPF_JA:
+                fprintf(f, "\"%zu\" [label=\"unconditional jump\"];\n", j);
+                fprintf(f, "\"%zu\" -> \"%zu\";\n", j, j + filter[j].k + 1);
+                break;
+            case BPF_JEQ:
+                fprintf(f, "\"%zu\" [label=\"jump if accumulator = %u\"];\n", j,
+                        filter[j].k);
+                fprintf(f, "\"%zu\" -> \"%zu\" [label=\"true\"];\n", j,
+                        j + filter[j].jt + 1);
+                fprintf(f, "\"%zu\" -> \"%zu\" [label=\"false\"];\n", j,
+                        j + filter[j].jf + 1);
+                break;
+            case BPF_JGT:
+                fprintf(f, "\"%zu\" [label=\"jump if accumulator > %u\"];\n", j,
+                        filter[j].k);
+                fprintf(f, "\"%zu\" -> \"%zu\" [label=\"true\"];\n", j,
+                        j + filter[j].jt + 1);
+                fprintf(f, "\"%zu\" -> \"%zu\" [label=\"false\"];\n", j,
+                        j + filter[j].jf + 1);
+                break;
+            case BPF_JGE:
+                fprintf(f, "\"%zu\" [label=\"jump if accumulator >= %u\"];\n", j,
+                        filter[j].k);
+                fprintf(f, "\"%zu\" -> \"%zu\" [label=\"true\"];\n", j,
+                        j + filter[j].jt + 1);
+                fprintf(f, "\"%zu\" -> \"%zu\" [label=\"false\"];\n", j,
+                        j + filter[j].jf + 1);
+                break;
+            default:
+                fprintf(stderr, "unrecognized jump operation at %zu: %d\n", j, BPF_OP(filter[j].code));
+            }
+            break;
+        case BPF_RET:
+            switch(filter[j].k & SECCOMP_RET_ACTION_FULL) {
+            case SECCOMP_RET_KILL_PROCESS:
+                fprintf(f, "\"%zu\" [label=\"kill the process\"];\n", j);
+                break;
+            case SECCOMP_RET_KILL_THREAD:
+                fprintf(f, "\"%zu\" [label=\"kill the thread\"];\n", j);
+                break;
+            case SECCOMP_RET_ERRNO:
+                fprintf(f, "\"%zu\" [label=\"return errno for \\\"%s\\\"\"];\n",
+                        j, strerror(filter[j].k & SECCOMP_RET_DATA));
+                break;
+            case SECCOMP_RET_ALLOW:
+                fprintf(f, "\"%zu\" [label=\"allow system call\"];\n", j);
+                break;
+            default:
+                fprintf(stderr, "unrecognized return operation at %zu: %d\n", j, filter[j].k);
+                break;
+            }
+            break;
+        default:
+            fprintf(stderr, "unrecognized bpf class at %zu: %d\n", j, BPF_CLASS(filter[j].code));
+        }
+    }
+    fprintf(f, "}\n");
+}
+
+#endif
+
 /* Spawn 'slirp4netns' in separate namespaces as the given user and group;
    'tapfd' must correspond to a /dev/net/tun connection.  Configure it to
    write to 'notifyReadyFD' once it's up and running.  */
@@ -2016,6 +2230,11 @@ static pid_t spawnSlirp4netns(int tapfd, int notifyReadyFD,
         slirpCtx.logFD = devNullFd;
     }
 
+#if DEBUG_SECCOMP_FILTER
+    writeSeccompFilterDot(slirpCtx.seccompFilter, stderr);
+    fflush(stderr);
+#endif
+
     addPhaseAfter(slirpCtx.phases,
                   "makeChrootSeparateFilesystem",
                   "prepareSlirpChroot",
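
slirpSeccompFilter in the diff above relies on seccompMatchu64, which is defined in nix/libutil/seccomp.cc and is not shown here. As a rough sketch of what such a helper might do, assuming the behaviour implied by the commit message (compare a 64-bit argument one 32-bit word at a time, picking word offsets by endianness), consider the following. The name matchU64Sketch and its exact semantics are assumptions, not the real implementation, and the compiler's predefined byte-order macros stand in for the WORDS_BIGENDIAN definition that AC_C_BIGENDIAN would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <linux/filter.h>
#include <linux/seccomp.h>

// Assumption: __BYTE_ORDER__ replaces the autoconf WORDS_BIGENDIAN check.
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static const uint32_t lowWordOffset = 4, highWordOffset = 0;
#else
static const uint32_t lowWordOffset = 0, highWordOffset = 4;
#endif

// Hypothetical helper: append instructions that run `onMatch` when the
// 64-bit field at `offset` in struct seccomp_data equals `value`, and fall
// through to whatever follows otherwise.  Clobbers the accumulator.
static void matchU64Sketch(std::vector<struct sock_filter> & out,
                           uint64_t value,
                           const std::vector<struct sock_filter> & onMatch,
                           uint32_t offset)
{
    // Conditional jump offsets are only 8 bits wide.
    assert(onMatch.size() + 2 <= 255);
    uint8_t skipAll = static_cast<uint8_t>(onMatch.size() + 2);
    uint8_t skipMatch = static_cast<uint8_t>(onMatch.size());
    uint32_t low = static_cast<uint32_t>(value);
    uint32_t high = static_cast<uint32_t>(value >> 32);

    // accumulator <-- low 32 bits; on mismatch skip everything below
    out.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offset + lowWordOffset));
    out.push_back(BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, low, 0, skipAll));
    // accumulator <-- high 32 bits; on mismatch skip only onMatch
    out.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offset + highWordOffset));
    out.push_back(BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, high, 0, skipMatch));
    out.insert(out.end(), onMatch.begin(), onMatch.end());
}
```

Checking both words matters because the kernel presents every argument as 64 bits: a filter that compared only the low word would treat 0x100000002 as equal to AF_INET (2). Since a helper of this shape overwrites the accumulator, a caller has to reload data.nr before dispatching on the syscall number again, which matches the second load of data.nr in slirpSeccompFilter.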