Sandlock
Lightweight process sandbox for Linux. Confines untrusted code using Landlock (filesystem + network + IPC), seccomp-bpf (syscall filtering), and seccomp user notification (resource limits, IP enforcement, /proc virtualization). No root, no cgroups, no containers.
sandlock run -w /tmp -r /usr -r /lib -m 512M -- python3 untrusted.py
Why Sandlock?
Containers and VMs are powerful but heavy. Sandlock targets the gap: strict confinement without image builds or root privileges. The built-in copy-on-write (COW) filesystem protects your working directory automatically.
| Feature | Sandlock | Container | MicroVM (Firecracker) |
|---|---|---|---|
| Root required | No | Yes* | Yes (KVM) |
| Image build | No | Yes | Yes |
| Startup time | ~5 ms | ~200 ms | ~100 ms |
| Kernel | Shared | Shared | Separate guest |
| Filesystem isolation | Landlock + seccomp COW | Overlay | Block-level |
| Network isolation | Landlock + seccomp notif | Network namespace | TAP device |
| HTTP-level ACL | Method + host + path rules | N/A | N/A |
| Syscall filtering | seccomp-bpf | seccomp | N/A |
| Resource limits | seccomp notif + SIGSTOP | cgroup v2 | VM config |
* Rootless containers exist but require user namespace support and /etc/subuid configuration.
Architecture
Sandlock is implemented in Rust for performance and safety:
- sandlock-core: Rust library (Landlock, seccomp, supervisor, COW, pipeline)
- sandlock-cli: Rust CLI binary (sandlock run ...)
- sandlock-ffi: C ABI shared library (libsandlock_ffi.so)
- Python SDK: ctypes bindings to the FFI library
┌─────────────┐
│ Python SDK  │   ctypes FFI
│ (sandlock)  │──────────────┐
└─────────────┘              │
                             ▼
┌──────────────┐    ┌──────────────────────────────┐
│ sandlock CLI │───>│      libsandlock_ffi.so      │
└──────────────┘    └──────────────┬───────────────┘
                                   │
                    ┌──────────────▼───────────────┐
                    │        sandlock-core         │
                    │  Landlock · seccomp · COW ·  │
                    │ pipeline · policy_fn · vDSO  │
                    └──────────────────────────────┘
Requirements
- Linux 6.12+ (Landlock ABI v6), Rust 1.70+ (to build)
- Python 3.8+ (optional, for Python SDK)
- No root, no cgroups
| Feature | Minimum kernel |
|---|---|
| seccomp user notification | 5.6 |
| Landlock filesystem rules | 5.13 |
| Landlock TCP port rules | 6.7 (ABI v4) |
| Landlock IPC scoping | 6.12 (ABI v6) |
Install
From source
# Build the Rust binary and shared library
# Install Python SDK (auto-builds Rust FFI library)
&&
CLI only
Quick Start
CLI
# Basic confinement
# Interactive shell
# Resource limits + timeout
# Outbound allowlist — restrict to one host on one port
# Multiple ports for one host, plus a separate any-IP port
# Wildcard port — `host:*` permits every port to the host
# Unrestricted outbound — `:*` opens any host and any TCP port. For full
# egress add a UDP wildcard via the `udp://*:*` scheme.
# UDP — scheme prefix gates the protocol and scopes the destination
# (e.g. DNS to 1.1.1.1, plus TCP HTTPS to anywhere)
# Ping — kernel ping socket (SOCK_DGRAM) gated by net.ipv4.ping_group_range
# HTTP-level ACL (method + host + path rules via transparent proxy)
# HTTP rules with concrete hosts auto-extend --net-allow with host:80,443
# HTTPS MITM with user-provided CA (enables ACL on port 443)
# Generate a CA, add the cert to the sandbox's trust store
# (e.g. /etc/ssl/certs/), then pass both files here.
# Server listening on a port (Landlock --net-bind, separate from --net-allow)
# Clean environment
# Deterministic execution (frozen time + seeded randomness)
# Port virtualization (multiple sandboxes can bind the same port)
# Port virtualization with named sandboxes (enables network discovery)
# List all running sandboxes
# Kill a running sandbox by name
# Chroot with per-sandbox mount (no kernel bind mount needed)
# COW filesystem (writes captured, committed on success)
# Dry-run (show what files would change, then discard)
# Use a saved profile
# No-supervisor mode (Landlock + deny-only seccomp, no supervisor process)
# Nested sandboxing: confine sandlock's own supervisor
Python API
=
# Run a command (with optional timeout in seconds)
=
assert
assert b in
# HTTP ACL: only allow specific API calls
=
=
# Chroot with per-sandbox mount (Docker-style -v, no root needed)
=
=
# Port virtualization: query port mappings while sandbox is running
=
# sb.ports() returns {virtual_port: real_port} while running
# Confine the current process (Landlock filesystem only, irreversible)
# Dry-run: see what files would change, then discard
=
=
# A=added, M=modified, D=deleted
Pipeline
Chain sandboxed stages with the | operator — each stage has its own
independent sandbox config. Data flows through kernel pipes.
=
=
# Reader can access data, processor cannot
=
assert b in
XOA pattern (eXecute Over Architecture) — planner generates code, executor runs it with data access but no network:
=
=
=
Dynamic Policy (policy_fn)
Inspect syscall events at runtime and adjust permissions on the fly.
Events carry syscall name, category, PID, network destination (for
connect/sendto/bind), and argv (for execve). The callback
returns a verdict to allow, deny, or audit.
# Block download tools by argv
return True # deny
# Deny connections to a specific IP
return
# Lock down once the program has finished starting up
# block all network
# dynamic fs deny
# Audit every file access (allow but flag)
return
return 0 # allow
=
=
Verdicts: 0/False = allow, True/-1 = deny (EPERM),
positive int = deny with errno, "audit"/-2 = allow + flag.
Event fields: syscall, category (file/network/process/memory),
pid, parent_pid, host, port, argv, denied.
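The verdict values listed above can be read as a small normalization step. The sketch below is illustrative only: `Decision` and `normalize_verdict` are hypothetical names, not part of the Sandlock SDK, and the code encodes nothing beyond the mapping stated above. The one subtlety it shows is that Python's `True` is also the integer 1, so booleans have to be checked before the "positive int = errno" case.

```python
import errno
from typing import NamedTuple, Optional, Union


class Decision(NamedTuple):
    """Hypothetical normalized form of a policy_fn return value."""
    allow: bool
    audit: bool
    err: Optional[int]  # errno reported to the child when denied


def normalize_verdict(v: Union[bool, int, str]) -> Decision:
    # bool first: isinstance(True, int) is True, and True must mean EPERM,
    # not "deny with errno 1".
    if isinstance(v, bool):
        return Decision(False, False, errno.EPERM) if v else Decision(True, False, None)
    if v == "audit" or v == -2:
        return Decision(True, True, None)        # allow, but flag the event
    if isinstance(v, int):
        if v == 0:
            return Decision(True, False, None)   # plain allow
        if v == -1:
            return Decision(False, False, errno.EPERM)
        if v > 0:
            return Decision(False, False, v)     # deny with the given errno
    raise ValueError(f"unrecognized verdict: {v!r}")
```

For example, `normalize_verdict(errno.EACCES)` yields a denial carrying `EACCES`, while `normalize_verdict("audit")` yields an allowed, flagged decision.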
TOCTOU note: Per seccomp_unotify(2), the kernel re-reads user-memory pointers after Continue. Sandlock handles this in two places:
- Path strings are not exposed on events. Path-based access control belongs in static Landlock rules (fs_readable/fs_writable/fs_denied), which are kernel-enforced and TOCTOU-immune. Use ctx.deny_path() for runtime additions.
- event.argv is exposed and TOCTOU-safe. Before exposing argv to policy_fn or returning Continue for an execve, the supervisor freezes every task in ProcessIndex, including peer processes that may alias argv through shared memory. With policy_fn active, fork-like syscalls are traced for one ptrace creation event, so children are registered in ProcessIndex before they can run user code. If the freeze or creation tracking cannot be established (e.g., YAMA blocks ptrace), the syscall is denied with EPERM; the safety invariant is never silently relaxed.
Context methods:
- ctx.restrict_network(ips) / ctx.grant_network(ips): network control
- ctx.restrict_max_memory(bytes) / ctx.restrict_max_processes(n): resource limits
- ctx.deny_path(path) / ctx.allow_path(path): dynamic filesystem restriction
- ctx.restrict_pid_network(pid, ips): per-PID network override
Held syscalls (child blocked until callback returns): execve,
connect, sendto, bind, openat.
Rust API
use ;
use ByteSize;
use Verdict;
// Basic run
let mut sandbox = builder
.fs_read.fs_read
.fs_write
.max_memory
.name
.build?;
let result = sandbox.run.await?;
assert!;
// HTTP ACL: restrict API access at the HTTP level
let mut agent = builder
.fs_read.fs_read.fs_read
.http_allow
.http_deny
.name
.build?;
let result = agent.run.await?;
// Confine the current process (Landlock filesystem only, irreversible)
let confinement = builder
.fs_read.fs_read
.fs_write
.build;
confine?;
// Pipeline
let producer = builder
.fs_read.fs_read.fs_read
.build?;
let consumer = producer.clone;
let result = .run.await?;
// Dynamic policy
let mut dynamic = builder
.fs_read.fs_read
.policy_fn
.build?;
let result = dynamic.run.await?;
Profiles
Save reusable sandbox profiles as TOML files in
~/.config/sandlock/profiles/. Profiles use a sectioned schema; top-level
flat keys such as fs_readable = [...] are rejected. Pass a sandbox instance
name with --name when you need a stable virtual hostname.
# ~/.config/sandlock/profiles/build.toml
[]
= "make"
= ["-j4"]
= true
= { = "gcc", = "C.UTF-8" }
[]
= ["/usr", "/lib", "/lib64", "/bin", "/etc"]
= ["/tmp/work"]
[]
= "512M"
= 50
[]
= []
How It Works
Sandlock applies confinement in sequence after fork():
Parent Child
│ fork() │
│──────────────────────────────────>│
│ ├─ 1. setpgid(0,0)
│ ├─ 2. Optional: chdir(cwd)
│ ├─ 3. NO_NEW_PRIVS
│ ├─ 4. Landlock (fs + net + IPC)
│ ├─ 5. seccomp filter (deny + notif)
│ │ └─ send notif fd ──> Parent
│ receive notif fd ├─ 6. Wait for "ready" signal
│ start supervisor (tokio) ├─ 7. Close fds 3+
│ optional: vDSO patching └─ 8. exec(cmd)
│ optional: policy_fn thread
│ optional: CPU throttle task
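The cheaper steps of the sequence above can be sketched in a few lines of Python. This is a rough illustration, not Sandlock's implementation: `spawn_confined` is a hypothetical name, the Landlock and seccomp steps (4 and 5) are left out entirely, and the "ready" signal is modeled as a single byte on a pipe, which is an assumption about the mechanism.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38  # constant from <linux/prctl.h>
_libc = ctypes.CDLL(None, use_errno=True)


def spawn_confined(argv):
    """Fork, apply steps 1, 3, 6, 7 and 8 of the sequence, and return the exit code."""
    ready_r, ready_w = os.pipe()          # parent -> child "ready" signal
    pid = os.fork()
    if pid == 0:                          # child
        try:
            os.close(ready_w)
            os.setpgid(0, 0)              # step 1: own process group
            rc = _libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1),
                             ctypes.c_ulong(0), ctypes.c_ulong(0),
                             ctypes.c_ulong(0))
            if rc != 0:                   # step 3 failed: refuse to continue
                os._exit(126)
            # Steps 4-5 (Landlock ruleset, seccomp filter) would be applied here.
            os.read(ready_r, 1)           # step 6: block until the parent is ready
            os.closerange(3, 256)         # step 7: drop inherited descriptors
            os.execvp(argv[0], argv)      # step 8: replace the process image
        finally:
            os._exit(127)                 # only reached if something above raised
    os.close(ready_r)
    # A real supervisor would receive the seccomp notification fd and start
    # its event loop before releasing the child.
    os.write(ready_w, b"\0")
    os.close(ready_w)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -os.WTERMSIG(status)
```

Setting NO_NEW_PRIVS before installing a seccomp filter matters because an unprivileged process is otherwise not allowed to load one.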
Seccomp Supervisor
The async notification supervisor (tokio) handles intercepted syscalls:
| Syscall | Handler |
|---|---|
| clone/fork/vfork | Process count enforcement |
| mmap/munmap/brk/mremap | Memory limit tracking |
| connect/sendto/sendmsg | IP allowlist + on-behalf execution + HTTP ACL redirect |
| bind | On-behalf bind + port remapping |
| openat | /proc virtualization, COW interception |
| unlinkat/mkdirat/renameat2 | COW write interception |
| execve/execveat | policy_fn hold + vDSO re-patching |
| getrandom | Deterministic PRNG injection |
| clock_nanosleep/timer_settime | Timer adjustment for frozen time |
| getdents64 | PID filtering, COW directory merging |
| getsockname | Port remap translation |
Custom Handlers
Downstream Rust crates can append their own seccomp-notification
handlers to the supervisor chain alongside the builtins, registering
for any syscall they care about via the Handler trait and
Sandbox::run_with_handlers. The builtin chain runs first, so
user handlers cannot subvert confinement; the registration step also
rejects handlers on syscalls in the default blocklist or
extra_deny_syscalls. See
docs/extension-handlers.md for the
full API, ordering semantics, and state patterns.
COW Filesystem
Two modes of copy-on-write filesystem isolation:
Seccomp COW (default when workdir is set): Intercepts filesystem
syscalls via seccomp notification. Writes go to an upper directory;
reads resolve upper-then-lower. No mount namespace, no root. Committed
on exit, aborted on error.
OverlayFS COW: Uses kernel OverlayFS in a user namespace. Requires unprivileged user namespaces to be enabled.
Dry-run mode: --dry-run runs the command, inspects the COW layer
for changes (added/modified/deleted files), prints a summary, then
aborts — leaving the workdir completely untouched. Useful for previewing
what a command would do before committing.
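The upper-then-lower lookup and the commit/abort split described above can be modeled with two plain directories. The following is a toy sketch under stated assumptions: `CowLayer` and its methods are hypothetical names, it works on relative file paths only, and it does not intercept syscalls the way the seccomp-based mode does.

```python
import os
import shutil
import tempfile


class CowLayer:
    """Toy copy-on-write view: writes land in `upper`, reads fall back to `lower`."""

    def __init__(self, lower: str):
        self.lower = lower
        self.upper = tempfile.mkdtemp(prefix="cow-upper-")
        self.deleted = set()              # relative paths removed in this session

    def resolve(self, rel: str):
        """Return the real path to read for `rel`, or None if it is gone."""
        if rel in self.deleted:
            return None
        for root in (self.upper, self.lower):     # upper first, then lower
            candidate = os.path.join(root, rel)
            if os.path.exists(candidate):
                return candidate
        return None

    def write(self, rel: str, data: bytes) -> None:
        target = os.path.join(self.upper, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(data)
        self.deleted.discard(rel)

    def unlink(self, rel: str) -> None:
        upper_path = os.path.join(self.upper, rel)
        if os.path.exists(upper_path):
            os.remove(upper_path)
        self.deleted.add(rel)

    def changes(self):
        """List (status, path) pairs: A=added, M=modified, D=deleted."""
        out = []
        for dirpath, _, files in os.walk(self.upper):
            for name in files:
                rel = os.path.relpath(os.path.join(dirpath, name), self.upper)
                in_lower = os.path.exists(os.path.join(self.lower, rel))
                out.append(("M" if in_lower else "A", rel))
        out += [("D", rel) for rel in self.deleted
                if os.path.exists(os.path.join(self.lower, rel))]
        return sorted(out)

    def commit(self) -> None:
        """Apply the recorded changes to `lower`, then drop the upper layer."""
        for status, rel in self.changes():
            dst = os.path.join(self.lower, rel)
            if status == "D":
                os.remove(dst)
            else:
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(os.path.join(self.upper, rel), dst)
        self.abort()

    def abort(self) -> None:
        """Discard the upper layer; `lower` is left untouched."""
        shutil.rmtree(self.upper, ignore_errors=True)
```

In this model a dry run is simply `changes()` followed by `abort()`, which is why the working directory is never modified.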
COW Fork & Map-Reduce
Initialize expensive state once, then fork COW clones that share memory.
Each clone uses raw fork(2) with shared copy-on-write pages. Creating 1000
clones takes ~530 ms (~1,900 forks/sec).
Each clone's stdout is captured via its own pipe. reduce() reads all
pipes and feeds combined output to a reducer's stdin — fully pipe-based
data flow with no temp files.
global ,
= # 2 GB, loaded once
=
=
# stdout → per-clone pipe
# Map: fork 4 clones with a separate sandbox config
=
=
# Reduce: pipe clone outputs to reducer stdin
=
=
# b"total\n"
let mut mapper = builder
.fs_read.fs_read.fs_read.fs_read
.fs_read
.name
.init_fn
.work_fn
.build?;
let mut clones = mapper.fork.await?;
let reducer = builder
.fs_read.fs_read.fs_read.fs_read
.name
.build?;
let result = reducer.reduce.await?;
Map and reduce run in separate sandboxes with independent configs —
the mapper has data access, the reducer doesn't. Each clone inherits
Landlock + seccomp confinement. CLONE_ID=0..N-1 is set automatically.
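The fork-and-pipe data flow described in this section can be approximated with the standard library alone. The sketch below is not the Sandlock API: `fork_map` and `reduce` are hypothetical stand-ins, no sandboxing is applied, and the reducer is an ordinary function rather than a separate sandboxed process reading stdin.

```python
import os

DATA = list(range(1000))    # stands in for expensive state, built once before forking


def fork_map(n, work):
    """Fork n clones; each runs work(clone_id) with fd 1 wired to its own pipe."""
    clones = []                                   # (pid, read_fd) per clone
    for clone_id in range(n):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                              # child: parent pages shared copy-on-write
            code = 1
            try:
                os.close(r)
                os.environ["CLONE_ID"] = str(clone_id)
                os.dup2(w, 1)                     # stdout -> this clone's pipe
                work(clone_id)
                code = 0
            finally:
                os._exit(code)
        os.close(w)                               # parent keeps only the read end
        clones.append((pid, r))
    return clones


def reduce(clones, reducer):
    """Drain every clone's pipe, reap the clone, and feed the combined bytes to reducer."""
    chunks = []
    for pid, r in clones:
        with os.fdopen(r, "rb") as f:
            chunks.append(f.read())               # returns at EOF, when the clone exits
        os.waitpid(pid, 0)
    return reducer(b"".join(chunks))


def work(clone_id):
    part = DATA[clone_id::4]                      # each clone sums its own slice
    # Write to fd 1 directly so no userspace buffer is lost at os._exit().
    os.write(1, f"{sum(part)}\n".encode())


total = reduce(fork_map(4, work), lambda out: sum(int(x) for x in out.split()))
```

Because each clone writes to a private pipe, partial results never interleave, and nothing touches the filesystem.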
Network Model
Outbound traffic is gated by a single endpoint allowlist that names
protocol × destination. Each --net-allow rule is one of:
--net-allow <spec> repeatable; no rules = deny all outbound
bare form host:port[,port,...] / :port / *:port / host:* / :* / *:* (TCP)
tcp:// same suffix grammar — explicit TCP
udp:// same suffix grammar — UDP (`udp://*:*` opens any UDP)
icmp:// host or `*`, no port — kernel ping socket (SOCK_DGRAM)
Multiple rules are OR'd. A destination is permitted iff some rule matches the same protocol as the socket plus the destination IP and port (port is N/A for ICMP).
Protocol gating falls out of rule presence per scheme:
- No UDP rule → UDP socket creation is denied at the seccomp layer.
- No ICMP rule → kernel ping socket creation (SOCK_DGRAM + IPPROTO_ICMP) is denied at the seccomp layer.
- Raw ICMP (SOCK_RAW + IPPROTO_ICMP) is never exposed; packet crafting is out of scope. Workloads that need ping should rely on the host's net.ipv4.ping_group_range and use the dgram path above (--net-allow icmp://...).
- TCP is always permitted at the syscall level; destinations are governed by Landlock and/or the on-behalf path.
Defaults. With no --net-allow and no HTTP ACL flags, Landlock
denies every TCP connect(), UDP / ICMP / raw socket creation are
denied at the seccomp layer, and there is no on-behalf path active.
For unrestricted TCP egress, opt in explicitly with
--net-allow :*; for any UDP, add --net-allow udp://*:*.
Resolution. Concrete hostnames are resolved once at sandbox start
and pinned in a synthetic /etc/hosts (across all protocols). The
synthetic file replaces the real one only when at least one rule has
a concrete host; pure :port / udp://*:* / icmp://* rules leave
the real /etc/hosts and DNS visible.
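Resolve-once pinning can be pictured as a function from the rule hostnames to the text of the synthetic hosts file. This is a minimal sketch, assuming the file contains one line per resolved address; `build_pinned_hosts` and `default_resolve` are hypothetical names, and a real file would likely carry extra entries (such as localhost) that are omitted here.

```python
import socket


def default_resolve(host):
    """Resolve a hostname to a sorted list of unique IP address strings."""
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def build_pinned_hosts(hostnames, resolve=default_resolve):
    """Resolve each distinct hostname exactly once and render hosts-file lines."""
    pinned = {}
    for host in hostnames:
        if host not in pinned:            # a host named by several rules is resolved once
            pinned[host] = list(resolve(host))
    lines = [f"{ip} {host}" for host, ips in pinned.items() for ip in ips]
    return pinned, "".join(line + "\n" for line in lines)
```

Passing a custom `resolve` callable keeps the sketch testable without DNS; the `pinned` mapping is what a per-protocol allowlist check would compare destinations against.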
Wildcards. Hostnames are matched literally: --net-allow *.example.com:443 is not supported, so list each domain you need.
The * token is allowed as the host (alias for empty: *:port ≡
:port) and as the port for TCP/UDP rules (host:*, :*, *:*,
udp://*:*). Mixing * with concrete ports (host:80,*) is
rejected. When any TCP rule uses the all-ports wildcard, Landlock no
longer filters TCP connect at the kernel level (it cannot express
"every port" without enumerating 65535 rules); the on-behalf path
becomes the sole enforcer, and for :* it short-circuits to
allow-all.
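The rule grammar and wildcard constraints above are compact enough to express as a small parser. The sketch below is an independent illustration, not Sandlock's parser: `Rule`, `parse_net_allow`, and `permits` are hypothetical names, IPv6 literals are not handled, and hosts are compared as strings, whereas the real check runs against resolved IPs.

```python
from typing import FrozenSet, NamedTuple, Optional


class Rule(NamedTuple):
    proto: str                        # "tcp", "udp" or "icmp"
    host: Optional[str]               # None means any host
    ports: Optional[FrozenSet[int]]   # None means any port (always None for ICMP)


def parse_net_allow(spec: str) -> Rule:
    proto, sep, rest = spec.partition("://")
    if not sep:
        proto, rest = "tcp", spec     # the bare form is TCP
    if proto not in ("tcp", "udp", "icmp"):
        raise ValueError(f"unknown scheme: {proto!r}")
    if proto == "icmp":
        if not rest or ":" in rest:
            raise ValueError("icmp rules take a host or '*', and no port")
        return Rule("icmp", None if rest == "*" else rest, None)
    host, sep, port_part = rest.rpartition(":")
    if not sep or not port_part:
        raise ValueError(f"missing port in {spec!r}")
    tokens = port_part.split(",")
    if "*" in tokens:
        if len(tokens) > 1:           # e.g. host:80,* is rejected
            raise ValueError("'*' cannot be mixed with concrete ports")
        ports = None
    else:
        ports = frozenset(int(t) for t in tokens)
        if not all(1 <= p <= 65535 for p in ports):
            raise ValueError(f"port out of range in {spec!r}")
    # "*" as host is an alias for the empty host: *:port is the same as :port
    return Rule(proto, None if host in ("", "*") else host, ports)


def permits(rules, proto, host, port=None) -> bool:
    """True iff some rule matches the protocol, the destination host, and the port."""
    return any(
        r.proto == proto
        and (r.host is None or r.host == host)
        and (proto == "icmp" or r.ports is None or port in r.ports)
        for r in rules
    )
```

An empty rule list makes `permits` return False for every destination, which matches the deny-all default.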
Implementation. Two enforcement paths:
- Direct path: pure :port TCP policies (no concrete host) and no HTTP ACL. Landlock enforces the TCP port allowlist at the kernel level; no per-syscall overhead. UDP and ICMP are not covered by Landlock and always use the on-behalf path when allowed.
- On-behalf path: any concrete host, any HTTP ACL rule, or any UDP / ICMP rule. Seccomp traps connect(), sendto(), sendmsg(), and sendmmsg(); the supervisor dups the child fd, queries getsockopt(SOL_SOCKET, SO_PROTOCOL) to learn whether the socket is TCP / UDP / ICMP, then checks the destination against that protocol's resolved allowlist before performing the syscall. The HTTP/HTTPS proxy redirect (when configured) happens here too.
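The choice between the two paths follows mechanically from the rule set. A minimal sketch of that decision, assuming rules are already parsed into `(proto, host, ports)` tuples where `None` stands for a wildcard; `enforcement_path` and its return labels are hypothetical, not names used by Sandlock.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

# (proto, host, ports); host None = any host, ports None = all-ports wildcard
Rule = Tuple[str, Optional[str], Optional[FrozenSet[int]]]


def enforcement_path(rules: Iterable[Rule], http_acl: bool = False) -> str:
    rules = list(rules)
    if not rules and not http_acl:
        return "deny-all"                 # default: no outbound traffic at all
    for proto, host, ports in rules:
        # UDP/ICMP rules, concrete hosts, and the TCP all-ports wildcard
        # cannot be expressed as Landlock port rules alone.
        if proto != "tcp" or host is not None or ports is None:
            return "on-behalf"
    return "on-behalf" if http_acl else "direct"
```

Only a policy made purely of any-host TCP port rules, with no HTTP ACL, stays on the direct path and avoids per-syscall supervisor work.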
HTTP / HTTPS interception. --http-allow / --http-deny route
matching ports through a transparent proxy. Each rule with a concrete
host auto-extends --net-allow with host:80 (and host:443 when
--http-ca is set) so the proxy's intercept ports are reachable;
wildcard hosts auto-add :80 / :443 (any IP). All auto-added
entries are TCP. HTTPS MITM is opt-in: pass --http-ca <cert> and
--http-key <key> for a CA you generate and trust inside the
sandbox (typically install the cert into the workload's
/etc/ssl/certs/). Without --http-ca, port 443 is not intercepted
— --net-allow host:443 permits raw TLS to the host with no content
inspection.
Bind. --net-bind <port> is independent from --net-allow and
governs server-side bind(). Landlock enforces it (TCP only);
--port-remap adds on-behalf virtualization for binding.
AF_UNIX sockets are governed by Landlock's
LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET, independent from --net-allow.
Port Virtualization
Each sandbox gets a full virtual port space. Multiple sandboxes can bind
the same port without conflicts. The supervisor performs bind() on behalf
of the child via pidfd_getfd (TOCTOU-safe). When a port conflicts, a
different real port is allocated transparently. /proc/net/tcp is filtered
to only show the sandbox's own ports.
When --port-remap is enabled, the sandbox registers its state in a
shared registry (/dev/shm). Use sandlock list to see all running
sandboxes and sandlock kill to stop them:
$ sandlock list
NAME PID PORTS
api.local 12345 8080
web.local 12346 8080 -> 35299
$ sandlock kill web.local
Killed sandbox 'web.local' (PID 12346)
This enables external reverse proxies (nginx, envoy) to route traffic by name to the correct real port.
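The remapping behavior in this section can be modeled without real sockets. The sketch below is a simplified model, not the supervisor's logic: `HostPorts` and `VirtualPorts` are hypothetical names, and the fallback port is drawn from a simple counter, whereas a real bind would let the kernel choose a free port.

```python
import itertools


class HostPorts:
    """Stand-in for the host's real port space, shared by all sandboxes."""

    def __init__(self, ephemeral_start: int = 32768):
        self.in_use = set()
        self._fallback = itertools.count(ephemeral_start)

    def claim(self, wanted: int) -> int:
        port = wanted
        while port in self.in_use:        # conflict: pick a different real port
            port = next(self._fallback)
        self.in_use.add(port)
        return port


class VirtualPorts:
    """Per-sandbox view: the child always sees the port it asked for."""

    def __init__(self, host: HostPorts):
        self.host = host
        self.mapping = {}                 # virtual port -> real port

    def bind(self, virtual: int) -> int:
        real = self.host.claim(virtual)
        self.mapping[virtual] = real
        return real

    def to_virtual(self, real: int) -> int:
        """Translate a real port back, as a getsockname() handler would."""
        for virtual, mapped in self.mapping.items():
            if mapped == real:
                return virtual
        raise KeyError(real)

    def visible_ports(self):
        """Ports this sandbox would see in its filtered /proc/net/tcp."""
        return sorted(self.mapping)
```

With one shared `HostPorts`, the first sandbox to bind 8080 gets the real port 8080 and the second is remapped, yet both report 8080 from `visible_ports()`.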
Performance
Benchmarked on a typical Linux workstation:
| Workload | Bare metal | Sandlock | Docker | Sandlock overhead |
|---|---|---|---|---|
| /bin/echo startup | 2 ms | 7 ms | 307 ms | 5 ms (44x faster than Docker) |
| Redis SET (100K ops) | 82K rps | 80K rps | 52K rps | 97.1% of bare metal |
| Redis GET (100K ops) | 79K rps | 77K rps | 53K rps | 97.1% of bare metal |
| Redis p99 latency | 0.5 ms | 0.6 ms | 1.5 ms | ~2.5x lower than Docker |
| COW fork ×1000 | — | 530 ms | — | 530μs/fork, ~1,900 forks/sec |
Testing
# Rust tests
# Python tests
&& &&
Sandbox Reference
The full Sandbox configuration reference — every field, default,
and grouping — lives in docs/sandbox-reference.md.