Sandlock
Lightweight process sandbox for Linux. Confines untrusted code using Landlock (filesystem + network + IPC), seccomp-bpf (syscall filtering), and seccomp user notification (resource limits, IP enforcement, /proc virtualization). No root, no cgroups, no containers.
sandlock run -w /tmp -r /usr -r /lib -m 512M -- python3 untrusted.py
Why Sandlock?
Containers and VMs are powerful but heavy. Sandlock targets the gap: strict confinement without image builds or root privileges. A built-in copy-on-write (COW) filesystem protects your working directory automatically.
| Feature | Sandlock | Container | MicroVM (Firecracker) |
|---|---|---|---|
| Root required | No | Yes* | Yes (KVM) |
| Image build | No | Yes | Yes |
| Startup time | ~5 ms | ~200 ms | ~100 ms |
| Kernel | Shared | Shared | Separate guest |
| Filesystem isolation | Landlock + seccomp COW | Overlay | Block-level |
| Network isolation | Landlock + seccomp notif | Network namespace | TAP device |
| Syscall filtering | seccomp-bpf | seccomp | N/A |
| Resource limits | seccomp notif + SIGSTOP | cgroup v2 | VM config |
* Rootless containers exist but require user namespace support and /etc/subuid configuration.
Architecture
Sandlock is implemented in Rust for performance and safety:
- sandlock-core: Rust library (Landlock, seccomp, supervisor, COW, pipeline)
- sandlock-cli: Rust CLI binary (sandlock run ...)
- sandlock-ffi: C ABI shared library (libsandlock_ffi.so)
- Python SDK: ctypes bindings to the FFI library
┌─────────────┐
│ Python SDK  │  ctypes FFI
│ (sandlock)  │──────────────┐
└─────────────┘              │
                             ▼
┌──────────────┐    ┌──────────────────────────────┐
│ sandlock CLI │───>│ libsandlock_ffi.so           │
└──────────────┘    └──────────────┬───────────────┘
                                   │
                    ┌──────────────▼───────────────┐
                    │ sandlock-core                │
                    │ Landlock · seccomp · COW ·   │
                    │ pipeline · policy_fn · vDSO  │
                    └──────────────────────────────┘
Requirements
- Linux 6.12+ (Landlock ABI v6), Rust 1.70+ (to build)
- Python 3.8+ (optional, for Python SDK)
- No root, no cgroups
| Feature | Minimum kernel |
|---|---|
| seccomp user notification | 5.6 |
| Landlock filesystem rules | 5.13 |
| Landlock TCP port rules | 6.7 (ABI v4) |
| Landlock IPC scoping | 6.12 (ABI v6) |
Install
From source
# Build the Rust binary and shared library
# Install Python SDK (auto-builds Rust FFI library)
&&
CLI only
Quick Start
CLI
# Basic confinement
# Interactive shell
# Resource limits + timeout
# Domain-based network isolation
# TCP port restrictions (Landlock)
# IPC scoping + clean environment
# Deterministic execution (frozen time + seeded randomness)
# Port virtualization (multiple sandboxes can bind the same port)
# COW filesystem (writes captured, committed on success)
# Use a saved profile
Python API
=
# Run a command
=
assert
assert b in
Pipeline
Chain sandboxed stages with the | operator — each stage has its own
independent policy. Data flows through kernel pipes.
=
=
# Reader can access data, processor cannot
=
assert b in
XOA pattern (eXecute Over Architecture) — planner generates code, executor runs it with data access but no network:
=
=
=
Dynamic Policy (policy_fn)
Inspect syscall events at runtime and adjust permissions on the fly. Each event includes rich metadata: path, host, port, argv, category, parent PID. The callback returns a verdict to allow, deny, or audit.
# Block download tools
return True # deny
# Custom errno for sensitive files
return
# Restrict network after setup phase
# Audit file access (allow but flag)
return
return 0 # allow
=
=
Verdicts: 0/False = allow, True/-1 = deny (EPERM),
positive int = deny with errno, "audit"/-2 = allow + flag.
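The verdict encoding above can be made concrete with a small normalizer. This is only a sketch of the mapping as documented here, not Sandlock's internal code: the function name `normalize_verdict`, the returned tuple shape, and the choice to raise on unknown values are all assumptions made for illustration.

```python
import errno


def normalize_verdict(value):
    """Map a policy_fn return value to (action, errno_or_None, audited).

    Hypothetical helper: it mirrors the documented verdict table only.
    """
    # "audit" or -2: allow the syscall but flag it.
    if value == "audit" or (value == -2 and not isinstance(value, bool)):
        return ("allow", None, True)
    # True or -1: deny with EPERM. True is tested by identity because
    # bool is a subclass of int in Python (True == 1).
    if value is True or value == -1:
        return ("deny", errno.EPERM, False)
    # False or 0: allow.
    if value is False or value == 0:
        return ("allow", None, False)
    # Any other positive integer: deny with that errno.
    if isinstance(value, int) and value > 0:
        return ("deny", value, False)
    # Assumption: anything else is treated as a programming error.
    raise ValueError(f"unsupported verdict: {value!r}")
```

Note that a plain `1` and `True` end up equivalent, since errno 1 is EPERM on Linux.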
Event fields: syscall, category (file/network/process/memory),
pid, parent_pid, path, host, port, argv, denied.
Context methods:
- ctx.restrict_network(ips) / ctx.grant_network(ips): network control
- ctx.restrict_max_memory(bytes) / ctx.restrict_max_processes(n): resource limits
- ctx.deny_path(path) / ctx.allow_path(path): dynamic filesystem restriction
- ctx.restrict_pid_network(pid, ips): per-PID network override
Held syscalls (child blocked until callback returns): execve,
connect, sendto, bind, openat.
Rust API
use ;
// Basic run
let policy = builder
.fs_read.fs_read
.fs_write
.max_memory
.build?;
let result = run.await?;
assert!;
// Pipeline
let result = .run.await?;
// Dynamic policy
use Verdict;
let policy = builder
.fs_read.fs_read
.policy_fn
.build?;
Profiles
Save reusable policies as TOML files in ~/.config/sandlock/profiles/:
# ~/.config/sandlock/profiles/build.toml
= ["/tmp/work"]
= ["/usr", "/lib", "/lib64", "/bin", "/etc"]
= true
= true
= "512M"
= 50
[]
= "gcc"
= "C.UTF-8"
How It Works
Sandlock applies confinement in sequence after fork():
Parent                              Child
  │  fork()                           │
  │──────────────────────────────────>│
  │                                   ├─ 1. setpgid(0,0)
  │                                   ├─ 2. Optional: chdir(workdir)
  │                                   ├─ 3. NO_NEW_PRIVS
  │                                   ├─ 4. Landlock (fs + net + IPC)
  │                                   ├─ 5. seccomp filter (deny + notif)
  │                                   │     └─ send notif fd ──> Parent
  │  receive notif fd                 ├─ 6. Wait for "ready" signal
  │  start supervisor (tokio)         ├─ 7. Close fds 3+
  │  optional: vDSO patching          └─ 8. exec(cmd)
  │  optional: policy_fn thread
  │  optional: CPU throttle task
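The ordering of the child-side steps can be illustrated with a short Linux-only Python sketch. It covers steps 1 to 3 and 6 to 8 only; the Landlock and seccomp steps (4 and 5) are deliberately left out because they need dedicated syscalls. The function name `spawn_confined`, the use of a pipe as the "ready" signal, and the fixed fd bound are assumptions for illustration, not Sandlock's actual implementation.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38  # constant from <linux/prctl.h>


def spawn_confined(argv, workdir=None):
    """Fork, apply steps 1-3 and 6-8 in the child, return the exit code."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(ready_w)
            os.setpgid(0, 0)                      # 1. own process group
            if workdir is not None:
                os.chdir(workdir)                 # 2. optional chdir
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0:
                os._exit(126)                     # 3. NO_NEW_PRIVS failed
            # Steps 4-5 (Landlock, seccomp filter) are omitted here.
            os.read(ready_r, 1)                   # 6. wait for "ready"
            os.closerange(3, 4096)                # 7. close fds 3..4095
            os.execvp(argv[0], argv)              # 8. exec the command
        except BaseException:
            pass
        os._exit(127)  # reached only if a step or the exec failed

    # Parent: a real supervisor would be started before signalling.
    os.close(ready_r)
    try:
        os.write(ready_w, b"1")
    except BrokenPipeError:
        pass  # child exited before reading the signal
    os.close(ready_w)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Setting NO_NEW_PRIVS before installing filters matters because an unprivileged process may only install a seccomp filter once that flag is set.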
Seccomp Supervisor
The async notification supervisor (tokio) handles intercepted syscalls:
| Syscall | Handler |
|---|---|
| clone/fork/vfork | Process count enforcement |
| mmap/munmap/brk/mremap | Memory limit tracking |
| connect/sendto/sendmsg | IP allowlist + on-behalf execution |
| bind | On-behalf bind + port remapping |
| openat | /proc virtualization, COW interception |
| unlinkat/mkdirat/renameat2 | COW write interception |
| execve/execveat | policy_fn hold + vDSO re-patching |
| getrandom | Deterministic PRNG injection |
| clock_nanosleep/timer_settime | Timer adjustment for frozen time |
| getdents64 | PID filtering, COW directory merging |
| getsockname | Port remap translation |
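The table is essentially a dispatch from syscall name to handler. A minimal synchronous sketch of that idea, covering only process-count and memory accounting, might look as follows. The class name `MiniSupervisor`, the handler signatures, and the errno values returned (EAGAIN, ENOMEM) are assumptions; the real supervisor is asynchronous and reads its arguments from seccomp notifications.

```python
import errno


class MiniSupervisor:
    """Toy dispatcher: returns 0 to allow a syscall, or an errno to deny it."""

    def __init__(self, max_processes, max_memory):
        self.max_processes = max_processes
        self.max_memory = max_memory
        self.processes = 1  # the sandboxed root process
        self.memory = 0     # bytes currently accounted for
        self.handlers = {
            "clone": self._on_fork,
            "fork": self._on_fork,
            "vfork": self._on_fork,
            "mmap": self._on_mmap,
            "munmap": self._on_munmap,
        }

    def handle(self, syscall, **args):
        handler = self.handlers.get(syscall)
        # Syscalls without a handler are simply allowed in this sketch.
        return handler(**args) if handler else 0

    def _on_fork(self):
        if self.processes >= self.max_processes:
            return errno.EAGAIN  # assumed errno for "too many processes"
        self.processes += 1
        return 0

    def _on_mmap(self, length):
        if self.memory + length > self.max_memory:
            return errno.ENOMEM  # assumed errno for "over memory limit"
        self.memory += length
        return 0

    def _on_munmap(self, length):
        self.memory = max(0, self.memory - length)
        return 0
```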
COW Filesystem
Two modes of copy-on-write filesystem isolation:
Seccomp COW (default when workdir is set): Intercepts filesystem
syscalls via seccomp notification. Writes go to an upper directory;
reads resolve upper-then-lower. No mount namespace, no root. Changes are
committed when the command exits successfully and discarded on error.
OverlayFS COW: Uses kernel OverlayFS in a user namespace. Requires unprivileged user namespaces to be enabled.
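The upper-then-lower resolution used by the seccomp COW mode can be sketched in plain Python. This models only the lookup, write redirection, commit, and abort semantics described above; the class name `CowLayer` and its methods are hypothetical, and deletions (whiteouts) and directory merging are not modeled.

```python
import os
import shutil


class CowLayer:
    """Toy copy-on-write view: writes land in `upper`, reads fall back to `lower`."""

    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def resolve(self, rel):
        # Prefer the upper copy if one exists, otherwise use the lower one.
        candidate = os.path.join(self.upper, rel)
        return candidate if os.path.exists(candidate) else os.path.join(self.lower, rel)

    def read(self, rel):
        with open(self.resolve(rel), "rb") as f:
            return f.read()

    def write(self, rel, data):
        # Writes never touch the lower directory directly.
        dst = os.path.join(self.upper, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        with open(dst, "wb") as f:
            f.write(data)

    def commit(self):
        # On success: copy captured writes over the lower directory.
        shutil.copytree(self.upper, self.lower, dirs_exist_ok=True)
        self.abort()

    def abort(self):
        # On error: drop captured writes, leaving the lower directory intact.
        shutil.rmtree(self.upper)
        os.makedirs(self.upper)
```

Until `commit()` runs, the lower directory is unchanged, which is why a failed run can be discarded without touching the working directory.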
COW Fork & Map-Reduce
Initialize expensive state once, then fork COW clones that share memory.
Each fork uses raw fork(2) (bypasses seccomp notification) for minimal
overhead. 1000 clones in ~530ms, ~1,900 forks/sec.
Each clone's stdout is captured via its own pipe. reduce() reads all
pipes and feeds combined output to a reducer's stdin — fully pipe-based
data flow with no temp files.
global ,
= # 2 GB, loaded once
=
=
# stdout → per-clone pipe
# Map: fork 4 clones with separate policies
=
=
# Reduce: pipe clone outputs to reducer stdin
=
# b"total\n"
let mut mapper = new_with_fns?;
let mut clones = mapper.fork.await?;
let reducer = new?;
let result = reducer.reduce.await?;
Map and reduce run in separate sandboxes with independent policies —
the mapper has data access, the reducer doesn't. Each clone inherits
Landlock + seccomp confinement. CLONE_ID=0..N-1 is set automatically.
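The pipe-based data flow can be approximated with plain `os.fork` and `os.pipe`, without any sandboxing. The sketch below is not the Sandlock API: `map_reduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the reducer runs in the parent rather than in a separate sandbox. It only shows how state initialized before the fork is shared copy-on-write and how each clone's output travels through its own pipe.

```python
import os


def map_reduce(n_clones, map_fn, reduce_fn):
    """Fork n_clones children, collect each one's bytes via its own pipe."""
    readers = []
    for clone_id in range(n_clones):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: inherits the parent's memory copy-on-write
            status = 1
            try:
                os.close(r)
                for other_r, _ in readers:
                    os.close(other_r)  # drop read ends of earlier clones
                os.environ["CLONE_ID"] = str(clone_id)
                with os.fdopen(w, "wb") as out:
                    out.write(map_fn(clone_id))
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        os.close(w)
        readers.append((r, pid))

    chunks = []
    for r, pid in readers:
        with os.fdopen(r, "rb") as pipe:
            chunks.append(pipe.read())  # read until the clone closes its end
        os.waitpid(pid, 0)
    return reduce_fn(chunks)
```

For example, with a list loaded once in the parent, `map_fn` could sum the slice `data[clone_id::n_clones]` and `reduce_fn` could add the partial sums.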
Port Virtualization
Each sandbox gets a full virtual port space. Multiple sandboxes can bind
the same port without conflicts. The supervisor performs bind() on behalf
of the child via pidfd_getfd (TOCTOU-safe). When a port conflicts, a
different real port is allocated transparently. /proc/net/tcp is filtered
to only show the sandbox's own ports.
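The remapping logic can be modeled without real sockets. The following sketch keeps a per-sandbox table from virtual to real ports over a shared set of host ports. `PortTable`, its method names, and the fallback range starting at 40000 are hypothetical; how Sandlock actually picks the replacement port is not described here.

```python
import errno


class PortTable:
    """Toy per-sandbox port map over a shared set of in-use host ports."""

    def __init__(self, host_in_use, fallback_start=40000):
        self.host_in_use = host_in_use      # shared across sandboxes
        self.fallback_start = fallback_start
        self.virt_to_real = {}

    def bind(self, virtual_port):
        # Within one sandbox, binding the same port twice still fails.
        if virtual_port in self.virt_to_real:
            raise OSError(errno.EADDRINUSE, "address already in use")
        real = virtual_port
        if real in self.host_in_use:
            # Conflict on the host: pick another real port transparently.
            real = next(
                p for p in range(self.fallback_start, 65536)
                if p not in self.host_in_use
            )
        self.host_in_use.add(real)
        self.virt_to_real[virtual_port] = real
        return real

    def visible_port(self, real_port):
        # What getsockname should report inside the sandbox.
        for virt, real in self.virt_to_real.items():
            if real == real_port:
                return virt
        return real_port
```

Two tables sharing one host set can both bind virtual port 8080: the first gets real port 8080, the second gets a fallback port but still sees 8080 through `visible_port`.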
Performance
Benchmarked on a typical Linux workstation:
| Workload | Bare metal | Sandlock | Docker | Sandlock overhead |
|---|---|---|---|---|
| /bin/echo startup | 2 ms | 7 ms | 307 ms | 5 ms (44x faster than Docker) |
| Redis SET (100K ops) | 82K rps | 80K rps | 52K rps | 97.1% of bare metal |
| Redis GET (100K ops) | 79K rps | 77K rps | 53K rps | 97.1% of bare metal |
| Redis p99 latency | 0.5 ms | 0.6 ms | 1.5 ms | ~2.5x lower than Docker |
| COW fork ×1000 | — | 530 ms | — | 530μs/fork, ~1,900 forks/sec |
Testing
# Rust tests
# Python tests
&& &&
Policy Reference