# llmux
LLM multiplexer for vLLM. Host multiple models on a single GPU, switching between them on demand using vLLM's sleep/wake API.
When a request arrives for a model that isn't currently loaded, llmux puts the
active model to sleep (freeing GPU memory) and wakes the requested model. The
OpenAI-compatible API stays up throughout - clients just change the model
field.
## How it works
```
Client requests
      |
  +---------+
  |  llmux  |   port 3000 (OpenAI-compatible)
  +---------+
    /     \
[vLLM:8001]  [vLLM:8002]
 (active)     (sleeping)
```
llmux spawns vLLM processes lazily on first request and manages their lifecycle. Only one model is active at a time - the rest are sleeping (weights offloaded to CPU or discarded) or stopped entirely.
## Sleep levels
| Level | Sleep | Wake | GPU freed | CPU RAM | State preserved | Use case |
|---|---|---|---|---|---|---|
| L1 | Slow (offload to CPU) | Fast (~1s) | Most | High (holds weights) | Partial | Model you expect to return to soon |
| L2 | Fast (~1s) | Slow (reload from disk) | Most | None | No (KV cache, CUDA graphs lost) | Model you may not need for a while |
| L3 (CUDA suspend) | Fast (~3s, ~15s at TP=2) | Fast (~3s, ~7s at TP=2) | All (100%) | High (holds VRAM) | Full | Like L1, but frees 100% GPU |
| L4 (CRIU) | ~27s (checkpoint to disk) | ~15s (restore) | All (100%) | None | Full (KV cache, CUDA graphs, allocator) | Many models; lowest resource usage while sleeping |
| L5 | Kill process | Cold start | All | None | No | Fallback / cleanup |
If L1-L4 sleep fails, llmux automatically escalates to L5 (kill) to guarantee GPU memory is freed.
### CUDA suspend (level 3)

Uses `cuda-checkpoint --toggle` to suspend CUDA state and copy VRAM to host
RAM. The process stays alive: no serialization, no CRIU. Wake is just another
toggle that copies state back to the GPU.
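For illustration, each direction is a single `cuda-checkpoint` invocation; llmux runs this itself, so you normally never call it by hand:

```bash
# Suspend: CUDA state and VRAM contents are copied to host RAM
cuda-checkpoint --toggle --pid <vllm_worker_pid>

# Wake: the same toggle copies everything back to the GPU
cuda-checkpoint --toggle --pid <vllm_worker_pid>
```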
Like L1, this holds state in CPU RAM. Unlike L1, it frees 100% of GPU memory (L1 keeps ~500 MiB for CUDA context) and preserves full state.
For TP>1, llmux coordinates NCCL teardown before checkpoint and rebuild after
restore. This requires patched vLLM with `suspend_nccl`/`resume_nccl` support
(included in the Docker image).
Requirements:

- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `cuda-checkpoint`)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`
### CRIU checkpoint (level 4)

CRIU checkpointing uses `cuda-checkpoint` and `criu` to snapshot the entire
vLLM process tree to disk, then kill it. On restore, CRIU brings the process
back with all state intact, including GPU VRAM contents, KV cache, CUDA
graphs, and the warmed memory allocator. The first inference after restore takes
~30ms (no warmup needed).
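A simplified sketch of the kind of commands involved; llmux drives this itself, the exact options it passes are not shown here, and the paths follow the defaults from the checkpoint config section below:

```bash
# Checkpoint the vLLM process tree to disk; the CUDA plugin loaded from the
# libdir handles GPU state via cuda-checkpoint
criu dump -t <vllm_pid> -D /tmp/llmux-checkpoints/<model> -L /usr/lib/criu/

# Later, to wake: restore the process tree, GPU state included
criu restore -D /tmp/llmux-checkpoints/<model> -L /usr/lib/criu/
```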
Requirements:

- CRIU 4.x with the CUDA plugin (`libcuda_plugin.so`) (included in Docker image)
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `criu` and `cuda-checkpoint`)
- The vLLM process must not use `io_uring` or `libuv` (set automatically)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`
Trade-offs vs L1/L2:
- Slower sleep/wake than L1, but no CPU RAM cost and full state preservation
- Slower wake than L1 but faster first-inference (no warmup)
## Quickstart

Create a `config.json`:
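A minimal sketch, assuming two served model names (`qwen-14b`, `gemma-12b`). The exact top-level layout, in particular how models are keyed, is an assumption here; check the repository's example config for the authoritative schema:

```json
{
  "port": 3000,
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 1
    },
    "gemma-12b": {
      "model_path": "google/gemma-3-12b-it",
      "port": 8002,
      "sleep_level": 2
    }
  }
}
```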
### With Docker (recommended)
The Docker image bundles vLLM v0.15.1 with patches for NCCL suspend/resume and sleep mode fixes:
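A sketch of a basic invocation. The mount paths mirror the Docker Compose examples below; whether the image picks up `/etc/llmux/config.json` by default is an assumption:

```bash
docker run --gpus all \
  -p 3000:3000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd)/config.json:/etc/llmux/config.json:ro" \
  ghcr.io/doublewordai/llmux:latest
```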
For sleep levels 3 and 4 (cuda-checkpoint/CRIU), additional flags are required:
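A sketch with those flags added; the checkpoint mount target assumes the default `images_dir` of `/tmp/llmux-checkpoints`:

```bash
docker run --gpus all --privileged --pid=host --ipc=host \
  -p 3000:3000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd)/config.json:/etc/llmux/config.json:ro" \
  -v /tmp/llmux-checkpoints:/tmp/llmux-checkpoints \
  ghcr.io/doublewordai/llmux:latest
```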
The extra flags are needed because:

- `--privileged`: CRIU requires broad namespace and ptrace access
- `--pid=host`: cuda-checkpoint needs to ptrace vLLM worker PIDs
- `--ipc=host`: NCCL uses shared memory for inter-GPU communication
- `-v /tmp/llmux-checkpoints:...`: CRIU checkpoints can be tens of GB; mount a host volume to avoid filling the container filesystem
**Important:** Do NOT use `--init` with CRIU (sleep level 4). Docker's init process (`tini`)
redirects stdin to the host's `/dev/null`, whose mount ID is invisible inside the container,
so the CRIU dump fails with `Can't lookup mount=N for fd=0 path=/dev/null`.
### From source

Requires vLLM installed and available as `vllm` on `PATH`:
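A sketch, assuming a standard Cargo layout; the CLI argument shape is an assumption, so check `llmux --help`:

```bash
# Build the release binary, then point it at your config
cargo build --release
./target/release/llmux --config config.json   # exact flags may differ
```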
For sleep levels 3 and 4, you also need `cuda-checkpoint` and `criu` (with the
CUDA plugin) installed, and either run as root or configure passwordless `sudo`.
### Send requests

The first request for a model starts its vLLM process. Requesting a different
model puts the active one to sleep and starts the newly requested one.
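For example, with the model names from the config sketch above (the proxy exposes the standard OpenAI chat completions route):

```bash
# First request starts vLLM for qwen-14b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

# Switching: sleeps qwen-14b, starts gemma-12b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-12b", "messages": [{"role": "user", "content": "Hello"}]}'
```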
## Configuration

### Model options
| Field | Default | Description |
|---|---|---|
| `model_path` | required | HuggingFace model ID or local path |
| `port` | required | Port for this model's vLLM instance |
| `sleep_level` | `5` | Sleep level (1-2: vLLM, 3: CUDA suspend, 4: CRIU, 5: stop) |
| `extra_args` | `[]` | Additional vLLM CLI arguments |
All vLLM-specific flags (e.g. `--gpu-memory-utilization`, `--tensor-parallel-size`,
`--dtype`) should be passed via `extra_args`:
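For example, a sketch of a single model entry; the flag values are placeholders:

```json
{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "sleep_level": 1,
  "extra_args": ["--gpu-memory-utilization", "0.90", "--dtype", "bfloat16"]
}
```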
### Tensor parallelism with L3/L4

When using sleep levels 3 or 4 with TP>1, you must include `--enforce-eager`
and `--disable-custom-all-reduce` in `extra_args`:
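A sketch of such a model entry (model path and port are placeholders):

```json
{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "sleep_level": 4,
  "extra_args": [
    "--tensor-parallel-size", "2",
    "--enforce-eager",
    "--disable-custom-all-reduce"
  ]
}
```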
These flags are needed because:

- `--enforce-eager`: CUDA graphs hold stale NCCL handles and crash on resume
- `--disable-custom-all-reduce`: CustomAllReduce IPC buffers cannot survive cuda-checkpoint
llmux validates the config at startup and warns if these flags are missing.
### Top-level options
| Field | Default | Description |
|---|---|---|
| `port` | `3000` | Proxy listen port |
| `metrics_port` | `9090` | Prometheus metrics port (`0` to disable) |
| `vllm_command` | `"vllm"` | vLLM binary path |
### Checkpoint config (for sleep levels 3 and 4)

To use cuda-checkpoint or CRIU, add a `checkpoint` section to your config:
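A sketch for CUDA suspend (level 3), using the default path from the table below:

```json
{
  "checkpoint": {
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}
```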
For CRIU checkpointing (level 4), the full config is:
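A sketch using the defaults from the table below:

```json
{
  "checkpoint": {
    "cuda_checkpoint_path": "cuda-checkpoint",
    "criu_path": "criu",
    "cuda_plugin_dir": "/usr/lib/criu/",
    "images_dir": "/tmp/llmux-checkpoints"
  }
}
```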
| Field | Default | Description |
|---|---|---|
| `criu_path` | `"criu"` | Path to the `criu` binary |
| `cuda_plugin_dir` | `"/usr/lib/criu/"` | Directory containing `libcuda_plugin.so` |
| `images_dir` | `"/tmp/llmux-checkpoints"` | Base directory for checkpoint images |
| `cuda_checkpoint_path` | `"cuda-checkpoint"` | Path to the `cuda-checkpoint` utility |
### vLLM logging

vLLM process output (stdout/stderr) is always captured and forwarded to the
`vllm` tracing target at debug level. Use `RUST_LOG` to control visibility:
```bash
# Default: only llmux info logs, vLLM output hidden
RUST_LOG=info

# Show vLLM output
RUST_LOG=info,vllm=debug

# --verbose includes vLLM output automatically
```
ANSI color codes are stripped from vLLM output. The `NO_COLOR=1` environment
variable is also set on spawned vLLM processes.
### Policy options
| Field | Default | Description |
|---|---|---|
| `policy_type` | `"fifo"` | Switching policy |
| `request_timeout_secs` | `60` | Request timeout |
| `drain_before_switch` | `true` | Wait for in-flight requests before sleeping |
| `sleep_level` | `5` | Default sleep level for policy |
## Validation

llmux includes a built-in validation tool that tests sleep/wake cycles against a
running model, verifying that GPU memory is freed and that responses are
deterministic after wake:
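The exact invocation here is a sketch; `--validate` is the flag referenced under Known issues, but check `llmux --help` for the real syntax:

```bash
llmux --config config.json --validate
```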
Output:

```
Level  Sleep (s)  Wake (s)  GPU Before  GPU After  GPU Wake   Response  Pass
-----------------------------------------------------------------------------
L1     35.9       1.2       45899 MiB   1341 MiB   44033 MiB  match     OK
L2     0.3        8.2       44033 MiB   1341 MiB   44033 MiB  match     OK

Result: ALL PASSED
```
## Docker Compose

### Basic setup
```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command:
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```
### With cuda-checkpoint/CRIU (sleep levels 3 and 4)
```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true  # remove for sleep level 4 (CRIU); see the --init note above
    command:
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```
### With onwards (API key auth)
For production, put onwards in front for API key authentication:
```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true  # remove for sleep level 4 (CRIU); see the --init note above
    command:
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  onwards:
    image: ghcr.io/doublewordai/onwards:latest
    command:
    volumes:
      - ./targets.json:/etc/onwards/targets.json:ro
    ports:
      - "3000:3000"
```
Here `targets.json` maps model names to the llmux endpoint along with API keys; see the onwards documentation for the file format.
## Tensor parallelism
All sleep levels work with TP>1:
| Level | TP>1 | Notes |
|---|---|---|
| L1 | Yes | vLLM manages NCCL teardown/rebuild internally |
| L2 | Yes | Same — vLLM handles it |
| L3 | Yes | llmux tears down NCCL before cuda-checkpoint, rebuilds after restore |
| L4 | Yes | Same — NCCL teardown before checkpoint, rebuild after restore |
| L5 | Yes | Kill + cold restart always works |
For L3 and L4, llmux uses vLLM's `/collective_rpc` endpoint to call
`suspend_nccl` (before cuda-checkpoint) and `resume_nccl` (after restore)
on all TP workers. This tears down the NCCL IPC handles that cuda-checkpoint
cannot checkpoint, then rebuilds them after CUDA state is restored.
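Illustrative only: llmux issues these RPCs itself, and the request body shown here is an assumption about the endpoint's schema, which may differ between vLLM versions:

```bash
# Hypothetical request against the first TP worker's vLLM server (port 8001)
curl -X POST http://localhost:8001/collective_rpc \
  -H "Content-Type: application/json" \
  -d '{"method": "suspend_nccl"}'
```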
This requires patched vLLM with `suspend_nccl`/`resume_nccl` support. The
Docker image includes these patches. For bare-metal installs, apply
`patches/nccl-suspend-resume-v0.15.1.patch` to your vLLM installation.
## Known issues

The `--validate` flag exists specifically to catch these kinds of problems
before they hit production.
### vLLM v0.14+ sleep regression
Sleep mode (L1/L2) has a regression where weights are not discarded from GPU
memory (vllm#32714).
The Docker image includes a patch (`fix-sleep-mode-v0.15.1.patch`) that fixes
this. For bare-metal installs, apply the patch or use L3/L5 instead.
### vLLM v0.13.0

- `openai/gpt-oss-20b`: L2 reload fails. The MXFP4 weight loader crashes on wake with `default_weight_loader() got an unexpected keyword argument 'weight_name'`. L1 works fine (19.6s sleep, 0.6s wake). Use L1 for this model.
- L1 and L2 both work correctly for `Qwen/Qwen3-14B` and `google/gemma-3-12b-it`.
### NVIDIA driver requirements

The Docker image uses vLLM v0.15.1, which requires CUDA 12.9 and
nvidia-driver-580 or later. Check your driver version with `nvidia-smi`.
## Compatibility

- L1/L2 sleep: Works with vLLM v0.13.x out of the box. Works with v0.15.1 with the sleep fix patch (included in Docker image).
- L3 CUDA suspend / L4 CRIU checkpoint: Works with vLLM v0.13.x+ with NCCL patches (included in Docker image). Requires `cuda-checkpoint` and CRIU (included in Docker image).
- TP>1 with L3/L4: Requires the vLLM NCCL suspend/resume patches (included in Docker image) plus the `--enforce-eager` and `--disable-custom-all-reduce` flags.
## License
MIT