# llmux
LLM multiplexer for vLLM. Host multiple models on a single GPU, switching between them on demand using vLLM's sleep/wake API.
When a request arrives for a model that isn't currently loaded, llmux puts the
active model to sleep (freeing GPU memory) and wakes the requested model. The
OpenAI-compatible API stays up throughout - clients just change the model
field.
## How it works

```
      Client requests
             |
        +---------+
        |  llmux  |   port 3000 (OpenAI-compatible)
        +---------+
         /       \
 [vLLM:8001]   [vLLM:8002]
   (active)    (sleeping)
```
llmux spawns vLLM processes lazily on first request and manages their lifecycle. Only one model is active at a time - the rest are evicted using a configurable two-axis policy.
## Eviction policy

When a model is evicted, llmux applies two strategies in sequence:

- **Weight strategy** — what vLLM does with model weights (via the sleep API)
- **Process strategy** — what happens to the OS process afterward
### Weight strategy

Applied first, via vLLM's sleep endpoint. Controls what happens to weights and KV cache:

| Strategy | Description |
|---|---|
| `retain` | Nothing happens. Weights and KV cache stay on GPU. |
| `offload` | Weights copied to pinned CPU RAM. KV cache and CUDA graphs discarded. Frees most GPU memory but uses significant host RAM. |
| `discard` | Weights dropped entirely (reloaded from disk on wake). KV cache and CUDA graphs discarded. Frees most GPU memory with no CPU RAM cost. |

Both `offload` and `discard` leave a small CUDA context (~500 MiB) on the GPU.
### Process strategy

Applied second, to the process left behind by the weight strategy:

| Strategy | Description |
|---|---|
| `keep_running` | Process stays as-is. Fast, but whatever's still on GPU stays there. |
| `cuda_suspend` | Snapshots remaining VRAM to host RAM via cuda-checkpoint, freeing 100% of GPU memory. Process stays alive. |
| `checkpoint` | CRIU dumps the entire process (including host RAM) to disk, then kills it. Frees 100% of GPU and CPU memory. |
| `stop` | Kills the process. Everything is lost. |
### How they interact

The weight strategy determines what's on GPU vs. CPU before the process strategy runs. This affects speed, memory usage, and what survives a wake cycle:

| Weights ↓ / Process → | `keep_running` | `cuda_suspend` | `checkpoint` | `stop` |
|---|---|---|---|---|
| `retain` | No-op (model stays loaded) | Full VRAM snapshot to CPU — weights, KV cache, CUDA graphs all preserved | Full process checkpoint to disk — large image (includes VRAM) | Everything lost |
| `offload` | Weights on CPU, KV lost, ~500 MiB CUDA context remains on GPU | Remaining CUDA context → CPU, weights already on CPU | Large CRIU image (weights in host RAM get written to disk) | Everything lost |
| `discard` | Weights gone, KV lost, ~500 MiB CUDA context remains on GPU | Remaining CUDA context → CPU, weights gone | Small CRIU image (no weights — reloaded from HF cache on wake) | Everything lost |
Common choices:

- `offload` + `keep_running` — Fast wake (weights already in RAM), but holds CPU memory and ~500 MiB GPU
- `discard` + `keep_running` — No CPU RAM cost, but slow wake (reload from disk) and ~500 MiB GPU
- `retain` + `cuda_suspend` — Frees 100% GPU, full state preserved, but holds all VRAM in CPU RAM
- `discard` + `checkpoint` — Frees 100% GPU and CPU, small CRIU image; wake reloads weights from disk but restores KV cache, CUDA graphs, and warmed allocator from checkpoint
- `offload` + `checkpoint` — Like above but the CRIU image is large (includes weights); wake is faster (no disk reload) but checkpoint is slower and uses more disk
If eviction fails, llmux automatically escalates to `stop` to guarantee GPU memory is freed.
### CUDA suspend

Uses `cuda-checkpoint --toggle` to suspend CUDA state and copy VRAM to host RAM. The process stays alive — no serialization, no CRIU. Wake is just another toggle to copy state back to GPU.
Like offload, this holds state in CPU RAM. Unlike offload, it frees 100% of
GPU memory (offload keeps ~500 MiB for CUDA context) and preserves full state.
For TP>1, llmux coordinates NCCL teardown before checkpoint and rebuild after
restore. This requires patched vLLM with suspend_nccl/resume_nccl support
(included in the Docker image).
Requirements:

- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `cuda-checkpoint`)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`
### CRIU checkpoint
CRIU checkpointing uses cuda-checkpoint and criu to snapshot the entire
vLLM process tree to disk, then kill it. On restore, CRIU brings the process
back with all state intact — including GPU VRAM contents, KV cache, CUDA
graphs, and the warmed memory allocator. First inference after restore is ~30ms
(no warmup needed).
Requirements:

- CRIU 4.x with the CUDA plugin (`libcuda_plugin.so`) (included in Docker image)
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `criu` and `cuda-checkpoint`)
- vLLM process must not use `io_uring` or `libuv` (set automatically)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`
Trade-offs vs `offload`:

- Slower sleep/wake than `offload`, but no CPU RAM cost and full state preservation
- Slower wake than `offload`, but faster first inference (no warmup)
## Quickstart

Create a `config.json`:
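A minimal sketch; the exact schema (in particular whether model entries live under a `models` map keyed by the name clients send in the `model` field) is an assumption. The option tables under Configuration list the documented fields:

```json
{
  "port": 3000,
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "eviction": { "weights": "offload", "process": "keep_running" }
    },
    "gemma-12b": {
      "model_path": "google/gemma-3-12b-it",
      "port": 8002,
      "eviction": { "weights": "discard", "process": "keep_running" }
    }
  }
}
```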
### With Docker (recommended)
The Docker image bundles vLLM v0.15.1 with patches for NCCL suspend/resume and sleep mode fixes:
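A sketch of a basic invocation; the mounts and port mirror the Docker Compose examples below, and it assumes the image's default command reads `/etc/llmux/config.json`:

```bash
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd)/config.json":/etc/llmux/config.json:ro \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```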
For cuda_suspend or checkpoint process strategies, additional flags are required:
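For example (a sketch; the checkpoint mount target assumes the default `images_dir` of `/tmp/llmux-checkpoints`):

```bash
docker run --rm --gpus all \
  --privileged --pid=host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd)/config.json":/etc/llmux/config.json:ro \
  -v /tmp/llmux-checkpoints:/tmp/llmux-checkpoints \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```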
The extra flags are needed because:
- `--privileged` — CRIU requires broad namespace and ptrace access
- `--pid=host` — cuda-checkpoint needs to ptrace vLLM worker PIDs
- `--ipc=host` — NCCL uses shared memory for inter-GPU communication
- `-v /tmp/llmux-checkpoints:...` — CRIU checkpoints can be tens of GB; mount a host volume to avoid filling the container filesystem
**Important:** Do NOT use `--init` with CRIU (the `checkpoint` process strategy). Docker's init process (tini) redirects stdin to the host's `/dev/null`, whose mount ID is invisible inside the container. CRIU dump fails with `Can't lookup mount=N for fd=0 path=/dev/null`.
### From source

Requires vLLM installed and available as `vllm` on `PATH`:
For cuda_suspend or checkpoint process strategies, you also need cuda-checkpoint
and criu (with CUDA plugin) installed and either run as root or configure passwordless sudo.
### Send requests

The first request for a model starts its vLLM instance; requesting a different model sleeps the active one and starts the new one.
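For example, assuming the model names from the sample config above (a sketch; any OpenAI-compatible client works the same way):

```bash
# First request starts vLLM for qwen-14b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

# Switching: sleeps qwen-14b, starts gemma-12b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-12b", "messages": [{"role": "user", "content": "Hello"}]}'
```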
## Configuration

### Model options
| Field | Default | Description |
|---|---|---|
| `model_path` | required | HuggingFace model ID or local path |
| `port` | required | Port for this model's vLLM instance |
| `eviction` | `retain` + `stop` | Eviction policy (see below) |
| `extra_args` | `[]` | Additional vLLM CLI arguments |
| `checkpoint_path` | none | Path to CRIU checkpoint images for lazy restore on first request |
The `eviction` field takes an object with `weights` and `process` keys:
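For example, inside a model entry:

```json
"eviction": { "weights": "offload", "process": "keep_running" }
```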
All vLLM-specific flags (e.g. `--gpu-memory-utilization`, `--tensor-parallel-size`, `--dtype`) should be passed via `extra_args`:
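For example (passing the flag and its value as separate array elements is an assumption; vLLM also accepts `--flag=value`):

```json
"extra_args": ["--gpu-memory-utilization", "0.9", "--dtype", "bfloat16"]
```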
### Tensor parallelism with cuda_suspend/checkpoint

When using `cuda_suspend` or `checkpoint` with TP>1, you must include `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`:
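For example:

```json
"extra_args": [
  "--tensor-parallel-size", "2",
  "--enforce-eager",
  "--disable-custom-all-reduce"
]
```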
- `--enforce-eager` — CUDA graphs hold stale NCCL handles and crash on resume
- `--disable-custom-all-reduce` — CustomAllReduce IPC buffers cannot survive cuda-checkpoint
llmux validates the config at startup and warns if these flags are missing.
### Top-level options
| Field | Default | Description |
|---|---|---|
| `port` | `3000` | Proxy listen port |
| `metrics_port` | `9090` | Prometheus metrics port (`0` to disable) |
| `vllm_command` | `"vllm"` | vLLM binary path |
### Checkpoint config

To use the `cuda_suspend` or `checkpoint` process strategies, add a `checkpoint` section. For CRIU checkpointing, the full config is:
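A sketch using the defaults from the table below; in practice only the values you change need to be set:

```json
"checkpoint": {
  "criu_path": "criu",
  "cuda_plugin_dir": "/usr/lib/criu/",
  "images_dir": "/tmp/llmux-checkpoints",
  "cuda_checkpoint_path": "cuda-checkpoint"
}
```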
| Field | Default | Description |
|---|---|---|
| `criu_path` | `"criu"` | Path to the criu binary |
| `cuda_plugin_dir` | `"/usr/lib/criu/"` | Directory containing libcuda_plugin.so |
| `images_dir` | `"/tmp/llmux-checkpoints"` | Base directory for checkpoint images |
| `cuda_checkpoint_path` | `"cuda-checkpoint"` | Path to the cuda-checkpoint utility |
### vLLM logging

vLLM process output (stdout/stderr) is always captured and forwarded to the `vllm` tracing target at debug level. Use `RUST_LOG` to control visibility:
```bash
# Default: only llmux info logs, vLLM output hidden
RUST_LOG=info

# Show vLLM output
RUST_LOG=info,vllm=debug

# --verbose includes vLLM output automatically
```
ANSI color codes are stripped from vLLM output. The NO_COLOR=1 environment
variable is also set on spawned vLLM processes.
### Policy options

| Field | Default | Description |
|---|---|---|
| `policy_type` | `"fifo"` | Switching policy |
| `request_timeout_secs` | `60` | Request timeout |
| `drain_before_switch` | `true` | Wait for in-flight requests before sleeping |
| `eviction` | `retain` + `stop` | Default eviction policy |
## Validation
llmux includes a built-in validation tool that tests sleep/wake cycles against a running model, verifying GPU memory is freed and responses are deterministic after wake:
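A sketch of the invocation; `--validate` is referenced under Known issues, but how the config and model are selected is an assumption, so check `llmux --help`:

```bash
# --validate is documented below; the config argument shape is an assumption
llmux --validate --config config.json
```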
Output:
```
Eviction          Sleep (s)  Wake (s)  GPU Before  GPU After  GPU Wake   Response  Pass
----------------------------------------------------------------------------------------
Offload+KeepRun   35.9       1.2       45899 MiB   1341 MiB   44033 MiB  match     OK
Discard+KeepRun   0.3        8.2       44033 MiB   1341 MiB   44033 MiB  match     OK

Result: ALL PASSED
```
## Checkpoint management

Pre-create CRIU checkpoints for fast model switching. The checkpoint tooling can:

- Create a checkpoint (start the model, warm up, CRIU dump to disk)
- Use a different weight strategy (affects CRIU image size)
- Skip the warmup inference before checkpointing
- Restore detached (CRIU restore, health check, exit — process keeps running)
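A sketch; `--checkpoint` and `--restore-detached` are referenced below, but the exact argument shapes (model selection, weight-strategy and warmup flags) are assumptions, so check `llmux --help`:

```bash
# Passing the model name positionally is an assumption
llmux --checkpoint qwen-14b
llmux --restore-detached qwen-14b
```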
The default eviction for `--checkpoint` is `discard` + `checkpoint`, which produces small CRIU images (weights are reloaded from the HF cache on restore). Use `retain` + `checkpoint` or `offload` + `checkpoint` for larger images that restore faster (weights already in the snapshot).
After --restore-detached, the vLLM process continues running on its configured
port. This is useful for testing checkpoints or running a single model without
the daemon.
### Lazy restore via config

Instead of restoring manually, set `checkpoint_path` in your model config so the daemon restores from checkpoint on first request:
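A sketch of a model entry; the `checkpoint_path` value assumes images created under the default `images_dir`, and the surrounding nesting follows the same assumption as the Quickstart example:

```json
"qwen-14b": {
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "eviction": { "weights": "discard", "process": "checkpoint" },
  "checkpoint_path": "/tmp/llmux-checkpoints/qwen-14b"
}
```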
When the daemon starts, models with checkpoint_path are initialized in
checkpointed state. The first request triggers a CRIU restore instead of a
cold start — typically 3-5x faster.
`keep_images` must be `true` (the default) in the checkpoint config when using `checkpoint_path`, since the images must persist across daemon restarts.
## Docker Compose

### Basic setup
```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command:
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```
### With cuda-checkpoint/CRIU

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command:
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```
### With onwards (API key auth)

For production, put onwards in front for API key authentication:

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command:
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  onwards:
    image: ghcr.io/doublewordai/onwards:latest
    command:
    volumes:
      - ./targets.json:/etc/onwards/targets.json:ro
    ports:
      - "3000:3000"
```

Where `targets.json` maps model names to llmux with API keys:
## Tensor parallelism

All eviction strategies work with TP>1:

| Strategy | TP>1 | Notes |
|---|---|---|
| `offload` | Yes | vLLM manages NCCL teardown/rebuild internally |
| `discard` | Yes | Same — vLLM handles it |
| `cuda_suspend` | Yes | llmux tears down NCCL before cuda-checkpoint, rebuilds after restore |
| `checkpoint` | Yes | Same — NCCL teardown before checkpoint, rebuild after restore |
| `stop` | Yes | Kill + cold restart always works |
For cuda_suspend and checkpoint, llmux uses vLLM's /collective_rpc endpoint to call
suspend_nccl (before cuda-checkpoint) and resume_nccl (after restore)
on all TP workers. This tears down NCCL IPC handles that cuda-checkpoint
cannot checkpoint, then rebuilds them after CUDA state is restored.
This requires patched vLLM with suspend_nccl/resume_nccl support. The
Docker image includes these patches. For bare-metal installs, apply
patches/nccl-suspend-resume-v0.15.1.patch to your vLLM installation.
## Known issues

The `--validate` flag exists specifically to catch these kinds of problems before they hit production.
### vLLM v0.14+ sleep regression
Sleep mode (offload/discard) has a regression where weights are not discarded from GPU
memory (vllm#32714).
The Docker image includes a patch (fix-sleep-mode-v0.15.1.patch) that fixes
this. For bare-metal installs, apply the patch or use cuda_suspend/stop instead.
### vLLM v0.13.0

- `openai/gpt-oss-20b` `discard` reload fails: the MXFP4 weight loader crashes on wake with `default_weight_loader() got an unexpected keyword argument 'weight_name'`. `offload` works fine (19.6s sleep, 0.6s wake). Use `offload` for this model.
- `offload` and `discard` both work correctly for `Qwen/Qwen3-14B` and `google/gemma-3-12b-it`.
### NVIDIA driver requirements
The Docker image uses vLLM v0.15.1 which requires CUDA 12.9 and
nvidia-driver-580 or later. Check your driver version with nvidia-smi.
## Compatibility

- `offload` / `discard`: Works with vLLM v0.13.x out of the box. Works with v0.15.1 with the sleep fix patch (included in Docker image).
- `cuda_suspend` / `checkpoint`: Works with vLLM v0.13.x+ with NCCL patches (included in Docker image). Requires `cuda-checkpoint` and CRIU (included in Docker image).
- TP>1 with `cuda_suspend` / `checkpoint`: Requires the vLLM NCCL suspend/resume patches (included in Docker image) plus the `--enforce-eager` and `--disable-custom-all-reduce` flags.
## License
MIT