# llmux
[crates.io](https://crates.io/crates/llmux) · [GitHub](https://github.com/doublewordai/llmux)
LLM multiplexer for vLLM. Host multiple models on a single GPU, switching
between them on demand using vLLM's sleep/wake API.
When a request arrives for a model that isn't currently loaded, llmux puts the
active model to sleep (freeing GPU memory) and wakes the requested model. The
OpenAI-compatible API stays up throughout - clients just change the `model`
field.
## How it works
```
      Client requests
             |
        +---------+
        |  llmux  |  port 3000 (OpenAI-compatible)
        +---------+
          /     \
[vLLM:8001]   [vLLM:8002]
 (active)      (sleeping)
```
llmux spawns vLLM processes lazily on first request and manages their
lifecycle. Only one model is active at a time - the rest are sleeping
(weights offloaded to CPU or discarded) or stopped entirely.
### Sleep levels
| Level | Sleep | Wake | GPU freed | CPU RAM while asleep | State preserved | Best for |
|-------|-------|------|-----------|----------------------|-----------------|----------|
| **L1** | Slow (offload to CPU) | Fast (~1s) | Most | High (holds weights) | Partial | Model you expect to return to soon |
| **L2** | Fast (~1s) | Slow (reload from disk) | Most | None | No (KV cache, CUDA graphs lost) | Model you may not need for a while |
| **L3** (CUDA suspend) | Fast (~3s, ~15s at TP=2) | Fast (~3s, ~7s at TP=2) | All (100%) | High (holds VRAM) | Full | Like L1, but frees 100% GPU |
| **L4** (CRIU) | ~27s (checkpoint to disk) | ~15s (restore) | All (100%) | None | Full (KV cache, CUDA graphs, allocator) | Many models; lowest resource usage while sleeping |
| **L5** | Kill process | Cold start | All | None | No | Fallback / cleanup |
If L1-L4 sleep fails, llmux automatically escalates to L5 (kill) to guarantee
GPU memory is freed.
#### CUDA suspend (level 3)
Uses `cuda-checkpoint --toggle` to suspend CUDA state and copy VRAM to host
RAM. The process stays alive — no serialization, no CRIU. Wake is just another
toggle to copy state back to GPU.
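Conceptually, suspend and wake are the same command run twice against the vLLM
process. A minimal sketch, assuming the worker PID is in `$VLLM_PID` (llmux
does this for you, including the TP>1 NCCL coordination described below):
```bash
# Suspend: copy the process's CUDA state (including VRAM contents) into host
# RAM and release its GPU memory. The process itself keeps running.
sudo cuda-checkpoint --toggle --pid "$VLLM_PID"

# Wake: the same toggle copies the state back onto the GPU.
sudo cuda-checkpoint --toggle --pid "$VLLM_PID"
```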
Like L1, this holds state in CPU RAM. Unlike L1, it frees 100% of GPU memory
(L1 keeps ~500 MiB for CUDA context) and preserves full state.
For TP>1, llmux coordinates NCCL teardown before checkpoint and rebuild after
restore. This requires patched vLLM with `suspend_nccl`/`resume_nccl` support
(included in the Docker image).
**Requirements:**
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `cuda-checkpoint`)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`
#### CRIU checkpoint (level 4)
CRIU checkpointing uses `cuda-checkpoint` and `criu` to snapshot the entire
vLLM process tree to disk, then kill it. On restore, CRIU brings the process
back with all state intact — including GPU VRAM contents, KV cache, CUDA
graphs, and the warmed memory allocator. First inference after restore is ~30ms
(no warmup needed).
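For intuition, the checkpoint/restore cycle llmux drives looks roughly like the
sketch below. It is illustrative only: the PID and image directory are examples,
the exact flag set llmux passes may differ, and GPU state is handled by the CRIU
CUDA plugin together with `cuda-checkpoint`.
```bash
# Checkpoint: snapshot the whole vLLM process tree to disk (the tree is
# killed after the dump, freeing all GPU and CPU memory).
sudo criu dump --tree "$VLLM_PID" \
  --images-dir /tmp/llmux-checkpoints/qwen-14b \
  --libdir /usr/lib/criu/ \
  --shell-job

# Restore: bring the process tree back with VRAM contents, KV cache, and
# CUDA graphs intact, then detach criu from the restored tree.
sudo criu restore \
  --images-dir /tmp/llmux-checkpoints/qwen-14b \
  --libdir /usr/lib/criu/ \
  --restore-detached
```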
**Requirements:**
- CRIU 4.x with the CUDA plugin (`libcuda_plugin.so`) (included in Docker image)
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `criu` and `cuda-checkpoint`)
- vLLM process must not use `io_uring` or `libuv` (set automatically)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`
**Trade-offs vs L1/L2:**
- Sleep and wake are slower than L1, but there is no CPU RAM cost while asleep and full state is preserved
- Unlike L2, the first inference after wake needs no warmup: the KV cache, CUDA graphs, and memory allocator all survive
## Quickstart
Create a `config.json`:
```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 1
    },
    "gemma-12b": {
      "model_path": "google/gemma-3-12b-it",
      "port": 8002,
      "sleep_level": 2
    }
  },
  "port": 3000
}
```
### With Docker (recommended)
The Docker image bundles vLLM v0.15.1 with patches for NCCL suspend/resume
and sleep mode fixes:
```bash
docker run --gpus all --init \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```
For sleep levels 3 and 4 (cuda-checkpoint/CRIU), additional flags are required:
```bash
docker run --gpus all \
  --privileged \
  --pid=host \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -v /tmp/llmux-checkpoints:/tmp/llmux-checkpoints \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```
The extra flags are needed because:
- `--privileged` — CRIU requires broad namespace and ptrace access
- `--pid=host` — cuda-checkpoint needs to ptrace vLLM worker PIDs
- `--ipc=host` — NCCL uses shared memory for inter-GPU communication
- `-v /tmp/llmux-checkpoints:...` — CRIU checkpoints can be tens of GB; mount a host volume to avoid filling the container filesystem
**Important:** Do NOT use `--init` with CRIU (sleep level 4). Docker's init process (tini)
redirects stdin to the host's `/dev/null`, whose mount ID is invisible inside the container.
CRIU dump fails with "Can't lookup mount=N for fd=0 path=/dev/null".
### From source
Requires vLLM installed and available as `vllm` on PATH:
```bash
cargo install llmux
llmux --config config.json
```
For sleep levels 3 and 4, you also need `cuda-checkpoint` and `criu` (with
CUDA plugin) installed and either run as root or configure passwordless sudo.
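For example, if llmux runs as an unprivileged user, a sudoers drop-in along
these lines grants the required access. The username and binary paths are
assumptions; adjust them to your system:
```bash
# Hypothetical "llmux" user; binary paths must match your installation.
cat <<'EOF' | sudo tee /etc/sudoers.d/llmux
llmux ALL=(root) NOPASSWD: /usr/local/bin/cuda-checkpoint, /usr/sbin/criu
EOF
sudo chmod 0440 /etc/sudoers.d/llmux
```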
### Send requests
```bash
# First request starts vLLM for qwen-14b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

# Switching: sleeps qwen-14b, starts gemma-12b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-12b", "messages": [{"role": "user", "content": "Hello"}]}'
```
## Configuration
### Model options
| Option | Default | Description |
|--------|---------|-------------|
| `model_path` | *required* | HuggingFace model ID or local path |
| `port` | *required* | Port for this model's vLLM instance |
| `sleep_level` | `5` | Sleep level (1-2: vLLM, 3: CUDA suspend, 4: CRIU, 5: stop) |
| `extra_args` | `[]` | Additional vLLM CLI arguments |
All vLLM-specific flags (e.g. `--gpu-memory-utilization`, `--tensor-parallel-size`,
`--dtype`) should be passed via `extra_args`:
```json
{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "extra_args": ["--gpu-memory-utilization", "0.9", "--tensor-parallel-size", "2"]
}
```
#### Tensor parallelism with L3/L4
When using sleep levels 3 or 4 with TP>1, you **must** include `--enforce-eager`
and `--disable-custom-all-reduce` in `extra_args`:
```json
{
  "model_path": "NousResearch/Meta-Llama-3.1-8B-Instruct",
  "port": 8001,
  "sleep_level": 3,
  "extra_args": [
    "--tensor-parallel-size", "2",
    "--enforce-eager",
    "--disable-custom-all-reduce",
    "--gpu-memory-utilization", "0.85"
  ]
}
```
- `--enforce-eager` — CUDA graphs hold stale NCCL handles and crash on resume
- `--disable-custom-all-reduce` — CustomAllReduce IPC buffers cannot survive cuda-checkpoint
llmux validates the config at startup and warns if these flags are missing.
### Top-level options
| Option | Default | Description |
|--------|---------|-------------|
| `port` | `3000` | Proxy listen port |
| `metrics_port` | `9090` | Prometheus metrics port (0 to disable) |
| `vllm_command` | `"vllm"` | vLLM binary path |
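Putting these together with the model map from the quickstart, a config that
moves the proxy to port 8080, disables the metrics endpoint, and points at a
specific vLLM binary would look like this (the binary path is just an example):
```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 1
    }
  },
  "port": 8080,
  "metrics_port": 0,
  "vllm_command": "/opt/vllm/bin/vllm"
}
```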
### Checkpoint config (for sleep levels 3 and 4)
To use cuda-checkpoint or CRIU, add a `checkpoint` section to your config:
```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 3
    }
  },
  "checkpoint": {
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}
```
For CRIU checkpointing (level 4), the full config is:
```json
{
  "checkpoint": {
    "criu_path": "criu",
    "cuda_plugin_dir": "/usr/lib/criu/",
    "images_dir": "/tmp/llmux-checkpoints",
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}
```
| Option | Default | Description |
|--------|---------|-------------|
| `criu_path` | `"criu"` | Path to the criu binary |
| `cuda_plugin_dir` | `"/usr/lib/criu/"` | Directory containing `libcuda_plugin.so` |
| `images_dir` | `"/tmp/llmux-checkpoints"` | Base directory for checkpoint images |
| `cuda_checkpoint_path` | `"cuda-checkpoint"` | Path to the cuda-checkpoint utility |
### vLLM logging
vLLM process output (stdout/stderr) is always captured and forwarded to the
`vllm` tracing target at `debug` level. Use `RUST_LOG` to control visibility:
```bash
# Default: only llmux info logs, vLLM output hidden
llmux --config config.json
# Show vLLM output
RUST_LOG=info,vllm=debug llmux --config config.json
# --verbose includes vLLM output automatically
llmux --config config.json --verbose
```
ANSI color codes are stripped from vLLM output. The `NO_COLOR=1` environment
variable is also set on spawned vLLM processes.
### Policy options
| Option | Default | Description |
|--------|---------|-------------|
| `policy_type` | `"fifo"` | Switching policy |
| `request_timeout_secs` | `60` | Request timeout |
| `drain_before_switch` | `true` | Wait for in-flight requests before sleeping |
| `sleep_level` | `5` | Default sleep level for policy |
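As a sketch, these might sit under a top-level `policy` object in
`config.json`. The option names come from the table above, but the `policy`
nesting key is an assumption; check the crate documentation for the exact
location:
```json
{
  "policy": {
    "policy_type": "fifo",
    "request_timeout_secs": 120,
    "drain_before_switch": true,
    "sleep_level": 1
  }
}
```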
## Validation
llmux includes a built-in validation tool that tests sleep/wake cycles
against a running model, verifying GPU memory is freed and responses are
deterministic after wake:
```bash
llmux --config config.json --validate qwen-14b --levels 1,2,3,4 --verbose
```
Output:
```
Level   Sleep (s)   Wake (s)   GPU Before   GPU After   GPU Wake   Response   Pass
----------------------------------------------------------------------------------
L1      35.9        1.2        45899 MiB    1341 MiB    44033 MiB  match      OK
L2      0.3         8.2        44033 MiB    1341 MiB    44033 MiB  match      OK
Result: ALL PASSED
```
## Docker Compose
### Basic setup
```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```
### With cuda-checkpoint/CRIU (sleep levels 3 and 4)
```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true  # remove for CRIU (sleep level 4); see the --init note above
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```
### With onwards (API key auth)
For production, put [onwards](https://github.com/doublewordai/onwards) in
front for API key authentication:
```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true  # remove for CRIU (sleep level 4); see the --init note above
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  onwards:
    image: ghcr.io/doublewordai/onwards:latest
    command: ["--targets", "/etc/onwards/targets.json"]
    volumes:
      - ./targets.json:/etc/onwards/targets.json:ro
    ports:
      - "3000:3000"
```
Where `targets.json` maps model names to llmux with API keys:
```json
{
  "targets": {
    "qwen-14b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "qwen-14b"
    },
    "gemma-12b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "gemma-12b"
    }
  }
}
```
## Tensor parallelism
All sleep levels work with TP>1:
| Level | TP>1 | Notes |
|-------|------|-------|
| L1 | Yes | vLLM manages NCCL teardown/rebuild internally |
| L2 | Yes | Same — vLLM handles it |
| L3 | Yes | llmux tears down NCCL before cuda-checkpoint, rebuilds after restore |
| L4 | Yes | Same — NCCL teardown before checkpoint, rebuild after restore |
| L5 | Yes | Kill + cold restart always works |
For L3 and L4, llmux uses vLLM's `/collective_rpc` endpoint to call
`suspend_nccl` (before cuda-checkpoint) and `resume_nccl` (after restore)
on all TP workers. This tears down NCCL IPC handles that cuda-checkpoint
cannot checkpoint, then rebuilds them after CUDA state is restored.
This requires patched vLLM with `suspend_nccl`/`resume_nccl` support. The
Docker image includes these patches. For bare-metal installs, apply
`patches/nccl-suspend-resume-v0.15.1.patch` to your vLLM installation.
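One way to do that on bare metal is to build vLLM from a patched source
checkout. A sketch, assuming the tag matches the vLLM version the patch targets
and the llmux repo is checked out at `/path/to/llmux`:
```bash
# Build a patched vLLM for bare-metal L3/L4 support.
git clone --branch v0.15.1 --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
git apply /path/to/llmux/patches/nccl-suspend-resume-v0.15.1.patch
# Apply fix-sleep-mode-v0.15.1.patch the same way if you also need the L1/L2 fix.
pip install -e .
```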
## Known issues
The `--validate` flag exists specifically to catch the kinds of problems listed
below before they hit production.
### vLLM v0.14+ sleep regression
Sleep mode (L1/L2) has a regression where weights are not discarded from GPU
memory ([vllm#32714](https://github.com/vllm-project/vllm/issues/32714)).
The Docker image includes a patch (`fix-sleep-mode-v0.15.1.patch`) that fixes
this. For bare-metal installs, apply the patch or use L3/L5 instead.
### vLLM v0.13.0
- **`openai/gpt-oss-20b` L2 reload fails.** The MXFP4 weight loader crashes on
wake with `default_weight_loader() got an unexpected keyword argument
'weight_name'`. L1 works fine (19.6s sleep, 0.6s wake). Use L1 for this
model.
- L1 and L2 both work correctly for `Qwen/Qwen3-14B` and
`google/gemma-3-12b-it`.
### NVIDIA driver requirements
The Docker image uses vLLM v0.15.1, which requires CUDA 12.9 and
nvidia-driver-580 or later. Check your driver version with `nvidia-smi`.
## Compatibility
- **L1/L2 sleep:** Works with vLLM v0.13.x out of the box. Works with v0.15.1 with the sleep fix patch (included in Docker image).
- **L3 CUDA suspend / L4 CRIU checkpoint:** Works with vLLM v0.13.x+ with NCCL patches (included in Docker image). Requires `cuda-checkpoint` and CRIU (included in Docker image).
- **TP>1 with L3/L4:** Requires vLLM NCCL suspend/resume patches (included in Docker image) plus `--enforce-eager` and `--disable-custom-all-reduce` flags.
## License
MIT