llmux

LLM multiplexer for vLLM. Host multiple models on a single GPU, switching between them on demand using vLLM's sleep/wake API.

When a request arrives for a model that isn't currently loaded, llmux puts the active model to sleep (freeing GPU memory) and wakes the requested model. The OpenAI-compatible API stays up throughout - clients just change the model field.

How it works

                    Client requests
                         |
                    +---------+
                    |  llmux  |   port 3000 (OpenAI-compatible)
                    +---------+
                    /         \
            [vLLM:8001]    [vLLM:8002]
             (active)       (sleeping)

llmux spawns vLLM processes lazily on first request and manages their lifecycle. Only one model is active at a time - the rest are sleeping (weights offloaded to CPU or discarded) or stopped entirely.

Sleep levels

Level              Sleep                      Wake                     GPU freed   CPU RAM               State preserved                          Use case
----------------------------------------------------------------------------------------------------------------------------------------------------------
L1                 Slow (offload to CPU)      Fast (~1s)               Most        High (holds weights)  Partial                                  Model you expect to return to soon
L2                 Fast (~1s)                 Slow (reload from disk)  Most        None                  No (KV cache, CUDA graphs lost)          Model you may not need for a while
L3 (CUDA suspend)  Fast (~3s, ~15s at TP=2)   Fast (~3s, ~7s at TP=2)  All (100%)  High (holds VRAM)     Full                                     Like L1, but frees 100% of GPU memory
L4 (CRIU)          ~27s (checkpoint to disk)  ~15s (restore)           All (100%)  None                  Full (KV cache, CUDA graphs, allocator)  Many models; lowest resource usage while sleeping
L5                 Kill process               Cold start               All         None                  No                                       Fallback / cleanup

If L1-L4 sleep fails, llmux automatically escalates to L5 (kill) to guarantee GPU memory is freed.

CUDA suspend (level 3)

Uses cuda-checkpoint --toggle to suspend CUDA state and copy VRAM to host RAM. The process stays alive — no serialization, no CRIU. Wake is just another toggle to copy state back to GPU.

Like L1, this holds state in CPU RAM. Unlike L1, it frees 100% of GPU memory (L1 keeps ~500 MiB for CUDA context) and preserves full state.

For TP>1, llmux coordinates NCCL teardown before checkpoint and rebuild after restore. This requires a patched vLLM with suspend_nccl/resume_nccl support (included in the Docker image).

Requirements:

  • cuda-checkpoint utility (included in Docker image)
  • Root access (or passwordless sudo for cuda-checkpoint)
  • For TP>1: --enforce-eager and --disable-custom-all-reduce in extra_args
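For reference, the manual equivalent of one suspend/resume cycle looks roughly like the sketch below. This is illustrative only: llmux drives this for you, and the pgrep pattern and PID handling here are assumptions, with the flags following the upstream cuda-checkpoint usage.

# Find a vLLM worker process (illustrative; llmux tracks worker PIDs itself)
VLLM_PID=$(pgrep -f "vllm serve" | head -n1)

# Suspend: CUDA state (including VRAM contents) is copied to host RAM
sudo cuda-checkpoint --toggle --pid "$VLLM_PID"

# Wake: toggling again copies the state back to the GPU
sudo cuda-checkpoint --toggle --pid "$VLLM_PID"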

CRIU checkpoint (level 4)

CRIU checkpointing uses cuda-checkpoint and criu to snapshot the entire vLLM process tree to disk, then kill it. On restore, CRIU brings the process back with all state intact — including GPU VRAM contents, KV cache, CUDA graphs, and the warmed memory allocator. First inference after restore is ~30ms (no warmup needed).

Requirements:

  • CRIU 4.x with the CUDA plugin (libcuda_plugin.so) (included in Docker image)
  • cuda-checkpoint utility (included in Docker image)
  • Root access (or passwordless sudo for criu and cuda-checkpoint)
  • vLLM process must not use io_uring or libuv (set automatically)
  • For TP>1: --enforce-eager and --disable-custom-all-reduce in extra_args
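Before enabling level 4 on a bare-metal host, it can help to sanity-check that the pieces are in place. A quick sketch, assuming the default paths listed in the checkpoint config section below:

# CRIU version and environment self-check
criu --version
sudo criu check

# The CUDA plugin should be present in the plugin directory
ls /usr/lib/criu/libcuda_plugin.so

# cuda-checkpoint should be resolvable on PATH
command -v cuda-checkpoint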

Trade-offs vs L1/L2:

  • Slower to sleep and wake than L1, but no CPU RAM cost while sleeping and full state preservation
  • Slower to wake than L1, but faster first inference after wake (no warmup needed)

Quickstart

Create a config.json:

{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 1
    },
    "gemma-12b": {
      "model_path": "google/gemma-3-12b-it",
      "port": 8002,
      "sleep_level": 2
    }
  },
  "port": 3000
}

With Docker (recommended)

The Docker image bundles vLLM v0.15.1 with patches for NCCL suspend/resume and sleep mode fixes:

docker run --gpus all --init \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest

For sleep levels 3 and 4 (cuda-checkpoint/CRIU), additional flags are required:

docker run --gpus all --init \
  --pid=host \
  --ipc=host \
  --cap-add SYS_PTRACE \
  --cap-add CHECKPOINT_RESTORE \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest

The extra flags are needed because:

  • --pid=host — cuda-checkpoint needs to ptrace vLLM worker PIDs
  • --ipc=host — NCCL uses shared memory for inter-GPU communication
  • --cap-add SYS_PTRACE — cuda-checkpoint's ptrace calls
  • --cap-add CHECKPOINT_RESTORE — CRIU process checkpoint/restore

From source

Requires vLLM installed and available as vllm on PATH:

cargo install llmux
llmux --config config.json

For sleep levels 3 and 4, you also need cuda-checkpoint and criu (with CUDA plugin) installed and either run as root or configure passwordless sudo.
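If you prefer not to run llmux as root, a minimal sudoers drop-in is one way to grant passwordless access to just those two binaries. A sketch, assuming a user named llmux and typical install paths (adjust both to your system):

# Drop-in sudoers file; the username and binary paths are illustrative
echo 'llmux ALL=(root) NOPASSWD: /usr/local/bin/cuda-checkpoint, /usr/sbin/criu' \
  | sudo tee /etc/sudoers.d/llmux
sudo chmod 0440 /etc/sudoers.d/llmux

# Validate the syntax before relying on it
sudo visudo -c -f /etc/sudoers.d/llmux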

Send requests

# First request starts vLLM for qwen-14b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

# Switching: sleeps qwen-14b, starts gemma-12b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-12b", "messages": [{"role": "user", "content": "Hello"}]}'
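Because the proxy is OpenAI-compatible, the standard model-listing endpoint is a convenient way to check which model names are routable (this assumes llmux exposes /v1/models in front of the per-model vLLM servers):

# List the model names clients can request (endpoint assumed to be proxied)
curl http://localhost:3000/v1/models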

Configuration

Model options

Field        Default    Description
---------------------------------------------------------------------------
model_path   required   HuggingFace model ID or local path
port         required   Port for this model's vLLM instance
sleep_level  5          Sleep level (1-2: vLLM, 3: CUDA suspend, 4: CRIU, 5: stop)
extra_args   []         Additional vLLM CLI arguments

All vLLM-specific flags (e.g. --gpu-memory-utilization, --tensor-parallel-size, --dtype) should be passed via extra_args:

{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "extra_args": ["--gpu-memory-utilization", "0.9", "--tensor-parallel-size", "2"]
}

Tensor parallelism with L3/L4

When using sleep levels 3 or 4 with TP>1, you must include --enforce-eager and --disable-custom-all-reduce in extra_args:

{
  "model_path": "NousResearch/Meta-Llama-3.1-8B-Instruct",
  "port": 8001,
  "sleep_level": 3,
  "extra_args": [
    "--tensor-parallel-size", "2",
    "--enforce-eager",
    "--disable-custom-all-reduce",
    "--gpu-memory-utilization", "0.85"
  ]
}

  • --enforce-eager — CUDA graphs hold stale NCCL handles and crash on resume
  • --disable-custom-all-reduce — CustomAllReduce IPC buffers cannot survive cuda-checkpoint

llmux validates the config at startup and warns if these flags are missing.

Top-level options

Field           Default   Description
---------------------------------------------------------
port            3000      Proxy listen port
metrics_port    9090      Prometheus metrics port (0 to disable)
vllm_command    "vllm"    vLLM binary path
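
With the defaults above, Prometheus metrics are served on their own port. A quick check, assuming the conventional /metrics path:

# Scrape llmux metrics (the /metrics path is assumed here)
curl http://localhost:9090/metrics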

Checkpoint config (for sleep levels 3 and 4)

To use cuda-checkpoint or CRIU, add a checkpoint section to your config:

{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 3
    }
  },
  "checkpoint": {
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}

For CRIU checkpointing (level 4), the full config is:

{
  "checkpoint": {
    "criu_path": "criu",
    "cuda_plugin_dir": "/usr/lib/criu/",
    "images_dir": "/tmp/llmux-checkpoints",
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}

Field                  Default                    Description
------------------------------------------------------------------------------
criu_path              "criu"                     Path to the criu binary
cuda_plugin_dir        "/usr/lib/criu/"           Directory containing libcuda_plugin.so
images_dir             "/tmp/llmux-checkpoints"   Base directory for checkpoint images
cuda_checkpoint_path   "cuda-checkpoint"          Path to the cuda-checkpoint utility

vLLM logging

vLLM process output (stdout/stderr) is always captured and forwarded to the vllm tracing target at debug level. Use RUST_LOG to control visibility:

# Default: only llmux info logs, vLLM output hidden
llmux --config config.json

# Show vLLM output
RUST_LOG=info,vllm=debug llmux --config config.json

# --verbose includes vLLM output automatically
llmux --config config.json --verbose

ANSI color codes are stripped from vLLM output. The NO_COLOR=1 environment variable is also set on spawned vLLM processes.

Policy options

Field                  Default   Description
---------------------------------------------------------------------
policy_type            "fifo"    Switching policy
request_timeout_secs   60        Request timeout
drain_before_switch    true      Wait for in-flight requests before sleeping
sleep_level            5         Default sleep level for policy

Validation

llmux includes a built-in validation tool that tests sleep/wake cycles against a running model, verifying GPU memory is freed and responses are deterministic after wake:

llmux --config config.json --validate qwen-14b --levels 1,2,3,4 --verbose

Output:

Level     Sleep (s)   Wake (s)   GPU Before    GPU After     GPU Wake   Response   Pass
----------------------------------------------------------------------------------------
L1             35.9        1.2      45899 MiB       1341 MiB      44033 MiB      match     OK
L2              0.3        8.2      44033 MiB       1341 MiB      44033 MiB      match     OK

Result: ALL PASSED

Docker Compose

Basic setup

services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

With cuda-checkpoint/CRIU (sleep levels 3 and 4)

services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

With onwards (API key auth)

For production, put onwards in front for API key authentication:

services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  onwards:
    image: ghcr.io/doublewordai/onwards:latest
    command: ["--targets", "/etc/onwards/targets.json"]
    volumes:
      - ./targets.json:/etc/onwards/targets.json:ro
    ports:
      - "3000:3000"

Where targets.json maps model names to llmux with API keys:

{
  "targets": {
    "qwen-14b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "qwen-14b"
    },
    "gemma-12b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "gemma-12b"
    }
  }
}
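Clients then send requests to onwards instead of llmux, supplying the key. A sketch, assuming onwards accepts the usual Authorization: Bearer header:

curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'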

Tensor parallelism

All sleep levels work with TP>1:

Level   TP>1   Notes
---------------------------------------------------------------------------
L1      Yes    vLLM manages NCCL teardown/rebuild internally
L2      Yes    Same — vLLM handles it
L3      Yes    llmux tears down NCCL before cuda-checkpoint, rebuilds after restore
L4      Yes    Same — NCCL teardown before checkpoint, rebuild after restore
L5      Yes    Kill + cold restart always works

For L3 and L4, llmux uses vLLM's /collective_rpc endpoint to call suspend_nccl (before cuda-checkpoint) and resume_nccl (after restore) on all TP workers. This tears down NCCL IPC handles that cuda-checkpoint cannot checkpoint, then rebuilds them after CUDA state is restored.

This requires a patched vLLM with suspend_nccl/resume_nccl support. The Docker image includes these patches. For bare-metal installs, apply patches/nccl-suspend-resume-v0.15.1.patch to your vLLM installation.
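For a bare-metal install, patch application might look like the following sketch; the strip level (-p2) and the location of your vLLM install are assumptions to adapt to how vLLM was installed:

# Locate the installed vLLM package (varies by environment)
VLLM_DIR=$(python -c "import vllm, os; print(os.path.dirname(vllm.__file__))")

# Apply the NCCL suspend/resume patch from the llmux repo (strip level is illustrative)
patch -d "$VLLM_DIR" -p2 < patches/nccl-suspend-resume-v0.15.1.patch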

Known issues

The --validate flag exists specifically to catch the kinds of problems listed below before they hit production.

vLLM v0.14+ sleep regression

Sleep mode (L1/L2) has a regression where weights are not discarded from GPU memory (vllm#32714). The Docker image includes a patch (fix-sleep-mode-v0.15.1.patch) that fixes this. For bare-metal installs, apply the patch or use L3/L5 instead.

vLLM v0.13.0

  • openai/gpt-oss-20b L2 reload fails. The MXFP4 weight loader crashes on wake with default_weight_loader() got an unexpected keyword argument 'weight_name'. L1 works fine (19.6s sleep, 0.6s wake). Use L1 for this model.
  • L1 and L2 both work correctly for Qwen/Qwen3-14B and google/gemma-3-12b-it.

NVIDIA driver requirements

The Docker image uses vLLM v0.15.1 which requires CUDA 12.9 and nvidia-driver-580 or later. Check your driver version with nvidia-smi.
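For example:

# Print the installed driver version (must be 580 or later for the bundled vLLM)
nvidia-smi --query-gpu=driver_version --format=csv,noheader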

Compatibility

  • L1/L2 sleep: Works with vLLM v0.13.x out of the box. Works with v0.15.1 with the sleep fix patch (included in Docker image).
  • L3 CUDA suspend / L4 CRIU checkpoint: Works with vLLM v0.13.x+ with NCCL patches (included in Docker image). Requires cuda-checkpoint and CRIU (included in Docker image).
  • TP>1 with L3/L4: Requires vLLM NCCL suspend/resume patches (included in Docker image) plus --enforce-eager and --disable-custom-all-reduce flags.

License

MIT