llmux 0.7.11

Zero-reload model switching for vLLM - manages multiple models on a shared GPU.

LLM multiplexer for vLLM. Host multiple models on a single GPU, switching between them on demand using vLLM's sleep/wake API.

When a request arrives for a model that isn't currently loaded, llmux puts the active model to sleep (freeing GPU memory) and wakes the requested model. The OpenAI-compatible API stays up throughout - clients just change the model field.

How it works

                    Client requests
                         |
                    +---------+
                    |  llmux  |   port 3000 (OpenAI-compatible)
                    +---------+
                    /         \
            [vLLM:8001]    [vLLM:8002]
             (active)       (sleeping)

llmux spawns vLLM processes lazily on first request and manages their lifecycle. Only one model is active at a time - the rest are evicted using a configurable two-axis policy.

Eviction policy

When a model is evicted, llmux applies two strategies in sequence:

  1. Weight strategy — what vLLM does with model weights (via the sleep API)
  2. Process strategy — what happens to the OS process afterward
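
Each model picks one value per axis in its config entry (see Configuration below), for example:

{
  "eviction": { "weights": "offload", "process": "keep_running" }
}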

Weight strategy

Applied first, via vLLM's sleep endpoint. Controls what happens to weights and KV cache:

Strategy   Description
retain     Nothing happens. Weights and KV cache stay on GPU.
offload    Weights copied to pinned CPU RAM. KV cache and CUDA graphs discarded. Frees most GPU memory but uses significant host RAM.
discard    Weights dropped entirely (reloaded from disk on wake). KV cache and CUDA graphs discarded. Frees most GPU memory with no CPU RAM cost.

Both offload and discard leave a small CUDA context (~500 MiB) on the GPU.

Process strategy

Applied second, to the process left behind by the weight strategy:

Strategy       Description
keep_running   Process stays as-is. Fast, but whatever's still on GPU stays there.
cuda_suspend   Snapshots remaining VRAM to host RAM via cuda-checkpoint, freeing 100% of GPU memory. Process stays alive.
checkpoint     CRIU dumps the entire process (including host RAM) to disk, then kills it. Frees 100% of GPU and CPU memory.
stop           Kills the process. Everything is lost.

How they interact

The weight strategy determines what's on GPU vs. CPU before the process strategy runs. This affects speed, memory usage, and what survives a wake cycle:

retain
  keep_running   No-op (model stays loaded)
  cuda_suspend   Full VRAM snapshot to CPU — weights, KV cache, CUDA graphs all preserved
  checkpoint     Full process checkpoint to disk — large image (includes VRAM)
  stop           Everything lost

offload
  keep_running   Weights on CPU, KV cache lost, ~500 MiB CUDA context remains on GPU
  cuda_suspend   Remaining CUDA context → CPU, weights already on CPU
  checkpoint     Large CRIU image (weights in host RAM get written to disk)
  stop           Everything lost

discard
  keep_running   Weights gone, KV cache lost, ~500 MiB CUDA context remains on GPU
  cuda_suspend   Remaining CUDA context → CPU, weights gone
  checkpoint     Small CRIU image (no weights — reloaded from HF cache on wake)
  stop           Everything lost

Common choices:

  • offload + keep_running — Fast wake (weights already in RAM), but holds CPU memory and ~500 MiB GPU
  • discard + keep_running — No CPU RAM cost, but slow wake (reload from disk) and ~500 MiB GPU
  • retain + cuda_suspend — Frees 100% GPU, full state preserved, but holds all VRAM in CPU RAM
  • discard + checkpoint — Frees 100% GPU and CPU, small CRIU image; wake reloads weights from disk but restores KV cache, CUDA graphs, and warmed allocator from checkpoint
  • offload + checkpoint — Like above but CRIU image is large (includes weights); wake is faster (no disk reload) but checkpoint is slower and uses more disk

If eviction fails, llmux automatically escalates to stop to guarantee GPU memory is freed.

CUDA suspend

Uses cuda-checkpoint --toggle to suspend CUDA state and copy VRAM to host RAM. The process stays alive — no serialization, no CRIU. Wake is just another toggle to copy state back to GPU.
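
For reference, the underlying toggle looks roughly like this; llmux runs it for you, and the worker PID shown is a placeholder:

# Illustrative only: suspend and resume are the same toggle command
sudo cuda-checkpoint --toggle --pid <worker-pid>   # suspend: copy CUDA state and VRAM to host RAM
sudo cuda-checkpoint --toggle --pid <worker-pid>   # toggle again to resume: copy state back to GPU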

Like offload, this holds state in CPU RAM. Unlike offload, it frees 100% of GPU memory (offload keeps ~500 MiB for CUDA context) and preserves full state.

For TP>1, llmux coordinates NCCL teardown before checkpoint and rebuild after restore. This requires patched vLLM with suspend_nccl/resume_nccl support (included in the Docker image).

Requirements:

  • cuda-checkpoint utility (included in Docker image)
  • Root access (or passwordless sudo for cuda-checkpoint)
  • For TP>1: --enforce-eager and --disable-custom-all-reduce in extra_args

CRIU checkpoint

CRIU checkpointing uses cuda-checkpoint and criu to snapshot the entire vLLM process tree to disk, then kill it. On restore, CRIU brings the process back with all state intact — including GPU VRAM contents, KV cache, CUDA graphs, and the warmed memory allocator. First inference after restore is ~30ms (no warmup needed).

Requirements:

  • CRIU 4.x with the CUDA plugin (libcuda_plugin.so) (included in Docker image)
  • cuda-checkpoint utility (included in Docker image)
  • Root access (or passwordless sudo for criu and cuda-checkpoint)
  • vLLM process must not use io_uring or libuv (set automatically)
  • For TP>1: --enforce-eager and --disable-custom-all-reduce in extra_args
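
On a bare-metal host, a quick way to sanity-check most of these requirements (the paths shown match the Docker image defaults and may differ on your system):

criu --version                       # expect CRIU 4.x
criu check                           # verify required kernel features
ls /usr/lib/criu/libcuda_plugin.so   # CUDA plugin present?
which cuda-checkpoint                # utility on PATH?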

Trade-offs vs offload:

  • Sleep and wake are slower than offload, since the checkpoint image is written to and read back from disk
  • No CPU RAM cost while checkpointed, and full state is preserved (KV cache, CUDA graphs, warmed allocator)
  • First inference after wake is faster (no warmup needed)

Quickstart

Create a config.json:

{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "eviction": { "weights": "offload", "process": "keep_running" }
    },
    "gemma-12b": {
      "model_path": "google/gemma-3-12b-it",
      "port": 8002,
      "eviction": { "weights": "discard", "process": "keep_running" }
    }
  },
  "port": 3000
}

With Docker (recommended)

The Docker image bundles vLLM v0.15.1 with patches for NCCL suspend/resume and sleep mode fixes:

docker run --gpus all --init \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest

For cuda_suspend or checkpoint process strategies, additional flags are required:

docker run --gpus all \
  --privileged \
  --pid=host \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -v /tmp/llmux-checkpoints:/tmp/llmux-checkpoints \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest

The extra flags are needed because:

  • --privileged — CRIU requires broad namespace and ptrace access
  • --pid=host — cuda-checkpoint needs to ptrace vLLM worker PIDs
  • --ipc=host — NCCL uses shared memory for inter-GPU communication
  • -v /tmp/llmux-checkpoints:... — CRIU checkpoints can be tens of GB; mount a host volume to avoid filling the container filesystem

Important: Do NOT use --init with CRIU (checkpoint process strategy). Docker's init process (tini) redirects stdin to the host's /dev/null, whose mount ID is invisible inside the container. CRIU dump fails with "Can't lookup mount=N for fd=0 path=/dev/null".

From source

Requires vLLM installed and available as vllm on PATH:

cargo install llmux
llmux --config config.json

For cuda_suspend or checkpoint process strategies, you also need cuda-checkpoint and criu (with the CUDA plugin) installed, and you must either run llmux as root or configure passwordless sudo for those tools.
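
A passwordless sudo setup might look like the sketch below; the username and binary paths are placeholders, so adjust them to your install:

# /etc/sudoers.d/llmux (illustrative; edit with visudo)
youruser ALL=(root) NOPASSWD: /usr/local/bin/cuda-checkpoint, /usr/local/sbin/criu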

Send requests

# First request starts vLLM for qwen-14b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

# Switching: sleeps qwen-14b, starts gemma-12b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-12b", "messages": [{"role": "user", "content": "Hello"}]}'

Configuration

Model options

Field             Default         Description
model_path        required        HuggingFace model ID or local path
port              required        Port for this model's vLLM instance
eviction          retain + stop   Eviction policy (see below)
extra_args        []              Additional vLLM CLI arguments
checkpoint_path   none            Path to CRIU checkpoint images for lazy restore on first request
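
Putting these together, a fully specified model entry might look like this (values are illustrative and drawn from the examples in this README):

{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "eviction": { "weights": "discard", "process": "checkpoint" },
  "extra_args": ["--gpu-memory-utilization", "0.9"],
  "checkpoint_path": "/tmp/llmux-checkpoints/qwen-14b/images"
}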

The eviction field takes an object with weights and process keys:

{
  "eviction": { "weights": "offload", "process": "keep_running" }
}

All vLLM-specific flags (e.g. --gpu-memory-utilization, --tensor-parallel-size, --dtype) should be passed via extra_args:

{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "extra_args": ["--gpu-memory-utilization", "0.9", "--tensor-parallel-size", "2"]
}

Tensor parallelism with cuda_suspend/checkpoint

When using cuda_suspend or checkpoint with TP>1, you must include --enforce-eager and --disable-custom-all-reduce in extra_args:

{
  "model_path": "NousResearch/Meta-Llama-3.1-8B-Instruct",
  "port": 8001,
  "eviction": { "weights": "retain", "process": "cuda_suspend" },
  "extra_args": [
    "--tensor-parallel-size", "2",
    "--enforce-eager",
    "--disable-custom-all-reduce",
    "--gpu-memory-utilization", "0.85"
  ]
}

  • --enforce-eager — CUDA graphs hold stale NCCL handles and crash on resume
  • --disable-custom-all-reduce — CustomAllReduce IPC buffers cannot survive cuda-checkpoint

llmux validates the config at startup and warns if these flags are missing.

Top-level options

Field          Default   Description
port           3000      Proxy listen port
metrics_port   9090      Prometheus metrics port (0 to disable)
vllm_command   "vllm"    vLLM binary path
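
A sketch combining these with a minimal models section (the vllm_command path is a hypothetical example):

{
  "models": {
    "qwen-14b": { "model_path": "Qwen/Qwen3-14B", "port": 8001 }
  },
  "port": 3000,
  "metrics_port": 0,
  "vllm_command": "/opt/vllm/bin/vllm"
}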

Checkpoint config

To use cuda_suspend or checkpoint process strategies, add a checkpoint section:

{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "eviction": { "weights": "retain", "process": "cuda_suspend" }
    }
  },
  "checkpoint": {
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}

For CRIU checkpointing, the full config is:

{
  "checkpoint": {
    "criu_path": "criu",
    "cuda_plugin_dir": "/usr/lib/criu/",
    "images_dir": "/tmp/llmux-checkpoints",
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}

Field                  Default                    Description
criu_path              "criu"                     Path to the criu binary
cuda_plugin_dir        "/usr/lib/criu/"           Directory containing libcuda_plugin.so
images_dir             "/tmp/llmux-checkpoints"   Base directory for checkpoint images
cuda_checkpoint_path   "cuda-checkpoint"          Path to the cuda-checkpoint utility

vLLM logging

vLLM process output (stdout/stderr) is always captured and forwarded to the vllm tracing target at debug level. Use RUST_LOG to control visibility:

# Default: only llmux info logs, vLLM output hidden
llmux --config config.json

# Show vLLM output
RUST_LOG=info,vllm=debug llmux --config config.json

# --verbose includes vLLM output automatically
llmux --config config.json --verbose

ANSI color codes are stripped from vLLM output. The NO_COLOR=1 environment variable is also set on spawned vLLM processes.

Policy options

Field                  Default         Description
policy_type            "fifo"          Switching policy
request_timeout_secs   60              Request timeout in seconds
drain_before_switch    true            Wait for in-flight requests before sleeping
eviction               retain + stop   Default eviction policy

Validation

llmux includes a built-in validation tool that tests sleep/wake cycles against a running model, verifying GPU memory is freed and responses are deterministic after wake:

llmux --config config.json --validate qwen-14b \
  --policies offload+keep_running,discard+keep_running,retain+cuda_suspend \
  --verbose

Output:

Eviction          Sleep (s)   Wake (s)   GPU Before    GPU After     GPU Wake   Response   Pass
------------------------------------------------------------------------------------------------
Offload+KeepRun        35.9        1.2      45899 MiB       1341 MiB      44033 MiB      match     OK
Discard+KeepRun         0.3        8.2      44033 MiB       1341 MiB      44033 MiB      match     OK

Result: ALL PASSED

Checkpoint management

Pre-create CRIU checkpoints for fast model switching:

# Create checkpoint (start model, warm up, CRIU dump to disk)
llmux --config config.json --checkpoint qwen-14b

# Use a different weight strategy (affects CRIU image size)
llmux --config config.json --checkpoint qwen-14b --eviction retain+checkpoint

# Skip warmup inference before checkpointing
llmux --config config.json --checkpoint qwen-14b --no-warmup

# Restore detached (CRIU restore, health check, exit — process keeps running)
llmux --config config.json --restore-detached qwen-14b

The default eviction for --checkpoint is discard+checkpoint, which produces small CRIU images (weights are reloaded from the HF cache on restore). Use retain+checkpoint or offload+checkpoint for larger images that restore faster (weights already in the snapshot).

After --restore-detached, the vLLM process continues running on its configured port. This is useful for testing checkpoints or running a single model without the daemon.
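
To confirm a detached restore worked, you can query the vLLM server directly on its configured port, for example:

# The restored vLLM instance serves its own OpenAI-compatible API on the model's port
curl http://localhost:8001/v1/models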

Lazy restore via config

Instead of restoring manually, set checkpoint_path in your model config so the daemon restores from checkpoint on first request:

{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "eviction": { "weights": "discard", "process": "checkpoint" },
      "checkpoint_path": "/tmp/llmux-checkpoints/qwen-14b/images"
    }
  }
}

When the daemon starts, models with checkpoint_path are initialized in checkpointed state. The first request triggers a CRIU restore instead of a cold start — typically 3-5x faster.

keep_images must be true (the default) in the checkpoint config when using checkpoint_path, since the images must persist across daemon restarts.
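
For reference, the relevant checkpoint settings for this setup might look like the following (keep_images is spelled out even though true is the default):

{
  "checkpoint": {
    "images_dir": "/tmp/llmux-checkpoints",
    "keep_images": true
  }
}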

Docker Compose

Basic setup

services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

With cuda-checkpoint/CRIU

services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

With onwards (API key auth)

For production, put onwards in front for API key authentication:

services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  onwards:
    image: ghcr.io/doublewordai/onwards:latest
    command: ["--targets", "/etc/onwards/targets.json"]
    volumes:
      - ./targets.json:/etc/onwards/targets.json:ro
    ports:
      - "3000:3000"

Where targets.json maps model names to llmux with API keys:

{
  "targets": {
    "qwen-14b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "qwen-14b"
    },
    "gemma-12b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "gemma-12b"
    }
  }
}

Tensor parallelism

All eviction strategies work with TP>1:

Strategy       TP>1   Notes
offload        Yes    vLLM manages NCCL teardown/rebuild internally
discard        Yes    Same — vLLM handles it
cuda_suspend   Yes    llmux tears down NCCL before cuda-checkpoint, rebuilds after restore
checkpoint     Yes    Same — NCCL teardown before checkpoint, rebuild after restore
stop           Yes    Kill + cold restart always works

For cuda_suspend and checkpoint, llmux uses vLLM's /collective_rpc endpoint to call suspend_nccl (before cuda-checkpoint) and resume_nccl (after restore) on all TP workers. This tears down NCCL IPC handles that cuda-checkpoint cannot checkpoint, then rebuilds them after CUDA state is restored.

This requires patched vLLM with suspend_nccl/resume_nccl support. The Docker image includes these patches. For bare-metal installs, apply patches/nccl-suspend-resume-v0.15.1.patch to your vLLM installation.
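
For a source-based vLLM install, applying the patch might look like this sketch (paths are placeholders for your vLLM checkout and the llmux repository):

# Illustrative: apply the NCCL suspend/resume patch to a vLLM source checkout
cd /path/to/vllm
git apply /path/to/llmux/patches/nccl-suspend-resume-v0.15.1.patch
# then rebuild/reinstall vLLM in the environment llmux launches it from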

Known issues

The --validate flag exists specifically to catch these kinds of problems before they hit production.

vLLM v0.14+ sleep regression

Sleep mode (offload/discard) has a regression where weights are not discarded from GPU memory (vllm#32714). The Docker image includes a patch (fix-sleep-mode-v0.15.1.patch) that fixes this. For bare-metal installs, apply the patch or use cuda_suspend/stop instead.

vLLM v0.13.0

  • discard reload fails for openai/gpt-oss-20b: the MXFP4 weight loader crashes on wake with default_weight_loader() got an unexpected keyword argument 'weight_name'. offload works fine (19.6s sleep, 0.6s wake), so use offload for this model.
  • offload and discard both work correctly for Qwen/Qwen3-14B and google/gemma-3-12b-it.

NVIDIA driver requirements

The Docker image uses vLLM v0.15.1 which requires CUDA 12.9 and nvidia-driver-580 or later. Check your driver version with nvidia-smi.
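
For example:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader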

Compatibility

  • offload/discard: Works with vLLM v0.13.x out of the box. Works with v0.15.1 with the sleep fix patch (included in Docker image).
  • cuda_suspend/checkpoint: Works with vLLM v0.13.x+ with NCCL patches (included in Docker image). Requires cuda-checkpoint and CRIU (included in Docker image).
  • TP>1 with cuda_suspend/checkpoint: Requires vLLM NCCL suspend/resume patches (included in Docker image) plus --enforce-eager and --disable-custom-all-reduce flags.

License

MIT