# llmux

[![Crates.io](https://img.shields.io/crates/v/llmux)](https://crates.io/crates/llmux)
[![GitHub](https://img.shields.io/badge/GitHub-doublewordai%2Fllmux-blue)](https://github.com/doublewordai/llmux)

LLM multiplexer for vLLM. Host multiple models on a single GPU, switching
between them on demand using vLLM's sleep/wake API.

When a request arrives for a model that isn't currently loaded, llmux puts the
active model to sleep (freeing GPU memory) and wakes the requested model. The
OpenAI-compatible API stays up throughout - clients just change the `model`
field.

## How it works

```
                    Client requests
                         |
                    +---------+
                    |  llmux  |   port 3000 (OpenAI-compatible)
                    +---------+
                    /         \
            [vLLM:8001]    [vLLM:8002]
             (active)       (sleeping)
```

llmux spawns vLLM processes lazily on first request and manages their
lifecycle. Only one model is active at a time - the rest are evicted using
a configurable two-axis policy.

### Eviction policy

When a model is evicted, llmux applies two strategies in sequence:

1. **Weight strategy** — what vLLM does with model weights (via the sleep API)
2. **Process strategy** — what happens to the OS process afterward

#### Weight strategy

Applied first, via vLLM's sleep endpoint. Controls what happens to weights and KV cache:

| Strategy | Description |
|----------|-------------|
| `retain` | Nothing happens. Weights and KV cache stay on GPU. |
| `offload` | Weights copied to pinned CPU RAM. KV cache and CUDA graphs discarded. Frees most GPU memory but uses significant host RAM. |
| `discard` | Weights dropped entirely (reloaded from disk on wake). KV cache and CUDA graphs discarded. Frees most GPU memory with no CPU RAM cost. |

Both `offload` and `discard` leave a small CUDA context (~500 MiB) on the GPU.

#### Process strategy

Applied second, to the process left behind by the weight strategy:

| Strategy | Description |
|----------|-------------|
| `keep_running` | Process stays as-is. Fast, but whatever's still on GPU stays there. |
| `cuda_suspend` | Snapshots remaining VRAM to host RAM via `cuda-checkpoint`, freeing 100% of GPU memory. Process stays alive. |
| `checkpoint` | CRIU dumps the entire process (including host RAM) to disk, then kills it. Frees 100% of GPU and CPU memory. |
| `stop` | Kills the process. Everything is lost. |

#### How they interact

The weight strategy determines what's on GPU vs. CPU before the process strategy runs.
This affects speed, memory usage, and what survives a wake cycle:

| | `keep_running` | `cuda_suspend` | `checkpoint` | `stop` |
|---|---|---|---|---|
| **`retain`** | No-op (model stays loaded) | Full VRAM snapshot to CPU — weights, KV cache, CUDA graphs all preserved | Full process checkpoint to disk — large image (includes VRAM) | Everything lost |
| **`offload`** | Weights on CPU, KV lost, ~500 MiB CUDA context remains on GPU | Remaining CUDA context → CPU, weights already on CPU | Large CRIU image (weights in host RAM get written to disk) | Everything lost |
| **`discard`** | Weights gone, KV lost, ~500 MiB CUDA context remains on GPU | Remaining CUDA context → CPU, weights gone | Small CRIU image (no weights — reloaded from HF cache on wake) | Everything lost |

Common choices:

- **`offload` + `keep_running`** — Fast wake (weights already in RAM), but holds CPU memory and ~500 MiB GPU
- **`discard` + `keep_running`** — No CPU RAM cost, but slow wake (reload from disk) and ~500 MiB GPU
- **`retain` + `cuda_suspend`** — Frees 100% GPU, full state preserved, but holds all VRAM in CPU RAM
- **`discard` + `checkpoint`** — Frees 100% GPU *and* CPU, small CRIU image; wake reloads weights from disk but restores KV cache, CUDA graphs, and warmed allocator from checkpoint
- **`offload` + `checkpoint`** — Like above but CRIU image is large (includes weights); wake is faster (no disk reload) but checkpoint is slower and uses more disk

If eviction fails, llmux automatically escalates to `stop` to guarantee GPU
memory is freed.
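
As a sketch, this is how the `retain` + `cuda_suspend` and `discard` + `checkpoint` choices are expressed per model (names and ports are illustrative; both strategies also require the `checkpoint` section described under Configuration below):

```json
{
  "models": {
    "llama-8b": {
      "model_path": "NousResearch/Meta-Llama-3.1-8B-Instruct",
      "port": 8001,
      "eviction": { "weights": "retain", "process": "cuda_suspend" }
    },
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8002,
      "eviction": { "weights": "discard", "process": "checkpoint" }
    }
  }
}
```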

#### CUDA suspend

Uses `cuda-checkpoint --toggle` to suspend CUDA state and copy VRAM to host
RAM. The process stays alive — no serialization, no CRIU. Wake is just another
toggle to copy state back to GPU.
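
For illustration only, the single-GPU mechanism llmux drives looks roughly like this by hand (`<PID>` is the vLLM worker process; llmux handles this, plus NCCL coordination for TP>1, automatically):

```bash
cuda-checkpoint --toggle --pid <PID>   # suspend: VRAM copied to host RAM, GPU memory freed
nvidia-smi                             # the process should no longer hold GPU memory
cuda-checkpoint --toggle --pid <PID>   # resume: state copied back to the GPU
```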

Like `offload`, this holds state in CPU RAM. Unlike `offload`, it frees 100% of
GPU memory (`offload` keeps ~500 MiB for CUDA context) and preserves full state.

For TP>1, llmux coordinates NCCL teardown before checkpoint and rebuild after
restore. This requires patched vLLM with `suspend_nccl`/`resume_nccl` support
(included in the Docker image).

**Requirements:**
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `cuda-checkpoint`)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`

#### CRIU checkpoint

CRIU checkpointing uses `cuda-checkpoint` and `criu` to snapshot the entire
vLLM process tree to disk, then kill it. On restore, CRIU brings the process
back with all state intact — including GPU VRAM contents, KV cache, CUDA
graphs, and the warmed memory allocator. First inference after restore is ~30ms
(no warmup needed).

**Requirements:**
- CRIU 4.x with the CUDA plugin (`libcuda_plugin.so`) (included in Docker image)
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `criu` and `cuda-checkpoint`)
- vLLM process must not use `io_uring` or `libuv` (llmux disables these automatically)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`

**Trade-offs vs offload:**
- Slower sleep and wake than `offload`, but no CPU RAM cost (state lives on disk) and full state preservation
- Faster first inference after wake (no warmup needed)

## Quickstart

Create a `config.json`:

```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "eviction": { "weights": "offload", "process": "keep_running" }
    },
    "gemma-12b": {
      "model_path": "google/gemma-3-12b-it",
      "port": 8002,
      "eviction": { "weights": "discard", "process": "keep_running" }
    }
  },
  "port": 3000
}
```

### With Docker (recommended)

The Docker image bundles vLLM v0.15.1 with patches for NCCL suspend/resume
and sleep mode fixes:

```bash
docker run --gpus all --init \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```

For `cuda_suspend` or `checkpoint` process strategies, additional flags are required:

```bash
docker run --gpus all \
  --privileged \
  --pid=host \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -v /tmp/llmux-checkpoints:/tmp/llmux-checkpoints \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```

The extra flags are needed because:
- `--privileged` — CRIU requires broad namespace and ptrace access
- `--pid=host` — cuda-checkpoint needs to ptrace vLLM worker PIDs
- `--ipc=host` — NCCL uses shared memory for inter-GPU communication
- `-v /tmp/llmux-checkpoints:...` — CRIU checkpoints can be tens of GB; mount a host volume to avoid filling the container filesystem

**Important:** Do NOT use `--init` with CRIU (`checkpoint` process strategy). Docker's init process (tini)
redirects stdin to the host's `/dev/null`, whose mount ID is invisible inside the container.
CRIU dump fails with "Can't lookup mount=N for fd=0 path=/dev/null".

### From source

Requires vLLM installed and available as `vllm` on PATH:

```bash
cargo install llmux
llmux --config config.json
```

For `cuda_suspend` or `checkpoint` process strategies, you also need `cuda-checkpoint`
and `criu` (with the CUDA plugin) installed, and must either run llmux as root or
configure passwordless `sudo` for both tools.
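
A rough pre-flight check (only `vllm` is needed for `offload`/`discard`; the other two tools matter only for the checkpoint strategies):

```bash
command -v vllm criu cuda-checkpoint               # all three should resolve on PATH
python -c "import vllm; print(vllm.__version__)"   # confirm the vLLM version
sudo criu check                                    # verify kernel support for CRIU
```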

### Send requests

```bash
# First request starts vLLM for qwen-14b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

# Switching: sleeps qwen-14b, starts gemma-12b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-12b", "messages": [{"role": "user", "content": "Hello"}]}'
```
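
Since llmux proxies vLLM's OpenAI-compatible API, standard request options such as streaming should pass straight through; a quick sketch:

```bash
curl -N http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "stream": true, "messages": [{"role": "user", "content": "Hello"}]}'
```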

## Configuration

### Model options

| Field | Default | Description |
|-------|---------|-------------|
| `model_path` | *required* | HuggingFace model ID or local path |
| `port` | *required* | Port for this model's vLLM instance |
| `eviction` | `retain` + `stop` | Eviction policy (see below) |
| `extra_args` | `[]` | Additional vLLM CLI arguments |
| `checkpoint_path` | *none* | Path to CRIU checkpoint images for lazy restore on first request |

The `eviction` field takes an object with `weights` and `process` keys:

```json
{
  "eviction": { "weights": "offload", "process": "keep_running" }
}
```

All vLLM-specific flags (e.g. `--gpu-memory-utilization`, `--tensor-parallel-size`,
`--dtype`) should be passed via `extra_args`:

```json
{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "extra_args": ["--gpu-memory-utilization", "0.9", "--tensor-parallel-size", "2"]
}
```

#### Tensor parallelism with cuda_suspend/checkpoint

When using `cuda_suspend` or `checkpoint` with TP>1, you **must** include `--enforce-eager`
and `--disable-custom-all-reduce` in `extra_args`:

```json
{
  "model_path": "NousResearch/Meta-Llama-3.1-8B-Instruct",
  "port": 8001,
  "eviction": { "weights": "retain", "process": "cuda_suspend" },
  "extra_args": [
    "--tensor-parallel-size", "2",
    "--enforce-eager",
    "--disable-custom-all-reduce",
    "--gpu-memory-utilization", "0.85"
  ]
}
```

- `--enforce-eager` — CUDA graphs hold stale NCCL handles and crash on resume
- `--disable-custom-all-reduce` — CustomAllReduce IPC buffers cannot survive cuda-checkpoint

llmux validates the config at startup and warns if these flags are missing.

### Top-level options

| Field | Default | Description |
|-------|---------|-------------|
| `port` | `3000` | Proxy listen port |
| `metrics_port` | `9090` | Prometheus metrics port (0 to disable) |
| `vllm_command` | `"vllm"` | vLLM binary path |
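
For example, a top-level config that moves the proxy port, disables metrics, and points at a vLLM binary inside a virtualenv (paths illustrative) could look like:

```json
{
  "port": 8080,
  "metrics_port": 0,
  "vllm_command": "/opt/venvs/vllm/bin/vllm",
  "models": {
    "qwen-14b": { "model_path": "Qwen/Qwen3-14B", "port": 8001 }
  }
}
```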

### Checkpoint config

To use `cuda_suspend` or `checkpoint` process strategies, add a `checkpoint` section:

```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "eviction": { "weights": "retain", "process": "cuda_suspend" }
    }
  },
  "checkpoint": {
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}
```

For CRIU checkpointing, the full config is:

```json
{
  "checkpoint": {
    "criu_path": "criu",
    "cuda_plugin_dir": "/usr/lib/criu/",
    "images_dir": "/tmp/llmux-checkpoints",
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}
```

| Field | Default | Description |
|-------|---------|-------------|
| `criu_path` | `"criu"` | Path to the criu binary |
| `cuda_plugin_dir` | `"/usr/lib/criu/"` | Directory containing `libcuda_plugin.so` |
| `images_dir` | `"/tmp/llmux-checkpoints"` | Base directory for checkpoint images |
| `cuda_checkpoint_path` | `"cuda-checkpoint"` | Path to the cuda-checkpoint utility |
| `keep_images` | `true` | Whether checkpoint images are kept on disk (must be `true` when using `checkpoint_path`) |

### vLLM logging

vLLM process output (stdout/stderr) is always captured and forwarded to the
`vllm` tracing target at `debug` level. Use `RUST_LOG` to control visibility:

```bash
# Default: only llmux info logs, vLLM output hidden
llmux --config config.json

# Show vLLM output
RUST_LOG=info,vllm=debug llmux --config config.json

# --verbose includes vLLM output automatically
llmux --config config.json --verbose
```

ANSI color codes are stripped from vLLM output. The `NO_COLOR=1` environment
variable is also set on spawned vLLM processes.

### Policy options

| Field | Default | Description |
|-------|---------|-------------|
| `policy_type` | `"fifo"` | Switching policy |
| `request_timeout_secs` | `60` | Request timeout |
| `drain_before_switch` | `true` | Wait for in-flight requests before sleeping |
| `eviction` | `retain` + `stop` | Default eviction policy |

## Validation

llmux includes a built-in validation tool that tests sleep/wake cycles
against a running model, verifying GPU memory is freed and responses are
deterministic after wake:

```bash
llmux --config config.json --validate qwen-14b \
  --policies offload+keep_running,discard+keep_running,retain+cuda_suspend \
  --verbose
```

Output:

```
Eviction          Sleep (s)   Wake (s)   GPU Before    GPU After     GPU Wake   Response   Pass
------------------------------------------------------------------------------------------------
Offload+KeepRun        35.9        1.2      45899 MiB       1341 MiB      44033 MiB      match     OK
Discard+KeepRun         0.3        8.2      44033 MiB       1341 MiB      44033 MiB      match     OK

Result: ALL PASSED
```

## Checkpoint management

Pre-create CRIU checkpoints for fast model switching:

```bash
# Create checkpoint (start model, warm up, CRIU dump to disk)
llmux --config config.json --checkpoint qwen-14b

# Use a different weight strategy (affects CRIU image size)
llmux --config config.json --checkpoint qwen-14b --eviction retain+checkpoint

# Skip warmup inference before checkpointing
llmux --config config.json --checkpoint qwen-14b --no-warmup

# Restore detached (CRIU restore, health check, exit — process keeps running)
llmux --config config.json --restore-detached qwen-14b
```

The default eviction for `--checkpoint` is `discard+checkpoint`, which produces
small CRIU images (weights are reloaded from the HF cache on restore). Use
`retain+checkpoint` or `offload+checkpoint` for larger images that restore
faster (weights already in the snapshot).

After `--restore-detached`, the vLLM process continues running on its configured
port. This is useful for testing checkpoints or running a single model without
the daemon.
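
For example, after a detached restore the vLLM instance can be queried directly on its configured port. Note that vLLM's served model name defaults to the `model_path` rather than the llmux alias unless you set `--served-model-name` in `extra_args`:

```bash
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-14B", "messages": [{"role": "user", "content": "Hello"}]}'
```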

### Lazy restore via config

Instead of restoring manually, set `checkpoint_path` in your model config so
the daemon restores from checkpoint on first request:

```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "eviction": { "weights": "discard", "process": "checkpoint" },
      "checkpoint_path": "/tmp/llmux-checkpoints/qwen-14b/images"
    }
  }
}
```

When the daemon starts, models with `checkpoint_path` are initialized in
checkpointed state. The first request triggers a CRIU restore instead of a
cold start — typically 3-5x faster.

`keep_images` must be `true` (the default) in the `checkpoint` config when
using `checkpoint_path`, since the images must persist across daemon restarts.
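
A typical end-to-end flow, assuming the default `images_dir` and the per-model layout implied by the example path above:

```bash
# 1. Build the checkpoint (cold start, warmup, CRIU dump, exit)
llmux --config config.json --checkpoint qwen-14b

# 2. Confirm the images are where checkpoint_path points
ls /tmp/llmux-checkpoints/qwen-14b/images

# 3. Start the daemon; the first qwen-14b request restores from the checkpoint
llmux --config config.json
```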

## Docker Compose

### Basic setup

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```

### With cuda-checkpoint/CRIU

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true  # remove if using the checkpoint (CRIU) process strategy (see the note above)
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
      - /tmp/llmux-checkpoints:/tmp/llmux-checkpoints
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```

### With onwards (API key auth)

For production, put [onwards](https://github.com/doublewordai/onwards) in
front for API key authentication:

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true  # remove if using the checkpoint (CRIU) process strategy (see the note above)
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  onwards:
    image: ghcr.io/doublewordai/onwards:latest
    command: ["--targets", "/etc/onwards/targets.json"]
    volumes:
      - ./targets.json:/etc/onwards/targets.json:ro
    ports:
      - "3000:3000"
```

Where `targets.json` maps model names to llmux with API keys:

```json
{
  "targets": {
    "qwen-14b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "qwen-14b"
    },
    "gemma-12b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "gemma-12b"
    }
  }
}
```
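
Clients then send requests to onwards rather than llmux, presenting one of the configured keys (assuming onwards accepts the standard `Authorization: Bearer` header used by OpenAI-compatible gateways):

```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'
```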

## Tensor parallelism

All eviction strategies work with TP>1:

| Strategy | TP>1 | Notes |
|----------|------|-------|
| `offload` | Yes | vLLM manages NCCL teardown/rebuild internally |
| `discard` | Yes | Same — vLLM handles it |
| `cuda_suspend` | Yes | llmux tears down NCCL before cuda-checkpoint, rebuilds after restore |
| `checkpoint` | Yes | Same — NCCL teardown before checkpoint, rebuild after restore |
| `stop` | Yes | Kill + cold restart always works |

For `cuda_suspend` and `checkpoint`, llmux uses vLLM's `/collective_rpc` endpoint to call
`suspend_nccl` (before cuda-checkpoint) and `resume_nccl` (after restore)
on all TP workers. This tears down NCCL IPC handles that cuda-checkpoint
cannot checkpoint, then rebuilds them after CUDA state is restored.

This requires patched vLLM with `suspend_nccl`/`resume_nccl` support. The
Docker image includes these patches. For bare-metal installs, apply
`patches/nccl-suspend-resume-v0.15.1.patch` to your vLLM installation.
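
A bare-metal application of the patch might look like the following; the correct `-p` strip level depends on how the patch is rooted, so adjust if `patch` reports missing files:

```bash
# Locate the installed vLLM package and apply the NCCL suspend/resume patch in place
VLLM_DIR=$(python -c "import vllm, os; print(os.path.dirname(vllm.__file__))")
cd "$VLLM_DIR" && patch -p2 < /path/to/llmux/patches/nccl-suspend-resume-v0.15.1.patch
```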

## Known issues

These are the kinds of problems the `--validate` flag exists to catch before they
hit production.

### vLLM v0.14+ sleep regression

Sleep mode (`offload`/`discard`) has a regression where weights are not discarded from GPU
memory ([vllm#32714](https://github.com/vllm-project/vllm/issues/32714)).
The Docker image includes a patch (`fix-sleep-mode-v0.15.1.patch`) that fixes
this. For bare-metal installs, apply the patch or use `cuda_suspend`/`stop` instead.

### vLLM v0.13.0

- **`openai/gpt-oss-20b` `discard` reload fails.** The MXFP4 weight loader crashes on
  wake with `default_weight_loader() got an unexpected keyword argument
  'weight_name'`. `offload` works fine (19.6s sleep, 0.6s wake). Use `offload` for this
  model.
- `offload` and `discard` both work correctly for `Qwen/Qwen3-14B` and
  `google/gemma-3-12b-it`.

### NVIDIA driver requirements

The Docker image uses vLLM v0.15.1 which requires CUDA 12.9 and
nvidia-driver-580 or later. Check your driver version with `nvidia-smi`.
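
For a quick check of just the driver version:

```bash
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```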

## Compatibility

- **`offload`/`discard`:** Works with vLLM v0.13.x out of the box. Works with v0.15.1 with the sleep fix patch (included in Docker image).
- **`cuda_suspend`/`checkpoint`:** Works with vLLM v0.13.x+ with NCCL patches (included in Docker image). Requires `cuda-checkpoint` and CRIU (included in Docker image).
- **TP>1 with `cuda_suspend`/`checkpoint`:** Requires vLLM NCCL suspend/resume patches (included in Docker image) plus `--enforce-eager` and `--disable-custom-all-reduce` flags.

## License

MIT