# llmux

[![Crates.io](https://img.shields.io/crates/v/llmux)](https://crates.io/crates/llmux)
[![GitHub](https://img.shields.io/badge/GitHub-doublewordai%2Fllmux-blue)](https://github.com/doublewordai/llmux)

LLM multiplexer for vLLM. Host multiple models on a single GPU, switching
between them on demand using vLLM's sleep/wake API.

When a request arrives for a model that isn't currently loaded, llmux puts the
active model to sleep (freeing GPU memory) and wakes the requested model. The
OpenAI-compatible API stays up throughout - clients just change the `model`
field.

## How it works

```
                    Client requests
                         |
                    +---------+
                    |  llmux  |   port 3000 (OpenAI-compatible)
                    +---------+
                    /         \
            [vLLM:8001]    [vLLM:8002]
             (active)       (sleeping)
```

llmux spawns vLLM processes lazily on first request and manages their
lifecycle. Only one model is active at a time - the rest are sleeping
(weights offloaded to CPU or discarded) or stopped entirely.

### Sleep levels

| Level | Sleep | Wake | GPU freed | CPU RAM | State preserved | Use case |
|-------|-------|------|-----------|---------|-----------------|----------|
| **L1** | Slow (offload to CPU) | Fast (~1s) | Most | High (holds weights) | Partial | Model you expect to return to soon |
| **L2** | Fast (~1s) | Slow (reload from disk) | Most | None | No (KV cache, CUDA graphs lost) | Model you may not need for a while |
| **L3** (CUDA suspend) | Fast (~3s, ~15s at TP=2) | Fast (~3s, ~7s at TP=2) | All (100%) | High (holds VRAM) | Full | Like L1, but frees 100% GPU |
| **L4** (CRIU) | ~27s (checkpoint to disk) | ~15s (restore) | All (100%) | None | Full (KV cache, CUDA graphs, allocator) | Many models; lowest resource usage while sleeping |
| **L5** | Kill process | Cold start | All | None | No | Fallback / cleanup |

If L1-L4 sleep fails, llmux automatically escalates to L5 (kill) to guarantee
GPU memory is freed.

#### CUDA suspend (level 3)

Uses `cuda-checkpoint --toggle` to suspend CUDA state and copy VRAM to host
RAM. The process stays alive — no serialization, no CRIU. Wake is just another
toggle to copy state back to GPU.
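
For intuition, this is roughly the operation llmux drives for L3; a minimal
manual sketch using the same utility (the PID is a placeholder, and with TP>1
the NCCL teardown described below must happen first):

```bash
# Suspend: copy CUDA state (including VRAM) to host RAM; the process stays alive
sudo cuda-checkpoint --toggle --pid <vllm_pid>

# GPU memory is now free; toggling again restores the state to the GPU
sudo cuda-checkpoint --toggle --pid <vllm_pid>
```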

Like L1, this holds state in CPU RAM. Unlike L1, it frees 100% of GPU memory
(L1 keeps ~500 MiB for CUDA context) and preserves full state.

For TP>1, llmux coordinates NCCL teardown before checkpoint and rebuild after
restore. This requires patched vLLM with `suspend_nccl`/`resume_nccl` support
(included in the Docker image).

**Requirements:**
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `cuda-checkpoint`)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`

#### CRIU checkpoint (level 4)

CRIU checkpointing uses `cuda-checkpoint` and `criu` to snapshot the entire
vLLM process tree to disk, then kill it. On restore, CRIU brings the process
back with all state intact — including GPU VRAM contents, KV cache, CUDA
graphs, and the warmed memory allocator. First inference after restore is ~30ms
(no warmup needed).

**Requirements:**
- CRIU 4.x with the CUDA plugin (`libcuda_plugin.so`) (included in Docker image)
- `cuda-checkpoint` utility (included in Docker image)
- Root access (or passwordless `sudo` for `criu` and `cuda-checkpoint`)
- vLLM process must not use `io_uring` or `libuv` (set automatically)
- For TP>1: `--enforce-eager` and `--disable-custom-all-reduce` in `extra_args`

**Trade-offs vs L1/L2:**
- vs L1: slower sleep and wake, but no CPU RAM cost and full state preservation
- vs L2: slower sleep, but full state preservation means no warmup is needed after wake
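
On a bare-metal host, a quick way to sanity-check the L4 prerequisites (paths
match the defaults used in the checkpoint config further below):

```bash
# Verify CRIU is installed and the kernel supports checkpoint/restore
sudo criu check
criu --version                      # should report 4.x

# Verify the CUDA plugin and the cuda-checkpoint utility are present
ls /usr/lib/criu/libcuda_plugin.so
command -v cuda-checkpoint
```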

## Quickstart

Create a `config.json`:

```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 1
    },
    "gemma-12b": {
      "model_path": "google/gemma-3-12b-it",
      "port": 8002,
      "sleep_level": 2
    }
  },
  "port": 3000
}
```

### With Docker (recommended)

The Docker image bundles vLLM v0.15.1 with patches for NCCL suspend/resume
and sleep mode fixes:

```bash
docker run --gpus all --init \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```

For sleep levels 3 and 4 (cuda-checkpoint/CRIU), additional flags are required:

```bash
docker run --gpus all \
  --privileged \
  --pid=host \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./config.json:/etc/llmux/config.json:ro \
  -v /tmp/llmux-checkpoints:/tmp/llmux-checkpoints \
  -p 3000:3000 \
  ghcr.io/doublewordai/llmux:latest
```

The extra flags are needed because:
- `--privileged` — CRIU requires broad namespace and ptrace access
- `--pid=host` — cuda-checkpoint needs to ptrace vLLM worker PIDs
- `--ipc=host` — NCCL uses shared memory for inter-GPU communication
- `-v /tmp/llmux-checkpoints:...` — CRIU checkpoints can be tens of GB; mount a host volume to avoid filling the container filesystem

**Important:** Do NOT use `--init` with CRIU (sleep level 4). Docker's init process (tini)
redirects stdin to the host's `/dev/null`, whose mount ID is invisible inside the container.
CRIU dump fails with "Can't lookup mount=N for fd=0 path=/dev/null".

### From source

Requires vLLM installed and available as `vllm` on PATH:

```bash
cargo install llmux
llmux --config config.json
```

For sleep levels 3 and 4, you also need `cuda-checkpoint` and `criu` (with the
CUDA plugin) installed, and you must either run as root or configure
passwordless sudo.
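
If you prefer not to run llmux as root, a sudoers drop-in along these lines
grants passwordless access to just the two tools (the username and binary
paths are illustrative; adjust them for your install):

```bash
# Illustrative sudoers entry; adjust the user and binary paths to your system
echo 'llmux-user ALL=(root) NOPASSWD: /usr/local/bin/criu, /usr/local/bin/cuda-checkpoint' | \
  sudo tee /etc/sudoers.d/llmux-checkpoint
sudo chmod 0440 /etc/sudoers.d/llmux-checkpoint
```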

### Send requests

```bash
# First request starts vLLM for qwen-14b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

# Switching: sleeps qwen-14b, starts gemma-12b
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-12b", "messages": [{"role": "user", "content": "Hello"}]}'
```
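
The proxy is OpenAI-compatible, so standard request options pass through to
vLLM unchanged. For example, streaming (assuming the usual OpenAI `stream`
semantics):

```bash
curl -N http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-14b", "stream": true, "messages": [{"role": "user", "content": "Hello"}]}'
```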

## Configuration

### Model options

| Field | Default | Description |
|-------|---------|-------------|
| `model_path` | *required* | HuggingFace model ID or local path |
| `port` | *required* | Port for this model's vLLM instance |
| `sleep_level` | `5` | Sleep level (1-2: vLLM, 3: CUDA suspend, 4: CRIU, 5: stop) |
| `extra_args` | `[]` | Additional vLLM CLI arguments |

All vLLM-specific flags (e.g. `--gpu-memory-utilization`, `--tensor-parallel-size`,
`--dtype`) should be passed via `extra_args`:

```json
{
  "model_path": "Qwen/Qwen3-14B",
  "port": 8001,
  "extra_args": ["--gpu-memory-utilization", "0.9", "--tensor-parallel-size", "2"]
}
```

#### Tensor parallelism with L3/L4

When using sleep levels 3 or 4 with TP>1, you **must** include `--enforce-eager`
and `--disable-custom-all-reduce` in `extra_args`:

```json
{
  "model_path": "NousResearch/Meta-Llama-3.1-8B-Instruct",
  "port": 8001,
  "sleep_level": 3,
  "extra_args": [
    "--tensor-parallel-size", "2",
    "--enforce-eager",
    "--disable-custom-all-reduce",
    "--gpu-memory-utilization", "0.85"
  ]
}
```

- `--enforce-eager` — CUDA graphs hold stale NCCL handles and crash on resume
- `--disable-custom-all-reduce` — CustomAllReduce IPC buffers cannot survive cuda-checkpoint

llmux validates the config at startup and warns if these flags are missing.

### Top-level options

| Field | Default | Description |
|-------|---------|-------------|
| `port` | `3000` | Proxy listen port |
| `metrics_port` | `9090` | Prometheus metrics port (0 to disable) |
| `vllm_command` | `"vllm"` | vLLM binary path |

### Checkpoint config (for sleep levels 3 and 4)

To use cuda-checkpoint or CRIU, add a `checkpoint` section to your config:

```json
{
  "models": {
    "qwen-14b": {
      "model_path": "Qwen/Qwen3-14B",
      "port": 8001,
      "sleep_level": 3
    }
  },
  "checkpoint": {
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}
```

For CRIU checkpointing (level 4), the full config is:

```json
{
  "checkpoint": {
    "criu_path": "criu",
    "cuda_plugin_dir": "/usr/lib/criu/",
    "images_dir": "/tmp/llmux-checkpoints",
    "cuda_checkpoint_path": "cuda-checkpoint"
  }
}
```

| Field | Default | Description |
|-------|---------|-------------|
| `criu_path` | `"criu"` | Path to the criu binary |
| `cuda_plugin_dir` | `"/usr/lib/criu/"` | Directory containing `libcuda_plugin.so` |
| `images_dir` | `"/tmp/llmux-checkpoints"` | Base directory for checkpoint images |
| `cuda_checkpoint_path` | `"cuda-checkpoint"` | Path to the cuda-checkpoint utility |

### vLLM logging

vLLM process output (stdout/stderr) is always captured and forwarded to the
`vllm` tracing target at `debug` level. Use `RUST_LOG` to control visibility:

```bash
# Default: only llmux info logs, vLLM output hidden
llmux --config config.json

# Show vLLM output
RUST_LOG=info,vllm=debug llmux --config config.json

# --verbose includes vLLM output automatically
llmux --config config.json --verbose
```

ANSI color codes are stripped from vLLM output. The `NO_COLOR=1` environment
variable is also set on spawned vLLM processes.

### Policy options

| Field | Default | Description |
|-------|---------|-------------|
| `policy_type` | `"fifo"` | Switching policy |
| `request_timeout_secs` | `60` | Request timeout |
| `drain_before_switch` | `true` | Wait for in-flight requests before sleeping |
| `sleep_level` | `5` | Default sleep level for policy |

## Validation

llmux includes a built-in validation tool that tests sleep/wake cycles
against a running model, verifying that GPU memory is freed during sleep and
that responses after wake match those from before the cycle:

```bash
llmux --config config.json --validate qwen-14b --levels 1,2,3,4 --verbose
```

Output:

```
Level     Sleep (s)   Wake (s)   GPU Before    GPU After     GPU Wake   Response   Pass
----------------------------------------------------------------------------------------
L1             35.9        1.2      45899 MiB       1341 MiB      44033 MiB      match     OK
L2              0.3        8.2      44033 MiB       1341 MiB      44033 MiB      match     OK

Result: ALL PASSED
```
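
You can also cross-check the GPU memory figures by hand while a model is
sleeping:

```bash
# Per-GPU memory in use, as reported by the driver
nvidia-smi --query-gpu=memory.used --format=csv,noheader
```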

## Docker Compose

### Basic setup

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```

### With cuda-checkpoint/CRIU (sleep levels 3 and 4)

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    ports:
      - "3000:3000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```

### With onwards (API key auth)

For production, put [onwards](https://github.com/doublewordai/onwards) in
front for API key authentication:

```yaml
services:
  llmux:
    image: ghcr.io/doublewordai/llmux:latest
    init: true
    command: ["--config", "/etc/llmux/config.json"]
    pid: host
    ipc: host
    cap_add:
      - SYS_PTRACE
      - CHECKPOINT_RESTORE
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./config.json:/etc/llmux/config.json:ro
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  onwards:
    image: ghcr.io/doublewordai/onwards:latest
    command: ["--targets", "/etc/onwards/targets.json"]
    volumes:
      - ./targets.json:/etc/onwards/targets.json:ro
    ports:
      - "3000:3000"
```

Where `targets.json` maps model names to llmux with API keys:

```json
{
  "targets": {
    "qwen-14b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "qwen-14b"
    },
    "gemma-12b": {
      "url": "http://llmux:3000/v1",
      "keys": ["sk-your-api-key"],
      "onwards_model": "gemma-12b"
    }
  }
}
```
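
Clients then send requests to onwards instead of llmux, supplying the
configured key (this assumes onwards accepts standard `Authorization: Bearer`
headers; check the onwards docs for the exact auth scheme):

```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{"model": "qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'
```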

## Tensor parallelism

All sleep levels work with TP>1:

| Level | TP>1 | Notes |
|-------|------|-------|
| L1 | Yes | vLLM manages NCCL teardown/rebuild internally |
| L2 | Yes | Same — vLLM handles it |
| L3 | Yes | llmux tears down NCCL before cuda-checkpoint, rebuilds after restore |
| L4 | Yes | Same — NCCL teardown before checkpoint, rebuild after restore |
| L5 | Yes | Kill + cold restart always works |

For L3 and L4, llmux uses vLLM's `/collective_rpc` endpoint to call
`suspend_nccl` (before cuda-checkpoint) and `resume_nccl` (after restore)
on all TP workers. This tears down NCCL IPC handles that cuda-checkpoint
cannot checkpoint, then rebuilds them after CUDA state is restored.

This requires patched vLLM with `suspend_nccl`/`resume_nccl` support. The
Docker image includes these patches. For bare-metal installs, apply
`patches/nccl-suspend-resume-v0.15.1.patch` to your vLLM installation.
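
For a bare-metal install, something along these lines applies the patch to the
installed package; treat it as a sketch, since the correct `-d` directory and
`-p` strip level depend on how the patch paths are rooted:

```bash
# Locate the installed vLLM package and apply the patch in place
# (adjust -d and -p to match the path prefixes inside the patch file)
VLLM_DIR=$(python -c "import vllm, os; print(os.path.dirname(vllm.__file__))")
patch -d "$VLLM_DIR" -p1 < patches/nccl-suspend-resume-v0.15.1.patch
```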

## Known issues

The `--validate` flag exists specifically to catch the kinds of problems
listed below before they hit production.

### vLLM v0.14+ sleep regression

Sleep mode (L1/L2) has a regression where weights are not discarded from GPU
memory ([vllm#32714](https://github.com/vllm-project/vllm/issues/32714)).
The Docker image includes a patch (`fix-sleep-mode-v0.15.1.patch`) that fixes
this. For bare-metal installs, apply the patch or use L3/L5 instead.

### vLLM v0.13.0

- **`openai/gpt-oss-20b` L2 reload fails.** The MXFP4 weight loader crashes on
  wake with `default_weight_loader() got an unexpected keyword argument
  'weight_name'`. L1 works fine (19.6s sleep, 0.6s wake). Use L1 for this
  model.
- L1 and L2 both work correctly for `Qwen/Qwen3-14B` and
  `google/gemma-3-12b-it`.

### NVIDIA driver requirements

The Docker image uses vLLM v0.15.1, which requires CUDA 12.9 and
nvidia-driver-580 or later. Check your driver version with `nvidia-smi`.
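
For example, to print just the driver version:

```bash
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```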

## Compatibility

- **L1/L2 sleep:** Works with vLLM v0.13.x out of the box. Works with v0.15.1 with the sleep fix patch (included in Docker image).
- **L3 CUDA suspend / L4 CRIU checkpoint:** Works with vLLM v0.13.x+ with NCCL patches (included in Docker image). Requires `cuda-checkpoint` and CRIU (included in Docker image).
- **TP>1 with L3/L4:** Requires vLLM NCCL suspend/resume patches (included in Docker image) plus `--enforce-eager` and `--disable-custom-all-reduce` flags.

## License

MIT