# llmux
[crates.io](https://crates.io/crates/llmux)
[GitHub](https://github.com/doublewordai/llmux)
LLM multiplexer. Routes OpenAI-compatible requests to model backends and
switches between them on demand using user-provided scripts.
When a request arrives for a model that isn't currently loaded, llmux drains
in-flight requests, runs your **sleep** script on the active model, then runs
your **wake** script on the requested model. The API stays up throughout —
clients just change the `model` field.
llmux doesn't manage model processes directly. You provide three shell scripts
per model (**wake**, **sleep**, **alive**) and llmux calls them at the right
time. This means it works with any backend — vLLM, SGLang, llama.cpp, Ollama,
or anything else that speaks HTTP.
## Install
```sh
cargo install llmux
```
## Quick start
Create a `config.yaml`:
```yaml
models:
  llama:
    port: 8001
    wake: ./scripts/wake-llama.sh
    sleep: ./scripts/sleep-llama.sh
    alive: curl -sf http://localhost:8001/health
  mistral:
    port: 8002
    wake: ./scripts/wake-mistral.sh
    sleep: ./scripts/sleep-mistral.sh
    alive: curl -sf http://localhost:8002/health
port: 3000   # llmux's own listen port
```
Run it:
```sh
llmux -c config.yaml
```
Send requests:
```sh
curl http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "Hi"}]}'
```
When a request comes in for `mistral`, llmux will drain active requests,
run `sleep-llama.sh`, then `wake-mistral.sh`, and proxy the request through.
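The endpoint and headers stay the same; only the `model` field changes. The first request for a sleeping model waits for the switch to finish (bounded by `request_timeout_secs`, see the Policy section below):
```sh
curl http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "Hi"}]}'
```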
## How it works
```
     Client requests
           |
      llmux :3000
      /         \
  [8001]      [8002]
   llama      mistral
 (active)    (sleeping)
```
1. **Middleware** extracts the `model` field from the request JSON body
2. **Switcher** checks if that model is active. If not, triggers a switch:
- Drains in-flight requests for the current model
- Runs the **sleep** hook on the current model
- Runs the **wake** hook on the target model
3. **Proxy** forwards the request to `localhost:<model_port>`
4. In-flight tracking uses RAII guards that hold through streaming responses
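Because requests are forwarded to the backend's own port, you can also talk to the active backend directly when debugging. A quick illustration, assuming the backend on port 8001 itself exposes the OpenAI-compatible route (as vLLM and similar servers do):
```sh
# Through llmux (may trigger a switch if llama isn't active):
curl -s http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "Hi"}]}'

# Straight to the backend, bypassing llmux's switching and in-flight tracking:
curl -s http://localhost:8001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "Hi"}]}'
```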
## Configuration
### Models
Each model needs a `port` and three hooks:
```yaml
models:
  my-model:
    port: 8001
    wake: |
      # Bring the model to a ready state (must be idempotent).
      # Exit 0 when the model is ready to serve requests.
      docker start my-model-container
      for i in $(seq 1 60); do
        curl -sf http://localhost:8001/health && exit 0
        sleep 1
      done
      exit 1
    sleep: |
      # Free resources. Exit 0 when done.
      docker stop my-model-container
    alive: |
      # Health check. Exit 0 = healthy, non-zero = unhealthy.
      curl -sf http://localhost:8001/health
```
Hooks are executed via `sh -c` with `LLMUX_MODEL` set in the environment.
They can be inline scripts (YAML `|` syntax) or paths to executables.
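Because every hook sees `LLMUX_MODEL`, several models can share a single hook executable. A minimal sketch, assuming `LLMUX_MODEL` holds the model's key from `models:` and with placeholder container names you would substitute:
```sh
#!/bin/sh
# scripts/wake.sh: shared wake hook. llmux invokes it via `sh -c` with
# LLMUX_MODEL set (assumed here to be the model name from the config).
case "$LLMUX_MODEL" in
  llama)   docker start llama-container;   port=8001 ;;
  mistral) docker start mistral-container; port=8002 ;;
  *)       echo "unknown model: $LLMUX_MODEL" >&2; exit 1 ;;
esac

# Poll the backend's health endpoint; exit 0 once it answers (same pattern as above).
for i in $(seq 1 60); do
  curl -sf "http://localhost:$port/health" && exit 0
  sleep 1
done
exit 1
```
Each model's `wake:` entry would then point at the same path, e.g. `wake: ./scripts/wake.sh`.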
### Policy
```yaml
policy:
  # Max time a request waits for a switch. Omit for unlimited.
  request_timeout_secs: 300
  # Wait for in-flight requests to finish before switching. Default: true.
  drain_before_switch: true
  # Minimum seconds a model stays active before it can be switched out.
  # Prevents rapid thrashing. Default: 0.
  min_active_secs: 5
```
### Full config reference
```yaml
models:
  <model-name>:
    port: <u16>        # Where the backend listens
    wake: <string>     # Script to start/restore the model
    sleep: <string>    # Script to stop/checkpoint the model
    alive: <string>    # Health check script
policy:
  request_timeout_secs: <u64>   # default: unlimited
  drain_before_switch: <bool>   # default: true
  min_active_secs: <u64>        # default: 0
port: <u16>            # Proxy listen port (default: 3000)
```
Both YAML and JSON configs are supported (detected by file extension).
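For example, the quick-start config above as `config.json` (the keys should map one to one onto the YAML form):
```json
{
  "models": {
    "llama": {
      "port": 8001,
      "wake": "./scripts/wake-llama.sh",
      "sleep": "./scripts/sleep-llama.sh",
      "alive": "curl -sf http://localhost:8001/health"
    },
    "mistral": {
      "port": 8002,
      "wake": "./scripts/wake-mistral.sh",
      "sleep": "./scripts/sleep-mistral.sh",
      "alive": "curl -sf http://localhost:8002/health"
    }
  },
  "port": 3000
}
```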
## Examples
### Podman + CRIU (GPU checkpoint/restore)
The [`examples/podman-criu/`](examples/podman-criu/) directory shows how to
use CRIU to checkpoint and restore vLLM containers, achieving ~3x faster
model switches vs. cold start. See the [example README](examples/podman-criu/README.md)
for setup instructions and timings.
## License
MIT