# llmux
LLM multiplexer. Routes OpenAI-compatible requests to model backends and switches between them on demand using user-provided scripts.
When a request arrives for a model that isn't currently loaded, llmux drains
in-flight requests, runs your sleep script on the active model, then runs
your wake script on the requested model. The API stays up throughout —
clients just change the model field.
llmux doesn't manage model processes directly. You provide three shell scripts per model (wake, sleep, alive) and llmux calls them at the right time. This means it works with any backend — vLLM, SGLang, llama.cpp, Ollama, or anything else that speaks HTTP.
## Install
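Assuming llmux is a Rust project built with Cargo (the config reference below uses Rust integer types and the internals mention RAII guards), a from-source build would look roughly like:

```sh
# Assumes a Rust toolchain and a checkout of this repository.
cargo build --release
# binary at target/release/llmux
```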
## Quick start
Create a `config.yaml`:
```yaml
models:
  llama:
    port: 8001
    wake: ./scripts/wake-llama.sh
    sleep: ./scripts/sleep-llama.sh
    alive: curl -sf http://localhost:8001/health
  mistral:
    port: 8002
    wake: ./scripts/wake-mistral.sh
    sleep: ./scripts/sleep-mistral.sh
    alive: curl -sf http://localhost:8002/health

port: 3000
```
Run it:
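A minimal sketch, assuming the binary accepts the config path as a positional argument (the exact flags are an assumption, not documented here):

```sh
./target/release/llmux config.yaml
```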
Send requests:
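For example, a chat completion addressed to `mistral`; the `/v1/chat/completions` path is assumed here, following the OpenAI convention the proxy mirrors:

```sh
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```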
When a request comes in for `mistral`, llmux will drain active requests,
run `sleep-llama.sh`, then `wake-mistral.sh`, and proxy the request through.
## How it works
```
     Client requests
           |
      +---------+
      |  llmux  |  port 3000 (OpenAI-compatible proxy)
      +---------+
        /      \
    [8001]   [8002]
     llama   mistral
   (active) (sleeping)
```
- Middleware extracts the `model` field from the request JSON body
- Switcher checks if that model is active. If not, it triggers a switch:
  - Drains in-flight requests for the current model
  - Runs the sleep hook on the current model
  - Runs the wake hook on the target model
- Proxy forwards the request to `localhost:<model_port>`
- In-flight tracking uses RAII guards that are held through streaming responses
## Configuration
### Models
Each model needs a port and three hooks:
```yaml
models:
  my-model:
    port: 8001
    wake: |
      # Bring the model to a ready state (must be idempotent).
      # Exit 0 when the model is ready to serve requests.
      docker start my-model-container
      for i in $(seq 1 60); do
        curl -sf http://localhost:8001/health && exit 0
        sleep 1
      done
      exit 1
    sleep: |
      # Free resources. Exit 0 when done.
      docker stop my-model-container
    alive: |
      # Health check. Exit 0 = healthy, non-zero = unhealthy.
      curl -sf http://localhost:8001/health
```
Hooks are executed via `sh -c` with `LLMUX_MODEL` set in the environment.
They can be inline scripts (YAML `|` syntax) or paths to executables.
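Since `LLMUX_MODEL` is available to every hook, one executable can serve all models and branch on it. A minimal sketch (container names and ports are illustrative):

```sh
#!/bin/sh
# wake-any.sh: shared wake hook; llmux sets LLMUX_MODEL before invoking it.

case "$LLMUX_MODEL" in
  llama)   docker start llama-container;   port=8001 ;;
  mistral) docker start mistral-container; port=8002 ;;
  *)       echo "unknown model: $LLMUX_MODEL" >&2; exit 1 ;;
esac

# Report ready only once the backend answers its health check.
for i in $(seq 1 60); do
  curl -sf "http://localhost:$port/health" && exit 0
  sleep 1
done
exit 1
```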
### Policy
```yaml
policy:
  # Max time a request waits for a switch. Omit for unlimited.
  request_timeout_secs: 300

  # Wait for in-flight requests to finish before switching. Default: true.
  drain_before_switch: true

  # Minimum seconds a model stays active before it can be switched out.
  # Prevents rapid thrashing. Default: 0.
  min_active_secs: 5
```
### Full config reference
```yaml
models:
  <model-name>:
    port: <u16>      # Where the backend listens
    wake: <string>   # Script to start/restore the model
    sleep: <string>  # Script to stop/checkpoint the model
    alive: <string>  # Health check script

policy:
  request_timeout_secs: <u64 | null>  # null = unlimited
  drain_before_switch: <bool>         # default: true
  min_active_secs: <u64>              # default: 0

port: <u16>  # Proxy listen port (default: 3000)
```
Both YAML and JSON configs are supported (detected by file extension).
## Examples
### Podman + CRIU (GPU checkpoint/restore)
The `examples/podman-criu/` directory shows how to
use CRIU to checkpoint and restore vLLM containers, achieving ~3x faster
model switches vs. cold start. See the example README
for setup instructions and timings.
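The core of those hooks is Podman's checkpoint/restore pair; a stripped-down sketch of the idea (the real scripts also handle GPU state and CRIU configuration, and the container name is illustrative):

```sh
# sleep hook: checkpoint the running vLLM container and stop it
podman container checkpoint vllm-my-model

# wake hook: restore it from the checkpoint instead of cold-starting
podman container restore vllm-my-model
```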
## License
MIT