§llmux
Zero-reload model switching for vLLM: manages multiple models on a shared GPU.
This crate provides:
- Orchestrator: Lazily starts vLLM processes on first request
- Switcher: Coordinates wake/sleep between models
- Middleware: Axum layer that integrates with the onwards proxy
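The lazy-start and wake/sleep behaviour listed above can be pictured as a small state machine over the four process states the architecture diagram names (NotStarted, Starting, Running, Sleeping). The following is a minimal sketch only: `SketchState`, `Action`, `action_to_make_ready`, and `plan_switch` are hypothetical names invented for illustration, and the rule "sleep every other running model before readying the target" is an assumption about how a shared GPU would be handed over, not a description of the crate's actual types or logic.

```rust
// Hypothetical sketch: illustrative types only, not llmux's real API.

/// The four process states named in the architecture diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchState {
    NotStarted,
    Starting,
    Running,
    Sleeping,
}

/// What has to happen before a model in a given state can serve a request.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Spawn,   // process has never been started
    Wait,    // startup is already in progress
    Wake,    // process exists but is asleep
    Nothing, // already serving
}

fn action_to_make_ready(state: SketchState) -> Action {
    match state {
        SketchState::NotStarted => Action::Spawn,
        SketchState::Starting => Action::Wait,
        SketchState::Sleeping => Action::Wake,
        SketchState::Running => Action::Nothing,
    }
}

/// Plan a switch to `target`: first sleep every other running model
/// (assumed rule), then bring the target into a ready state.
fn plan_switch(states: &[(&str, SketchState)], target: &str) -> Vec<String> {
    let mut steps = Vec::new();
    for (name, state) in states {
        if *name != target && *state == SketchState::Running {
            steps.push(format!("sleep {name}"));
        }
    }
    if let Some((_, state)) = states.iter().find(|(name, _)| *name == target) {
        match action_to_make_ready(*state) {
            Action::Spawn => steps.push(format!("spawn {target}")),
            Action::Wait => steps.push(format!("wait {target}")),
            Action::Wake => steps.push(format!("wake {target}")),
            Action::Nothing => {}
        }
    }
    steps
}
```

Under these assumptions, asking for a sleeping model while another one is running yields a two-step plan (sleep the running model, wake the target), and asking for the model that is already running yields an empty plan.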
§Architecture
┌─────────────────────────────────────────────────────────────┐
│                            llmux                            │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                    Orchestrator                     │    │
│  │ - Spawns vLLM processes lazily                      │    │
│  │ - Tracks: NotStarted | Starting | Running | Sleeping│    │
│  └─────────────────────────────────────────────────────┘    │
│                             │                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                  Middleware Layer                   │    │
│  │ - Extracts model from request                       │    │
│  │ - Ensures model ready before forwarding             │    │
│  └─────────────────────────────────────────────────────┘    │
│                             │                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                    Onwards Proxy                    │    │
│  │ - Routes to vLLM by model name                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                             │                               │
│         ┌───────────────────┼───────────────────┐           │
│         ▼                   ▼                   ▼           │
│    [vLLM:8001]         [vLLM:8002]         [vLLM:8003]      │
│      (llama)            (mistral)            (qwen)         │
└─────────────────────────────────────────────────────────────┘
Modules§
- control - Control API for manual model management.
- object_store - S3-compatible object store for checkpoint images.
- validate - Validation tool for sleep/wake cycles
Structs§
- CheckpointConfig - Configuration for CUDA/CRIU-based checkpointing.
- Config - Top-level configuration
- CostAwarePolicy - Cost-aware coalescing policy.
- EvictionPolicy - Two-axis eviction policy: weight management x process management.
- FifoPolicy - FIFO policy: switch immediately on first request
- ModelConfig - Configuration for a single model.
- ModelSwitcher - The model switcher coordinates wake/sleep transitions
- ModelSwitcherLayer - Layer that adds model switching to a service
- ModelSwitcherService - Service that wraps requests with model switching
- ObjectStoreConfig - S3-compatible object store configuration for checkpoint persistence.
- Orchestrator - Orchestrator manages vLLM process lifecycle
- PolicyConfig - Policy configuration
- PolicyContext - Context provided to policies when making switch decisions
- ScheduleContext - Context provided to the background scheduler on each tick
- TimeSlicePolicy - Drain-first scheduling policy with a proactive background scheduler.
Enums§
- OrchestratorError - Errors from the orchestrator
- PolicyDecision - Decision returned by policy
- ProcessState - State of a model’s vLLM process
- ProcessStrategy - What to do with the OS process after weight strategy is applied.
- SwitchError - Errors from the switcher
- SwitchMode - Switch mode controls whether model switching is automatic or manual.
- SwitcherState - State of the model switcher
- WeightStrategy - What to do with model weights when freeing GPU memory.
Traits§
- SwitchPolicy - Policy trait for controlling model switching behavior
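To make the relationship between a policy trait, its context, and its decision more concrete, here is a rough sketch of how such a trait could be shaped. Everything in it is an assumption made for illustration: `PolicySketch`, `ContextSketch`, `DecisionSketch`, their fields and variants, and the two example policies are hypothetical and are not the real definitions of SwitchPolicy, PolicyContext, or PolicyDecision. The FIFO variant follows the "switch immediately on first request" description; the drain-first variant is only a loose contrast that defers while requests are still in flight.

```rust
// Hypothetical sketch: names, fields and variants are assumptions,
// not the crate's real SwitchPolicy / PolicyContext / PolicyDecision.

/// Information a policy might look at when a request arrives.
struct ContextSketch<'a> {
    active_model: Option<&'a str>,
    requested_model: &'a str,
    in_flight: usize, // requests still running on the active model
}

#[derive(Debug, PartialEq, Eq)]
enum DecisionSketch {
    Stay,      // requested model is already active
    SwitchNow, // switch immediately
    Defer,     // wait before switching
}

trait PolicySketch {
    fn decide(&self, ctx: &ContextSketch<'_>) -> DecisionSketch;
}

/// FIFO-style policy: switch as soon as a request for another model arrives.
struct FifoSketch;

impl PolicySketch for FifoSketch {
    fn decide(&self, ctx: &ContextSketch<'_>) -> DecisionSketch {
        if ctx.active_model == Some(ctx.requested_model) {
            DecisionSketch::Stay
        } else {
            DecisionSketch::SwitchNow
        }
    }
}

/// Drain-first style policy: let in-flight requests finish before switching.
struct DrainFirstSketch;

impl PolicySketch for DrainFirstSketch {
    fn decide(&self, ctx: &ContextSketch<'_>) -> DecisionSketch {
        if ctx.active_model == Some(ctx.requested_model) {
            DecisionSketch::Stay
        } else if ctx.in_flight > 0 {
            DecisionSketch::Defer
        } else {
            DecisionSketch::SwitchNow
        }
    }
}
```

Keeping the decision as a plain enum means the caller, not the policy, performs the actual wake/sleep work, which is one plausible reason to separate a policy trait from the switcher itself.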
Functions§
- build_app - Build the complete llmux stack
- run_warmup - Run the warmup phase: start each model, run one inference, then sleep it.
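The warmup phase described for run_warmup (start each model, run one inference, then sleep it) can be sketched as a simple sequential loop. This is not the signature or implementation of run_warmup: `WarmupBackend`, `warmup_sketch`, and `RecordingBackend` are hypothetical names, the backend is synchronous for brevity, and stopping at the first error is an assumed behaviour.

```rust
// Hypothetical sketch: a stand-in for the warmup sequence, not llmux's run_warmup.

/// Operations the warmup loop needs from whatever manages the models.
trait WarmupBackend {
    fn start(&mut self, model: &str) -> Result<(), String>;
    fn infer_once(&mut self, model: &str) -> Result<(), String>;
    fn sleep(&mut self, model: &str) -> Result<(), String>;
}

/// For each model in order: start it, run one inference, then put it to sleep.
/// Stops at the first error (assumed behaviour).
fn warmup_sketch<B: WarmupBackend>(backend: &mut B, models: &[&str]) -> Result<(), String> {
    for model in models {
        backend.start(model)?;
        backend.infer_once(model)?;
        backend.sleep(model)?;
    }
    Ok(())
}

/// Test double that records the calls it receives instead of touching a GPU.
#[derive(Default)]
struct RecordingBackend {
    calls: Vec<String>,
}

impl WarmupBackend for RecordingBackend {
    fn start(&mut self, model: &str) -> Result<(), String> {
        self.calls.push(format!("start {model}"));
        Ok(())
    }
    fn infer_once(&mut self, model: &str) -> Result<(), String> {
        self.calls.push(format!("infer {model}"));
        Ok(())
    }
    fn sleep(&mut self, model: &str) -> Result<(), String> {
        self.calls.push(format!("sleep {model}"));
        Ok(())
    }
}
```

Running `warmup_sketch` against a `RecordingBackend` with two model names records six calls in start, infer, sleep order per model, so each model has been slept before the next one starts.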