§llmux
Zero-reload model switching for vLLM: manages multiple models on a shared GPU.
This crate provides:
- Orchestrator: Lazily starts vLLM processes on first request
- Switcher: Coordinates wake/sleep between models
- Middleware: Axum layer that integrates with the onwards proxy
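The lazy-start and wake/sleep behaviour listed above can be pictured as a small state machine. The sketch below is illustrative only: the four state names come from the architecture diagram, while `LifecycleState`, `Event`, and `step` are hypothetical names invented for this example and are not the crate's API. Waking is treated as a single step for simplicity.

```rust
/// Lifecycle states, named after the states in the architecture diagram.
/// (Hypothetical type; the crate's own state enum may differ.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LifecycleState {
    NotStarted,
    Starting,
    Running,
    Sleeping,
}

/// Hypothetical events that drive the lifecycle (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// A request arrives for this model.
    Request,
    /// The spawned process reports that it is ready to serve.
    Ready,
    /// The switcher puts this model to sleep to free the GPU.
    Sleep,
}

/// Returns the next state, or `None` if `event` is not valid in `state`.
pub fn step(state: LifecycleState, event: Event) -> Option<LifecycleState> {
    use Event::*;
    use LifecycleState::*;
    match (state, event) {
        // Lazy start: the first request triggers the spawn.
        (NotStarted, Request) => Some(Starting),
        // Requests that arrive during startup do not change the state.
        (Starting, Request) => Some(Starting),
        (Starting, Ready) => Some(Running),
        (Running, Request) => Some(Running),
        (Running, Sleep) => Some(Sleeping),
        // A request for a sleeping model wakes it (assumed instantaneous here).
        (Sleeping, Request) => Some(Running),
        _ => None,
    }
}
```

Under these assumptions, a model that has never been requested stays in `NotStarted` and consumes no GPU memory, and `step` rejects transitions such as putting a not-yet-started model to sleep.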
§Architecture
┌─────────────────────────────────────────────────────────────────┐
│                              llmux                              │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      Orchestrator                       │   │
│   │  - Spawns vLLM processes lazily                         │   │
│   │  - Tracks: NotStarted | Starting | Running | Sleeping   │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                │                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Middleware Layer                     │   │
│   │  - Extracts model from request                          │   │
│   │  - Ensures model ready before forwarding                │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                │                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      Onwards Proxy                      │   │
│   │  - Routes to vLLM by model name                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                │                                │
│            ┌───────────────────┼───────────────────┐            │
│            ▼                   ▼                   ▼            │
│       [vLLM:8001]         [vLLM:8002]         [vLLM:8003]       │
│         (llama)            (mistral)            (qwen)          │
└─────────────────────────────────────────────────────────────────┘
Modules§
- validate
- Validation tool for sleep/wake cycles
Structs§
- CheckpointConfig
- Configuration for CUDA/CRIU-based checkpointing (sleep levels 3 and 4).
- Config
- Top-level configuration
- CostAwarePolicy
- Cost-aware coalescing policy.
- FifoPolicy
- FIFO policy - switch immediately on first request
- ModelConfig
- Configuration for a single model
- ModelSwitcher
- The model switcher coordinates wake/sleep transitions
- ModelSwitcherLayer
- Layer that adds model switching to a service
- ModelSwitcherService
- Service that wraps requests with model switching
- Orchestrator
- Orchestrator manages vLLM process lifecycle
- PolicyConfig
- Policy configuration
- PolicyContext
- Context provided to policies when making switch decisions
- ScheduleContext
- Context provided to the background scheduler on each tick
- TimeSlicePolicy
- Drain-first scheduling policy with a proactive background scheduler.
Enums§
- OrchestratorError
- Errors from the orchestrator
- PolicyDecision
- Decision returned by policy
- ProcessState
- State of a model’s vLLM process
- SleepLevel
- Sleep level for hibernating models
- SwitchError
- Errors from the switcher
- SwitcherState
- State of the model switcher
Traits§
- SwitchPolicy
- Policy trait for controlling model switching behavior
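To make the idea of a pluggable switching policy concrete, here is a minimal sketch of how a policy trait, a context, and a decision type might fit together. Every name in it (`Policy`, `Ctx`, `Decision`, `Fifo`, `DrainFirst`) is hypothetical and chosen for illustration; the crate's real trait, its method signatures, and the fields of its context types are not shown on this page and may look quite different. `Fifo` mirrors the documented "switch immediately on first request" behaviour, and `DrainFirst` is a simplified contrast, not the crate's time-slice policy.

```rust
/// Hypothetical context handed to a policy when a request targets a
/// model that is not currently active.
pub struct Ctx {
    /// Requests still being served by the currently active model.
    pub in_flight_on_active: usize,
}

/// Hypothetical decision returned by a policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    /// Put the active model to sleep and wake the requested one now.
    SwitchNow,
    /// Keep the active model for now and ask again later.
    Defer,
}

/// Hypothetical policy trait: one decision per switch opportunity.
pub trait Policy {
    fn decide(&self, ctx: &Ctx) -> Decision;
}

/// Switches as soon as any request for another model arrives.
pub struct Fifo;

impl Policy for Fifo {
    fn decide(&self, _ctx: &Ctx) -> Decision {
        Decision::SwitchNow
    }
}

/// Waits until the active model has no requests in flight.
pub struct DrainFirst;

impl Policy for DrainFirst {
    fn decide(&self, ctx: &Ctx) -> Decision {
        if ctx.in_flight_on_active == 0 {
            Decision::SwitchNow
        } else {
            Decision::Defer
        }
    }
}
```

Keeping the decision behind a trait means the switcher can stay agnostic about scheduling strategy: with three requests in flight, `Fifo` in this sketch returns `SwitchNow` while `DrainFirst` returns `Defer`.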
Functions§
- build_app
- Build the complete llmux stack