Crate llmux

§llmux

Zero-reload model switching for vLLM: manages multiple models on a shared GPU.

This crate provides:

Orchestrator: Lazily starts vLLM processes on first request
Switcher: Coordinates wake/sleep between models
Middleware: Axum layer that integrates with the onwards proxy

§Architecture

┌─────────────────────────────────────────────────────────────┐
│                            llmux                            │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Orchestrator                                         │   │
│  │ - Spawns vLLM processes lazily                       │   │
│  │ - Tracks: NotStarted | Starting | Running | Sleeping │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Middleware Layer                                     │   │
│  │ - Extracts model from request                        │   │
│  │ - Ensures model ready before forwarding              │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Onwards Proxy                                        │   │
│  │ - Routes to vLLM by model name                       │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│      ┌───────────────────┼───────────────────┐             │
│      ▼                   ▼                   ▼             │
│  [vLLM:8001]        [vLLM:8002]         [vLLM:8003]        │
│   (llama)           (mistral)           (qwen)            │
└─────────────────────────────────────────────────────────────┘
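The diagram names four process states (NotStarted, Starting, Running, Sleeping) but does not show how a process moves between them. The following is a minimal sketch of one plausible transition table, using only the standard library. The type names `SketchState` and `SketchEvent`, the event set, and the allowed transitions are all assumptions made for illustration; they are not taken from the crate's `ProcessState` enum or the `Orchestrator` API.

```rust
/// The four lifecycle states named in the architecture diagram.
/// Hypothetical stand-in type, not the crate's `ProcessState`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchState {
    NotStarted,
    Starting,
    Running,
    Sleeping,
}

/// Hypothetical events that drive the lifecycle (assumed, not from llmux).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchEvent {
    FirstRequest,
    Ready,
    Sleep,
    Wake,
}

/// Returns the next state, or `None` when the event does not apply
/// to the current state. The transition table itself is an assumption.
fn next_state(state: SketchState, event: SketchEvent) -> Option<SketchState> {
    use SketchEvent::*;
    use SketchState::*;
    match (state, event) {
        // Lazy start: nothing is spawned until a request arrives.
        (NotStarted, FirstRequest) => Some(Starting),
        (Starting, Ready) => Some(Running),
        // Sleeping frees the GPU for another model; waking restores it.
        (Running, Sleep) => Some(Sleeping),
        (Sleeping, Wake) => Some(Running),
        _ => None,
    }
}
```

In this sketch an out-of-order event (for example, `Wake` on a process that was never started) yields `None` rather than panicking, so a caller can decide how to report it.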

Modules§

control: Control API for manual model management.
object_store: S3-compatible object store for checkpoint images.
validate: Validation tool for sleep/wake cycles

Structs§

CheckpointConfig: Configuration for CUDA/CRIU-based checkpointing.
Config: Top-level configuration
CostAwarePolicy: Cost-aware coalescing policy.
EvictionPolicy: Two-axis eviction policy: weight management x process management.
FifoPolicy: FIFO policy - switch immediately on first request
ModelConfig: Configuration for a single model.
ModelSwitcher: The model switcher coordinates wake/sleep transitions
ModelSwitcherLayer: Layer that adds model switching to a service
ModelSwitcherService: Service that wraps requests with model switching
ObjectStoreConfig: S3-compatible object store configuration for checkpoint persistence.
Orchestrator: Orchestrator manages vLLM process lifecycle
PolicyConfig: Policy configuration
PolicyContext: Context provided to policies when making switch decisions
ScheduleContext: Context provided to the background scheduler on each tick
TimeSlicePolicy: Drain-first scheduling policy with a proactive background scheduler.

Enums§

OrchestratorError: Errors from the orchestrator
PolicyDecision: Decision returned by policy
ProcessState: State of a model’s vLLM process
ProcessStrategy: What to do with the OS process after weight strategy is applied.
SwitchError: Errors from the switcher
SwitchMode: Switch mode controls whether model switching is automatic or manual.
SwitcherState: State of the model switcher
WeightStrategy: What to do with model weights when freeing GPU memory.

Traits§

SwitchPolicy: Policy trait for controlling model switching behavior
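
The listing describes a policy trait, a context passed to it, a decision it returns, and a FIFO policy that switches immediately on the first request. The sketch below illustrates that shape with hypothetical names (`SketchPolicy`, `SketchContext`, `SketchDecision`, `SketchFifo`). The real `SwitchPolicy`, `PolicyContext`, and `PolicyDecision` signatures are not shown on this page, so every field and variant here is an assumption.

```rust
/// Hypothetical stand-in for the context a policy receives.
struct SketchContext<'a> {
    /// Model currently holding the GPU, if any.
    active_model: Option<&'a str>,
    /// Model named by the incoming request.
    requested_model: &'a str,
}

/// Hypothetical stand-in for the decision a policy returns.
#[derive(Debug, PartialEq, Eq)]
enum SketchDecision {
    /// The requested model is already active; nothing to do.
    NoSwitch,
    /// Switch to the requested model right away.
    SwitchNow,
}

/// Hypothetical policy trait (assumed shape, not the crate's `SwitchPolicy`).
trait SketchPolicy {
    fn decide(&self, ctx: &SketchContext<'_>) -> SketchDecision;
}

/// FIFO behaviour as described in the listing: switch on the first
/// request for a model that is not active, without waiting to batch.
struct SketchFifo;

impl SketchPolicy for SketchFifo {
    fn decide(&self, ctx: &SketchContext<'_>) -> SketchDecision {
        if ctx.active_model == Some(ctx.requested_model) {
            SketchDecision::NoSwitch
        } else {
            SketchDecision::SwitchNow
        }
    }
}
```

A cost-aware or time-sliced policy would presumably implement the same trait and add a "wait" style decision, but the page gives no detail on how those policies weigh requests, so that is left out here.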

Functions§

build_app: Build the complete llmux stack
run_warmup: Run the warmup phase: start each model, run one inference, then sleep it.
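
The warmup phase described for `run_warmup` (start each model, run one inference, then sleep it) can be sketched as a simple loop. The trait `SketchBackend`, the function `warmup_sketch`, and the `RecordingBackend` test double are hypothetical names introduced for illustration. The real `run_warmup` signature is not shown on this page, and the choice to stop at the first error is an assumption.

```rust
/// Hypothetical backend interface; llmux's real orchestrator API may differ.
trait SketchBackend {
    fn start(&mut self, model: &str) -> Result<(), String>;
    fn infer_once(&mut self, model: &str) -> Result<(), String>;
    fn sleep(&mut self, model: &str) -> Result<(), String>;
}

/// Warm up models one at a time: start, run one inference, then sleep,
/// so that only one model is awake on the GPU at any point.
/// Stops at the first error (an assumption about failure handling).
fn warmup_sketch<B: SketchBackend>(backend: &mut B, models: &[&str]) -> Result<(), String> {
    for model in models {
        backend.start(model)?;
        backend.infer_once(model)?;
        backend.sleep(model)?;
    }
    Ok(())
}

/// Test double that records the order of calls instead of touching a GPU.
#[derive(Default)]
struct RecordingBackend {
    log: Vec<String>,
}

impl SketchBackend for RecordingBackend {
    fn start(&mut self, model: &str) -> Result<(), String> {
        self.log.push(format!("start {}", model));
        Ok(())
    }
    fn infer_once(&mut self, model: &str) -> Result<(), String> {
        self.log.push(format!("infer {}", model));
        Ok(())
    }
    fn sleep(&mut self, model: &str) -> Result<(), String> {
        self.log.push(format!("sleep {}", model));
        Ok(())
    }
}
```

Calling `warmup_sketch(&mut RecordingBackend::default(), &["llama", "qwen"])` records the three steps for the first model in order, then the three steps for the second.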
