Crate llmux

§llmux

Zero-reload model switching for vLLM: manages multiple models on a shared GPU.

This crate provides:

  • Orchestrator: Lazily starts vLLM processes on first request
  • Switcher: Coordinates wake/sleep between models
  • Middleware: Axum layer that integrates with the onwards proxy

§Architecture

┌─────────────────────────────────────────────────────────────┐
│                            llmux                            │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Orchestrator                                         │   │
│  │ - Spawns vLLM processes lazily                       │   │
│  │ - Tracks: NotStarted | Starting | Running | Sleeping │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Middleware Layer                                     │   │
│  │ - Extracts model from request                        │   │
│  │ - Ensures model ready before forwarding              │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Onwards Proxy                                        │   │
│  │ - Routes to vLLM by model name                       │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │                                  │
│      ┌───────────────────┼───────────────────┐              │
│      ▼                   ▼                   ▼              │
│  [vLLM:8001]        [vLLM:8002]         [vLLM:8003]         │
│   (llama)           (mistral)           (qwen)              │
└─────────────────────────────────────────────────────────────┘
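
The four states the orchestrator tracks in the diagram can be read as a small state machine. The sketch below is an illustration only: the `State` enum mirrors the names shown in the diagram, while the `Event` type, the `step` function, and the transition table are assumptions made for this example and are not taken from the crate's `ProcessState` API.

```rust
/// Hypothetical mirror of the states listed in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    NotStarted,
    Starting,
    Running,
    Sleeping,
}

/// Hypothetical events; the real triggers in llmux may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    FirstRequest, // lazy spawn on the first request for a model
    Ready,        // the spawned process reports it is ready
    Sleep,        // the GPU is needed by another model
    Wake,         // the model is requested again
}

/// Returns the next state, or `None` if `event` is not valid in `state`.
pub fn step(state: State, event: Event) -> Option<State> {
    use Event::*;
    use State::*;
    match (state, event) {
        (NotStarted, FirstRequest) => Some(Starting),
        (Starting, Ready) => Some(Running),
        (Running, Sleep) => Some(Sleeping),
        (Sleeping, Wake) => Some(Running),
        _ => None, // every other combination is rejected
    }
}
```

Under these assumptions, `step(State::Sleeping, Event::Wake)` yields `Some(State::Running)`, while an out-of-order event such as `Wake` on a `NotStarted` model yields `None`.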

Modules§

control
Control API for manual model management.
object_store
S3-compatible object store for checkpoint images.
validate
Validation tool for sleep/wake cycles

Structs§

CheckpointConfig
Configuration for CUDA/CRIU-based checkpointing.
Config
Top-level configuration
CostAwarePolicy
Cost-aware coalescing policy.
EvictionPolicy
Two-axis eviction policy: weight management × process management.
FifoPolicy
FIFO policy - switch immediately on first request
ModelConfig
Configuration for a single model.
ModelSwitcher
The model switcher coordinates wake/sleep transitions
ModelSwitcherLayer
Layer that adds model switching to a service
ModelSwitcherService
Service that wraps requests with model switching
ObjectStoreConfig
S3-compatible object store configuration for checkpoint persistence.
Orchestrator
Orchestrator manages vLLM process lifecycle
PolicyConfig
Policy configuration
PolicyContext
Context provided to policies when making switch decisions
ScheduleContext
Context provided to the background scheduler on each tick
TimeSlicePolicy
Drain-first scheduling policy with a proactive background scheduler.

Enums§

OrchestratorError
Errors from the orchestrator
PolicyDecision
Decision returned by policy
ProcessState
State of a model’s vLLM process
ProcessStrategy
What to do with the OS process after weight strategy is applied.
SwitchError
Errors from the switcher
SwitchMode
Switch mode controls whether model switching is automatic or manual.
SwitcherState
State of the model switcher
WeightStrategy
What to do with model weights when freeing GPU memory.
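
The summaries for `EvictionPolicy`, `WeightStrategy`, and `ProcessStrategy` describe two independent axes, with the weight decision applied before the process decision. The sketch below shows one way such a two-axis value could be modelled. Every type name, variant, and string here is hypothetical and chosen for illustration; the crate's actual variants are not listed on this page.

```rust
/// Hypothetical weight axis: what happens to the weights held in GPU memory.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Weights {
    Retain,
    Offload,
    Discard,
}

/// Hypothetical process axis: what happens to the OS process afterwards.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Process {
    KeepRunning,
    Stop,
}

/// One point in the two-axis grid (3 x 2 = 6 combinations in this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Eviction {
    pub weights: Weights,
    pub process: Process,
}

impl Eviction {
    /// Ordered steps: the weight step always comes before the process step.
    pub fn plan(&self) -> Vec<&'static str> {
        let weight_step = match self.weights {
            Weights::Retain => "retain weights on GPU",
            Weights::Offload => "offload weights to host memory",
            Weights::Discard => "discard weights",
        };
        let process_step = match self.process {
            Process::KeepRunning => "keep process running",
            Process::Stop => "stop process",
        };
        vec![weight_step, process_step]
    }
}
```

Keeping the axes as separate enums means each combination is expressible without a dedicated variant per pair, which is presumably the motivation for the two-axis design.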

Traits§

SwitchPolicy
Policy trait for controlling model switching behavior
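
To make the relationship between a policy trait, its context, and its decision concrete, here is a minimal sketch. The names `Policy`, `Ctx`, `Decision`, `Fifo`, and `Coalescing`, along with all fields and method signatures, are assumptions for illustration; they are not the signatures of `SwitchPolicy`, `PolicyContext`, `PolicyDecision`, `FifoPolicy`, or `CostAwarePolicy`. `Fifo` follows the documented behaviour of switching on the first request; `Coalescing` is a simple threshold rule standing in for a batching policy.

```rust
use std::time::Duration;

/// Hypothetical decision type.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    SwitchNow,
    Defer,
}

/// Hypothetical context handed to a policy.
pub struct Ctx {
    /// Requests waiting for the model that is not currently active.
    pub pending_for_target: usize,
    /// How long the oldest of those requests has waited.
    pub oldest_wait: Duration,
}

/// Hypothetical policy trait.
pub trait Policy {
    fn decide(&self, ctx: &Ctx) -> Decision;
}

/// Switches as soon as a single request is waiting.
pub struct Fifo;

impl Policy for Fifo {
    fn decide(&self, ctx: &Ctx) -> Decision {
        if ctx.pending_for_target > 0 {
            Decision::SwitchNow
        } else {
            Decision::Defer
        }
    }
}

/// Waits for a batch of requests, but never longer than `max_wait`.
pub struct Coalescing {
    pub min_pending: usize,
    pub max_wait: Duration,
}

impl Policy for Coalescing {
    fn decide(&self, ctx: &Ctx) -> Decision {
        let enough = ctx.pending_for_target >= self.min_pending;
        let overdue = ctx.pending_for_target > 0 && ctx.oldest_wait >= self.max_wait;
        if enough || overdue {
            Decision::SwitchNow
        } else {
            Decision::Defer
        }
    }
}
```

With one request that has waited one second, `Fifo` returns `SwitchNow`, whereas `Coalescing { min_pending: 4, max_wait: Duration::from_secs(5) }` returns `Defer` until either three more requests arrive or the wait reaches five seconds.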

Functions§

build_app
Build the complete llmux stack
run_warmup
Run the warmup phase: start each model, run one inference, then sleep it.
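
The warmup sequence described for `run_warmup` (start each model, run one inference, then sleep it) can be sketched as a loop over a backend abstraction. The `Backend` trait, the `warmup` function, and the `Recorder` type below are hypothetical stand-ins written for this example; the real signature of `run_warmup` is not shown on this page.

```rust
/// Hypothetical abstraction over whatever starts, queries, and sleeps a model.
pub trait Backend {
    fn start(&mut self, model: &str) -> Result<(), String>;
    fn infer_once(&mut self, model: &str) -> Result<(), String>;
    fn sleep(&mut self, model: &str) -> Result<(), String>;
}

/// Warms up each model in order, stopping at the first error.
pub fn warmup<B: Backend>(backend: &mut B, models: &[&str]) -> Result<(), String> {
    for &model in models {
        backend.start(model)?;      // bring the process up
        backend.infer_once(model)?; // run a single inference
        backend.sleep(model)?;      // put it to sleep before the next model
    }
    Ok(())
}

/// In-memory backend that only records the calls it receives.
#[derive(Default)]
pub struct Recorder {
    pub calls: Vec<String>,
}

impl Backend for Recorder {
    fn start(&mut self, model: &str) -> Result<(), String> {
        self.calls.push(format!("start {}", model));
        Ok(())
    }
    fn infer_once(&mut self, model: &str) -> Result<(), String> {
        self.calls.push(format!("infer {}", model));
        Ok(())
    }
    fn sleep(&mut self, model: &str) -> Result<(), String> {
        self.calls.push(format!("sleep {}", model));
        Ok(())
    }
}
```

Running `warmup` over `["llama", "mistral"]` with a `Recorder` records the three steps for `llama` before any step for `mistral`, so in this sketch only one model is awake at a time during warmup.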