Crate llmux


§llmux

Zero-reload model switching for vLLM - manages multiple models on a shared GPU.

This crate provides:

Orchestrator: Lazily starts vLLM processes on first request
Switcher: Coordinates wake/sleep between models
Middleware: Axum layer that integrates with the onwards proxy

§Architecture

┌─────────────────────────────────────────────────────────────┐
│                            llmux                            │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Orchestrator                                         │   │
│  │ - Spawns vLLM processes lazily                       │   │
│  │ - Tracks: NotStarted | Starting | Running | Sleeping │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Middleware Layer                                     │   │
│  │ - Extracts model from request                        │   │
│  │ - Ensures model ready before forwarding              │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Onwards Proxy                                        │   │
│  │ - Routes to vLLM by model name                       │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│      ┌───────────────────┼───────────────────┐             │
│      ▼                   ▼                   ▼             │
│  [vLLM:8001]        [vLLM:8002]         [vLLM:8003]        │
│   (llama)           (mistral)           (qwen)            │
└─────────────────────────────────────────────────────────────┘
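The lifecycle in the diagram can be sketched as a small state tracker. This is a minimal illustration, not the crate's implementation: `State`, `Tracker`, and `ensure_ready` are hypothetical names, the "actions" are plain strings rather than real process control, and the sketch assumes only one model is awake on the GPU at a time.

```rust
use std::collections::HashMap;

/// Lifecycle states named in the architecture diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    NotStarted,
    Starting,
    Running,
    Sleeping,
}

/// Hypothetical in-memory tracker. The real Orchestrator manages actual
/// vLLM processes; this one only records state and the actions it would take.
struct Tracker {
    states: HashMap<String, State>,
    active: Option<String>,
}

impl Tracker {
    fn new(models: &[&str]) -> Self {
        let states = models
            .iter()
            .map(|m| (m.to_string(), State::NotStarted))
            .collect();
        Tracker { states, active: None }
    }

    /// Make `model` the awake model and return the actions that were needed.
    fn ensure_ready(&mut self, model: &str) -> Result<Vec<String>, String> {
        let current = *self
            .states
            .get(model)
            .ok_or_else(|| format!("unknown model: {model}"))?;
        let mut actions = Vec::new();
        if current == State::Running {
            return Ok(actions); // already serving, nothing to do
        }
        // Assumption: the previously active model is put to sleep first.
        if let Some(prev) = self.active.take() {
            if prev != model {
                actions.push(format!("sleep {prev}"));
                self.states.insert(prev, State::Sleeping);
            }
        }
        match current {
            // Lazy start: the process is spawned on the first request.
            State::NotStarted => {
                self.states.insert(model.to_string(), State::Starting);
                actions.push(format!("spawn {model}"));
            }
            State::Sleeping => actions.push(format!("wake {model}")),
            State::Starting | State::Running => {}
        }
        self.states.insert(model.to_string(), State::Running);
        self.active = Some(model.to_string());
        Ok(actions)
    }
}
```

In this sketch, a first request for `llama` yields a spawn, a following request for `mistral` yields a sleep and then a spawn, and returning to `llama` yields a sleep and then a wake instead of a fresh start.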

Modules§

validate: Validation tool for sleep/wake cycles

Structs§

CheckpointConfig: Configuration for CUDA/CRIU-based checkpointing (sleep levels 3 and 4).
Config: Top-level configuration
CostAwarePolicy: Cost-aware coalescing policy.
FifoPolicy: FIFO policy - switch immediately on first request
ModelConfig: Configuration for a single model
ModelSwitcher: The model switcher coordinates wake/sleep transitions
ModelSwitcherLayer: Layer that adds model switching to a service
ModelSwitcherService: Service that wraps requests with model switching
Orchestrator: Orchestrator manages vLLM process lifecycle
PolicyConfig: Policy configuration
PolicyContext: Context provided to policies when making switch decisions
ScheduleContext: Context provided to the background scheduler on each tick
TimeSlicePolicy: Drain-first scheduling policy with a proactive background scheduler.

Enums§

OrchestratorError: Errors from the orchestrator
PolicyDecision: Decision returned by policy
ProcessState: State of a model’s vLLM process
SleepLevel: Sleep level for hibernating models
SwitchError: Errors from the switcher
SwitcherState: State of the model switcher

Traits§

SwitchPolicy: Policy trait for controlling model switching behavior
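The split between a policy trait, a context, and a decision can be illustrated with a standalone sketch. Every name and field below (`Policy`, `Ctx`, `Decision`, `Fifo`, `MinBatch`) is hypothetical and does not reflect the actual signatures of `SwitchPolicy`, `PolicyContext`, or `PolicyDecision`. `Fifo` mirrors the documented "switch immediately on first request" behavior; `MinBatch` is an invented example of a policy that defers switching, not a description of `CostAwarePolicy` or `TimeSlicePolicy`.

```rust
/// Hypothetical context: what a policy sees when a request arrives
/// for a model that is not currently awake.
struct Ctx {
    /// Requests still in flight on the active model.
    in_flight: usize,
    /// Requests queued for the target model.
    queued_for_target: usize,
}

/// Hypothetical decision type.
#[derive(Debug, PartialEq)]
enum Decision {
    SwitchNow,
    Defer,
}

/// Hypothetical policy trait: maps a context to a decision.
trait Policy {
    fn decide(&self, ctx: &Ctx) -> Decision;
}

/// Switches as soon as any request is waiting.
struct Fifo;

impl Policy for Fifo {
    fn decide(&self, _ctx: &Ctx) -> Decision {
        Decision::SwitchNow
    }
}

/// Invented example: switch only when the active model is idle or
/// enough requests have accumulated for the target model.
struct MinBatch {
    min_queued: usize,
}

impl Policy for MinBatch {
    fn decide(&self, ctx: &Ctx) -> Decision {
        if ctx.in_flight == 0 || ctx.queued_for_target >= self.min_queued {
            Decision::SwitchNow
        } else {
            Decision::Defer
        }
    }
}
```

Keeping the decision logic behind a trait lets the switching machinery stay the same while the trade-off between latency (switch early) and throughput (switch less often) is tuned per deployment.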

Functions§

build_app: Build the complete llmux stack
