Crate llmux


§llmux

Zero-reload model switching for vLLM: manages multiple models on a shared GPU.

This crate provides:

  • Orchestrator: Lazily starts vLLM processes on first request
  • Switcher: Coordinates wake/sleep between models
  • Middleware: Axum layer that integrates with the onwards proxy
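
To illustrate how lazy start and wake/sleep coordination fit together, here is a minimal, synchronous sketch. It is not the crate's API: `SketchSwitcher`, `State`, and `ensure_ready` are hypothetical names, and the assumption that only one model is awake at a time is made purely for illustration. The state names follow the ones listed in the architecture diagram.

```rust
use std::collections::HashMap;

/// Lifecycle states, named after the ones the architecture diagram lists.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    NotStarted,
    Starting,
    Running,
    Sleeping,
}

/// Hypothetical stand-in for orchestrator plus switcher (not llmux's types).
pub struct SketchSwitcher {
    states: HashMap<String, State>,
    active: Option<String>,
}

impl SketchSwitcher {
    /// Register models without starting any of them (lazy start).
    pub fn new(models: &[&str]) -> Self {
        let states = models
            .iter()
            .map(|m| (m.to_string(), State::NotStarted))
            .collect();
        Self { states, active: None }
    }

    pub fn state(&self, model: &str) -> Option<State> {
        self.states.get(model).copied()
    }

    /// Make `model` the running model. Returns the transitions performed,
    /// or `None` if the model is unknown.
    pub fn ensure_ready(&mut self, model: &str) -> Option<Vec<String>> {
        let current = *self.states.get(model)?;
        let mut log = Vec::new();
        if current == State::Running {
            return Some(log); // already awake, nothing to do
        }
        // Assumption: the GPU is shared, so the active model sleeps first.
        if let Some(prev) = self.active.take() {
            self.states.insert(prev.clone(), State::Sleeping);
            log.push(format!("sleep {prev}"));
        }
        match current {
            State::NotStarted => {
                // A real process would presumably stay in Starting until it
                // is healthy; this sketch moves straight on to Running.
                self.states.insert(model.to_string(), State::Starting);
                log.push(format!("start {model}"));
            }
            State::Sleeping => log.push(format!("wake {model}")),
            State::Starting | State::Running => {}
        }
        self.states.insert(model.to_string(), State::Running);
        self.active = Some(model.to_string());
        Some(log)
    }
}
```

In this sketch, the first request for a model starts it, a request for a different model puts the active one to sleep, and a later request for the sleeping model wakes it instead of starting it again.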

§Architecture

┌─────────────────────────────────────────────────────────────┐
│                            llmux                            │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Orchestrator                                         │   │
│  │ - Spawns vLLM processes lazily                       │   │
│  │ - Tracks: NotStarted | Starting | Running | Sleeping │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Middleware Layer                                     │   │
│  │ - Extracts model from request                        │   │
│  │ - Ensures model ready before forwarding              │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Onwards Proxy                                        │   │
│  │ - Routes to vLLM by model name                       │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│      ┌───────────────────┼───────────────────┐             │
│      ▼                   ▼                   ▼             │
│  [vLLM:8001]        [vLLM:8002]         [vLLM:8003]        │
│   (llama)           (mistral)           (qwen)            │
└─────────────────────────────────────────────────────────────┘
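
The last hop in the diagram, routing by model name, can be sketched as a lookup table. The model names and ports are the ones from the diagram; the helper names `diagram_routes` and `upstream_for`, the `127.0.0.1` host, and the URL shape are assumptions for illustration and do not describe how the onwards proxy is actually configured.

```rust
use std::collections::HashMap;

/// Build the model-to-port table shown in the diagram.
pub fn diagram_routes() -> HashMap<&'static str, u16> {
    HashMap::from([("llama", 8001), ("mistral", 8002), ("qwen", 8003)])
}

/// Resolve a model name to an upstream base URL.
/// The localhost host and URL format are assumptions, not llmux behavior.
pub fn upstream_for(routes: &HashMap<&'static str, u16>, model: &str) -> Option<String> {
    routes
        .get(model)
        .map(|port| format!("http://127.0.0.1:{port}"))
}
```

An unknown model name yields `None`, which a real middleware would presumably turn into an error response before anything is forwarded.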

Modules§

validate
Validation tool for sleep/wake cycles

Structs§

CheckpointConfig
Configuration for CUDA/CRIU-based checkpointing (sleep levels 3 and 4).
Config
Top-level configuration
CostAwarePolicy
Cost-aware coalescing policy.
FifoPolicy
FIFO policy: switches immediately on the first request
ModelConfig
Configuration for a single model
ModelSwitcher
The model switcher coordinates wake/sleep transitions
ModelSwitcherLayer
Layer that adds model switching to a service
ModelSwitcherService
Service that wraps requests with model switching
Orchestrator
Orchestrator manages vLLM process lifecycle
PolicyConfig
Policy configuration
PolicyContext
Context provided to policies when making switch decisions
ScheduleContext
Context provided to the background scheduler on each tick
TimeSlicePolicy
Drain-first scheduling policy with a proactive background scheduler.

Enums§

OrchestratorError
Errors from the orchestrator
PolicyDecision
Decision returned by policy
ProcessState
State of a model’s vLLM process
SleepLevel
Sleep level for hibernating models
SwitchError
Errors from the switcher
SwitcherState
State of the model switcher

Traits§

SwitchPolicy
Policy trait for controlling model switching behavior
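
As a rough illustration of what a pluggable switching policy can look like, the sketch below contrasts an immediate (FIFO-style) policy with a simple coalescing one. Everything here is hypothetical: `SketchPolicy`, `SketchContext`, `SketchDecision`, and their fields are invented for the example and are not the signatures of `SwitchPolicy`, `PolicyContext`, or `PolicyDecision`. The coalescing rule (switch once enough requests are queued or the oldest has waited long enough) is an assumed heuristic, not the crate's `CostAwarePolicy`.

```rust
use std::time::Duration;

/// Hypothetical snapshot a policy might inspect; field names are assumptions.
pub struct SketchContext {
    /// Requests queued for the model that is not currently awake.
    pub pending_for_target: usize,
    /// How long the oldest of those requests has been waiting.
    pub oldest_wait: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SketchDecision {
    SwitchNow,
    Defer,
}

/// Hypothetical policy trait: decide whether to switch models now.
pub trait SketchPolicy {
    fn decide(&self, ctx: &SketchContext) -> SketchDecision;
}

/// Switches as soon as any request arrives for another model.
pub struct Fifo;

impl SketchPolicy for Fifo {
    fn decide(&self, _ctx: &SketchContext) -> SketchDecision {
        SketchDecision::SwitchNow
    }
}

/// Waits to batch requests, but never longer than `max_wait`.
pub struct Coalescing {
    pub min_batch: usize,
    pub max_wait: Duration,
}

impl SketchPolicy for Coalescing {
    fn decide(&self, ctx: &SketchContext) -> SketchDecision {
        if ctx.pending_for_target >= self.min_batch || ctx.oldest_wait >= self.max_wait {
            SketchDecision::SwitchNow
        } else {
            SketchDecision::Defer
        }
    }
}
```

The trade-off this sketch shows is latency versus switch cost: the FIFO-style policy minimizes waiting for the first request, while the coalescing policy amortizes each switch over more requests at the price of a bounded delay.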

Functions§

build_app
Build the complete llmux stack