LRU-evicting cache for loaded inference backends.
Solves three problems at once:
- Cold start on every call. Before: `FluxBackend::load()`/`LtxBackend::load()`/`KokoroBackend::load()` ran for every `generate_image`/`generate_video`/`synth_tts` request, paying the 1–14 s model-load cost on a hot path. After: the first call loads; subsequent calls get a cheap `Arc<Mutex<T>>` handle.
- Concurrent calls racing on the same backend. The mlx-rs `Array` is `!Sync`; two tokio tasks calling the same backend simultaneously is undefined behavior. The per-entry `Mutex` serializes concurrent callers onto the same backend. Different backends still run in parallel (MLX itself queues them on the single Metal driver).
- Unbounded RAM growth. Before: loading Flux + LTX + Kokoro + every text-gen model held ~30 GB of quantized weights forever. The cache tracks an approximate per-entry size (the sum of the model directory's `.safetensors` bytes) and evicts LRU entries once the total exceeds `budget_bytes`. The default budget comes from `CAR_INFERENCE_MODEL_CACHE_MB` (default 24 GB); set it to 0 to effectively disable caching (see the budget sketch after this list).
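A minimal sketch of how that default could be derived, assuming straight megabyte parsing; `budget_bytes_from_env` is an illustrative helper, not part of this module's API:

```rust
use std::env;

// Illustrative helper (not this crate's API): read the documented
// CAR_INFERENCE_MODEL_CACHE_MB variable, defaulting to 24 GB. A value
// of 0 yields a zero budget, which effectively disables caching.
fn budget_bytes_from_env() -> u64 {
    let mb = env::var("CAR_INFERENCE_MODEL_CACHE_MB")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(24 * 1024); // 24 GB expressed in MB
    mb * 1024 * 1024
}
```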
The cache is generic over `T: Send + 'static`. Inference backends
don't need to implement any trait; just wrap them on insert.
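A minimal usage sketch. `get_or_load` is an assumed accessor name and `DummyBackend` a stand-in type; only the `Arc<Mutex<T>>` handle shape is taken from the docs above:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for a real backend such as FluxBackend; real loads take 1–14 s.
struct DummyBackend;

impl DummyBackend {
    fn load() -> Self {
        DummyBackend
    }

    fn generate(&self, prompt: &str) -> String {
        format!("output for {prompt}")
    }
}

// `get_or_load` is hypothetical: the first call runs the loader, later
// calls clone the cached Arc<Mutex<T>> cheaply.
fn generate_image(cache: &BackendCache<DummyBackend>, prompt: &str) -> String {
    let backend: Arc<Mutex<DummyBackend>> = cache.get_or_load("dummy", DummyBackend::load);

    // Hold the lock for the whole inference call: mlx-rs `Array` is !Sync,
    // so concurrent callers on the same backend must be serialized.
    let guard = backend.lock().unwrap();
    guard.generate(prompt)
}
```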
Invariant: an evicted entry is only removed from the cache map; any
outstanding `Arc<Mutex<T>>` handle continues to work until the last
caller drops it. This makes eviction safe even during a long-running
inference call: RAM is reclaimed lazily when the last user finishes.
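A sketch of why that invariant holds, assuming the map stores `Arc<Mutex<T>>` clones (field and type names here are illustrative, not this crate's internals):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct Entry<T> {
    handle: Arc<Mutex<T>>,
    size_bytes: u64,
}

struct CacheInner<T> {
    entries: HashMap<String, Entry<T>>,
    total_bytes: u64,
}

impl<T> CacheInner<T> {
    // Eviction drops only the map's Arc clone. A caller still holding a
    // handle (e.g. mid-inference) keeps the backend alive; its RAM is
    // reclaimed when that last Arc is dropped, not here.
    fn evict(&mut self, key: &str) {
        if let Some(entry) = self.entries.remove(key) {
            self.total_bytes = self.total_bytes.saturating_sub(entry.size_bytes);
            drop(entry.handle); // may or may not be the last reference
        }
    }
}
```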
Structs

- `BackendCache` - LRU-bounded cache of loaded inference backends.
Functions

- `estimate_model_size` - Sum the sizes of all `.safetensors` files under `model_dir`. A loose upper-bound RAM estimate for the loaded model (quantized tensors live in MLX-owned memory roughly matching their on-disk size).
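One plausible shape for this function, assuming it scans only the top level of `model_dir` (whether the real implementation recurses into subdirectories is not stated here):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Sketch of the documented behavior: sum the byte sizes of every
// `.safetensors` file directly under `model_dir`.
fn estimate_model_size(model_dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(model_dir)? {
        let entry = entry?;
        let path = entry.path();
        let is_safetensors = path.extension().is_some_and(|ext| ext == "safetensors");
        if is_safetensors && entry.file_type()?.is_file() {
            total += entry.metadata()?.len();
        }
    }
    Ok(total)
}
```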
Type Aliases

- `CachedBackend` - Handle to a cached backend. Callers lock the inner mutex for the duration of an inference call to serialize with concurrent requests for the same model.
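Given that description, the alias presumably reduces to a shared, lockable handle. A sketch; the generic parameter and the helper are assumptions, not this crate's API:

```rust
use std::sync::{Arc, Mutex};

// Presumed shape, inferred from the module docs; the real alias may differ.
type CachedBackend<T> = Arc<Mutex<T>>;

// Illustrative helper: hold the lock for the entire inference call so
// concurrent requests for the same model are serialized.
fn with_backend<T, R>(handle: &CachedBackend<T>, infer: impl FnOnce(&mut T) -> R) -> R {
    let mut backend = handle.lock().expect("backend mutex poisoned");
    infer(&mut backend)
}
```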