Module backend_cache


LRU-evicting cache for loaded inference backends.

Solves three problems at once:

  1. Cold start on every call. Before: FluxBackend::load() / LtxBackend::load() / KokoroBackend::load() ran for every generate_image / generate_video / synth_tts request, paying the 1–14 s model-load cost on a hot path. After: first call loads, subsequent calls get a cheap Arc<Mutex<T>> handle.

  2. Concurrent calls racing on the same backend. mlx-rs Array is !Sync, so two tokio tasks calling into the same backend simultaneously would be undefined behavior. The per-entry Mutex serializes concurrent callers on the same backend; different backends still run in parallel (MLX itself queues their work on the single Metal driver).

  3. Unbounded RAM growth. Before: loading Flux + LTX + Kokoro + every text-gen model held ~30 GB of quantized weights forever. After: the cache tracks an approximate per-entry size (the sum of the model directory’s .safetensors bytes) and evicts least-recently-used entries once the total exceeds budget_bytes. The budget is configured via CAR_INFERENCE_MODEL_CACHE_MB and defaults to 24 GB; set it to 0 to effectively disable caching. A sketch of the whole mechanism follows this list.
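
A minimal, self-contained sketch of the mechanism those three points describe. Everything here is illustrative rather than the real API: Entry, Cache, get_or_insert_with, and evict_to_budget are made-up names, and the real cache presumably type-erases its entries so heterogeneous backends can share one budget, while this sketch keeps a single T for brevity.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Instant;

// Illustrative only: the real BackendCache's fields and methods may differ.
struct Entry<T> {
    backend: Arc<Mutex<T>>,
    size_bytes: u64,   // approximate, e.g. from estimate_model_size
    last_used: Instant,
}

struct Cache<T> {
    entries: HashMap<String, Entry<T>>,
    budget_bytes: u64,
}

impl<T: Send + 'static> Cache<T> {
    /// First call per key pays the load cost; later calls return a cheap
    /// clone of the existing Arc<Mutex<T>> handle.
    fn get_or_insert_with(
        &mut self,
        key: &str,
        size_bytes: u64,
        load: impl FnOnce() -> T,
    ) -> Arc<Mutex<T>> {
        if let Some(entry) = self.entries.get_mut(key) {
            entry.last_used = Instant::now();
            return Arc::clone(&entry.backend);
        }
        let backend = Arc::new(Mutex::new(load()));
        self.entries.insert(
            key.to_owned(),
            Entry {
                backend: Arc::clone(&backend),
                size_bytes,
                last_used: Instant::now(),
            },
        );
        self.evict_to_budget();
        backend // still valid even if the entry was just evicted
    }

    /// Drop least-recently-used entries until the tracked total fits the
    /// budget. With budget_bytes == 0 this evicts everything immediately,
    /// which is why a budget of 0 effectively disables caching.
    fn evict_to_budget(&mut self) {
        while self.entries.values().map(|e| e.size_bytes).sum::<u64>() > self.budget_bytes {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            match lru {
                Some(key) => { self.entries.remove(&key); }
                None => break,
            }
        }
    }
}
```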

The cache is generic over T: Send + 'static. Inference backends don’t need to implement any trait; they are simply wrapped in an Arc<Mutex<T>> on insert.

Invariant: eviction only removes an entry from the cache map; any outstanding Arc<Mutex<T>> handle keeps working until the last caller drops it. This makes eviction safe even during a long-running inference call: RAM is reclaimed lazily, once the last user finishes.
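
A tiny demonstration of that invariant with plain std types (the string stands in for a loaded backend):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let mut map: HashMap<&str, Arc<Mutex<&str>>> = HashMap::new();
    map.insert("flux", Arc::new(Mutex::new("loaded weights")));

    // An in-flight caller takes a handle...
    let handle = Arc::clone(&map["flux"]);

    // ...then the entry is evicted: gone from the map,
    map.remove("flux");
    assert!(!map.contains_key("flux"));

    // ...but the handle keeps working until the last clone is dropped.
    assert_eq!(*handle.lock().unwrap(), "loaded weights");
    drop(handle); // only now is the backend's memory reclaimed
}
```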

Structs§

BackendCache
LRU-bounded cache of loaded inference backends.

Functions§

estimate_model_size
Sum the sizes of all .safetensors files under model_dir. This gives a loose upper-bound RAM estimate for the loaded model, since quantized tensors live in MLX-owned memory at roughly their on-disk size.
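
A sketch of what that computation looks like, assuming a recursive directory walk; the real function’s signature and error handling may differ:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sum the byte sizes of every `.safetensors` file under `model_dir`,
/// recursing into subdirectories. (Sketch: does not guard against
/// symlink cycles.)
fn estimate_model_size(model_dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(model_dir)? {
        let path = entry?.path();
        if path.is_dir() {
            total += estimate_model_size(&path)?;
        } else if path.extension().is_some_and(|ext| ext == "safetensors") {
            total += fs::metadata(&path)?.len();
        }
    }
    Ok(total)
}
```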

Type Aliases§

CachedBackend
Handle to a cached backend. Callers lock the inner mutex for the duration of an inference call to serialize with concurrent requests for the same model.
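
Given the description above, the alias is presumably Arc<Mutex<T>> (possibly with an async-aware mutex such as tokio’s rather than std’s). A sketch of the locking discipline, where with_backend is an illustrative helper:

```rust
use std::sync::{Arc, Mutex};

type CachedBackend<T> = Arc<Mutex<T>>; // assumed shape, per the module docs

/// Hold the lock for the entire inference call, so concurrent requests
/// for the same model queue here instead of racing on !Sync mlx-rs state.
fn with_backend<T, R>(backend: &CachedBackend<T>, infer: impl FnOnce(&mut T) -> R) -> R {
    let mut guard = backend.lock().expect("backend mutex poisoned");
    infer(&mut guard)
    // guard drops here; the next queued caller acquires the lock
}
```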