Module model_cache

In-process model cache for GGUF files.

Avoids reloading model weights for each request by keeping a bounded set of ModelEntry values in a ModelCache. The cache uses LRU-like eviction (evict the entry with the longest idle time) when the slot limit is reached.
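The eviction rule described above can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: `SketchEntry`, `SketchCache`, and `touch_or_insert_at` are hypothetical names, the real `ModelEntry` fields are not shown on this page, and the caller passes the current time explicitly only so the behaviour is deterministic.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Instant;

/// Hypothetical stand-in for `ModelEntry`; the real fields are not documented here.
#[derive(Debug, Clone)]
pub struct SketchEntry {
    pub path: String,
    /// Time of the most recent access, used to compute idle time.
    pub last_used: Instant,
}

/// Hypothetical bounded cache keyed by model path (assumed key, for illustration).
pub struct SketchCache {
    max_slots: usize,
    entries: Mutex<HashMap<String, SketchEntry>>,
}

impl SketchCache {
    pub fn new(max_slots: usize) -> Self {
        Self { max_slots, entries: Mutex::new(HashMap::new()) }
    }

    /// Returns `true` on a cache hit. On a miss, inserts the entry, first
    /// evicting the entry with the longest idle time if all slots are in use.
    pub fn touch_or_insert_at(&self, path: &str, now: Instant) -> bool {
        let mut map = self.entries.lock().unwrap();

        // Hit: refresh the access time so the entry is no longer the idlest.
        if let Some(entry) = map.get_mut(path) {
            entry.last_used = now;
            return true;
        }

        // A cache with zero slots can never hold anything.
        if self.max_slots == 0 {
            return false;
        }

        // Full: the oldest `last_used` corresponds to the longest idle time.
        if map.len() >= self.max_slots {
            let victim = map
                .values()
                .min_by_key(|e| e.last_used)
                .map(|e| e.path.clone());
            if let Some(key) = victim {
                map.remove(&key);
            }
        }

        map.insert(
            path.to_string(),
            SketchEntry { path: path.to_string(), last_used: now },
        );
        false
    }

    pub fn contains(&self, path: &str) -> bool {
        self.entries.lock().unwrap().contains_key(path)
    }

    pub fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}
```

With two slots, inserting `a.gguf` and `b.gguf`, touching `a.gguf` again, and then inserting `c.gguf` would evict `b.gguf`, since it has been idle longest. The linear scan for the victim is acceptable when the slot limit is small; a real cache might track ordering differently.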

A companion ModelWarmup helper runs a small number of dummy inference passes on a freshly-loaded engine so that internal caches and JIT paths are primed before the first real request.
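As a rough sketch of the warm-up idea, assuming the engine exposes some forward-pass method: the `DummyForward` trait, the `warm_up` function, and `CountingEngine` below are all hypothetical and do not reflect the actual `ModelWarmup` or `InferenceEngine` API, which this page does not show.

```rust
/// Hypothetical abstraction over an engine that can run one forward pass.
/// The real `InferenceEngine` interface is not documented on this page.
pub trait DummyForward {
    fn forward(&mut self, tokens: &[u32]) -> Result<(), String>;
}

/// Runs `passes` dummy forward passes over a placeholder prompt and returns
/// the number of passes completed. Stops at the first error.
pub fn warm_up<E: DummyForward>(
    engine: &mut E,
    passes: usize,
    prompt_len: usize,
) -> Result<usize, String> {
    // Placeholder token ids (all zeros); at least one token is always sent.
    let dummy = vec![0u32; prompt_len.max(1)];
    for _ in 0..passes {
        // The output is discarded: only the side effect of priming matters.
        engine.forward(&dummy)?;
    }
    Ok(passes)
}

/// Toy engine that only counts calls, used to exercise `warm_up`.
pub struct CountingEngine {
    pub calls: usize,
}

impl DummyForward for CountingEngine {
    fn forward(&mut self, _tokens: &[u32]) -> Result<(), String> {
        self.calls += 1;
        Ok(())
    }
}
```

Calling `warm_up(&mut engine, 3, 8)` on a `CountingEngine` would invoke `forward` three times. The number of passes and the prompt length are illustrative parameters; the page does not state what values the real helper uses.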

Structs

ModelCache
Thread-safe in-process model cache.
ModelCacheConfig
Configuration for ModelCache.
ModelCacheStats
Snapshot of cache utilisation metrics, suitable for serialisation to JSON.
ModelEntry
A single cached model entry, storing metadata about a loaded model.
ModelWarmup
Runs a small number of dummy inference passes on a freshly-initialised InferenceEngine to prime internal allocation caches and JIT paths before the first real request arrives.