In-process model cache for GGUF files.
Avoids reloading model weights for each request by keeping a bounded set of
ModelEntry values in a ModelCache. The cache uses LRU-like eviction
(evict the entry with the longest idle time) when the slot limit is reached.
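The eviction rule described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: `SketchCache` and `SketchEntry` are hypothetical stand-ins for `ModelCache` and `ModelEntry`, whose real fields and methods are not shown on this page, and a logical tick counter is assumed in place of wall-clock idle time so that the ordering is deterministic.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for `ModelEntry`: only the metadata the sketch needs.
#[derive(Debug, Clone)]
pub struct SketchEntry {
    pub path: String,
    /// Logical timestamp of the most recent access (assumption: a counter,
    /// not a real clock). A smaller value means the entry has been idle longer.
    pub last_used: u64,
}

struct Inner {
    tick: u64,
    entries: HashMap<String, SketchEntry>,
}

/// Hypothetical stand-in for `ModelCache`: a bounded map guarded by a mutex.
pub struct SketchCache {
    max_slots: usize,
    inner: Mutex<Inner>,
}

impl SketchCache {
    pub fn new(max_slots: usize) -> Self {
        Self {
            // Keep at least one slot so an insert always succeeds.
            max_slots: max_slots.max(1),
            inner: Mutex::new(Inner { tick: 0, entries: HashMap::new() }),
        }
    }

    /// Returns the entry for `path`. On a miss, evicts the longest-idle entry
    /// if every slot is taken, then inserts a fresh entry.
    pub fn get_or_insert(&self, path: &str) -> SketchEntry {
        let mut inner = self.inner.lock().unwrap();
        inner.tick += 1;
        let now = inner.tick;

        // Hit: refresh the access time and return a copy of the metadata.
        if let Some(entry) = inner.entries.get_mut(path) {
            entry.last_used = now;
            return entry.clone();
        }

        // Miss with a full cache: remove the entry with the oldest access.
        if inner.entries.len() >= self.max_slots {
            let victim = inner
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(key) = victim {
                inner.entries.remove(&key);
            }
        }

        let entry = SketchEntry { path: path.to_string(), last_used: now };
        inner.entries.insert(path.to_string(), entry.clone());
        entry
    }

    pub fn contains(&self, path: &str) -> bool {
        self.inner.lock().unwrap().entries.contains_key(path)
    }

    pub fn len(&self) -> usize {
        self.inner.lock().unwrap().entries.len()
    }
}
```

With two slots, loading `a.gguf`, `b.gguf`, `a.gguf` again, and then `c.gguf` would evict `b.gguf`, since it has gone longest without an access. The linear scan for the victim is a reasonable choice when the slot limit is small; a larger cache would likely want an ordered structure instead.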
A companion ModelWarmup helper runs a small number of dummy inference
passes on a freshly-loaded engine so that internal caches and JIT paths are
primed before the first real request.
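A warmup helper of this kind might look roughly like the sketch below. Everything here is assumed for illustration: the `Engine` trait, `SketchWarmup`, the placeholder prompt, and `CountingEngine` are hypothetical, and the real `ModelWarmup` and `InferenceEngine` signatures are not shown on this page.

```rust
/// Hypothetical minimal engine interface (assumption: the real
/// `InferenceEngine` API is not documented here).
pub trait Engine {
    fn infer(&mut self, prompt: &str) -> Result<String, String>;
}

/// Hypothetical stand-in for `ModelWarmup`.
pub struct SketchWarmup {
    passes: usize,
    prompt: String,
}

impl SketchWarmup {
    pub fn new(passes: usize) -> Self {
        // Assumption: a short fixed prompt is enough to exercise the engine.
        Self { passes, prompt: "Hello".to_string() }
    }

    /// Runs `passes` dummy inferences and discards their output.
    /// Returns the number of completed passes, or the first engine error.
    pub fn run<E: Engine>(&self, engine: &mut E) -> Result<usize, String> {
        for _ in 0..self.passes {
            engine.infer(&self.prompt)?;
        }
        Ok(self.passes)
    }
}

/// Toy engine that only counts calls, used to exercise the sketch.
pub struct CountingEngine {
    pub calls: usize,
}

impl Engine for CountingEngine {
    fn infer(&mut self, prompt: &str) -> Result<String, String> {
        self.calls += 1;
        Ok(format!("echo: {}", prompt))
    }
}
```

Calling `SketchWarmup::new(3).run(&mut engine)` on a fresh `CountingEngine` would invoke `infer` three times before any real request is served. Stopping at the first error, rather than continuing, is a design choice in this sketch: a model that fails during warmup is probably not fit to serve traffic.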
Structs
- ModelCache - Thread-safe in-process model cache.
- ModelCacheConfig - Configuration for ModelCache.
- ModelCacheStats - Snapshot of cache utilisation metrics, suitable for serialisation to JSON.
- ModelEntry - A single cached model entry, storing metadata about a loaded model.
- ModelWarmup - Runs a small number of dummy inference passes on a freshly-initialised InferenceEngine to prime internal allocation caches and JIT paths before the first real request arrives.