Shared inference engine — load a model once, share it across agents.
This module follows the same pattern as vnai::ai::TextGeneration:
the heavy resources (LlamaBackend, LlamaModel) are wrapped in Arc
so they can be cloned cheaply and shared between multiple Agent instances.
Each agent creates its own LlamaContext (KV cache), so agents don’t
interfere with each other.
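For illustration, a minimal sketch of this sharing pattern. The unit structs below are placeholder stand-ins for the real llama binding types, not the actual bindings; the point is that cloning the engine only bumps Arc refcounts, while each agent gets a fresh context of its own.

```rust
use std::sync::Arc;

// Placeholder stand-ins for the heavy llama.cpp resources.
struct LlamaBackend;
struct LlamaModel;
struct LlamaContext; // per-agent KV cache

#[derive(Clone)]
struct InferenceEngine {
    backend: Arc<LlamaBackend>,
    model: Arc<LlamaModel>,
}

impl InferenceEngine {
    fn new_context(&self) -> LlamaContext {
        // Each agent gets its own context (KV cache), so agents
        // never share mutable decoding state.
        LlamaContext
    }
}

fn main() {
    let engine = InferenceEngine {
        backend: Arc::new(LlamaBackend),
        model: Arc::new(LlamaModel),
    };
    // Cloning is cheap: the weights are loaded once and shared.
    let for_agent_a = engine.clone();
    let for_agent_b = engine.clone();
    let _ctx_a = for_agent_a.new_context();
    let _ctx_b = for_agent_b.new_context();
}
```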
§Concurrency
- Without a scheduler: agents run fully in parallel (safe, but GPU-heavy).
- With InferenceScheduler: a semaphore limits how many agents can run inference concurrently; max_concurrent = 1 serializes all inference (see the sketch after this list).
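One way such a scheduler could be built on tokio::sync::Semaphore, shown as a hedged sketch: the struct and method names (new, acquire) mirror the items documented below but are assumptions for illustration, not this module's confirmed API.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// A semaphore caps concurrent inference; the permit releases its
// slot automatically when dropped (RAII).
struct InferenceScheduler {
    slots: Arc<Semaphore>,
}

struct InferencePermit {
    _slot: OwnedSemaphorePermit, // slot released on drop
}

impl InferenceScheduler {
    fn new(max_concurrent: usize) -> Self {
        Self { slots: Arc::new(Semaphore::new(max_concurrent)) }
    }

    async fn acquire(&self) -> InferencePermit {
        let slot = self
            .slots
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        InferencePermit { _slot: slot }
    }
}

#[tokio::main]
async fn main() {
    // max_concurrent = 1 serializes all inference across agents.
    let scheduler = Arc::new(InferenceScheduler::new(1));
    let mut tasks = Vec::new();
    for agent in 0..3 {
        let scheduler = scheduler.clone();
        tasks.push(tokio::spawn(async move {
            let _permit = scheduler.acquire().await;
            println!("agent {agent} holds the only inference slot");
            // ...decode tokens here while holding the permit...
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```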
Modules§
- templates - Common chat templates for models that lack them.
Structs§
- InferenceConfig - Configuration for loading a model.
- InferenceEngine - Shared inference engine that holds the backend + model in Arcs.
- InferencePermit - RAII guard — releases the scheduler slot and returns the context to the pool on drop (sketched below).
- InferenceScheduler - Controls how many agents can perform inference at the same time.
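The context-pooling half of InferencePermit's Drop behavior could look like the following sketch. The field names and the pool representation are assumptions for illustration (the real permit also releases its scheduler slot, as in the semaphore sketch above).

```rust
use std::sync::{Arc, Mutex};

// Placeholder for the per-agent context (KV cache).
struct LlamaContext;

// Hypothetical sketch: dropping the permit returns its borrowed
// context to a shared pool.
struct InferencePermit {
    context: Option<LlamaContext>,
    pool: Arc<Mutex<Vec<LlamaContext>>>,
}

impl Drop for InferencePermit {
    fn drop(&mut self) {
        if let Some(ctx) = self.context.take() {
            // Recycle the context so the next agent reuses it
            // instead of paying to build a fresh KV cache.
            self.pool.lock().unwrap().push(ctx);
        }
    }
}

fn main() {
    let pool = Arc::new(Mutex::new(vec![LlamaContext]));
    let permit = InferencePermit {
        context: pool.lock().unwrap().pop(),
        pool: pool.clone(),
    };
    drop(permit); // the context goes back into the pool here
    assert_eq!(pool.lock().unwrap().len(), 1);
}
```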