Module inference

Shared inference engine — load a model once, share it across agents.

This module follows the same pattern as vnai::ai::TextGeneration: the heavy resources (LlamaBackend, LlamaModel) are wrapped in Arc so they can be cloned cheaply and shared between multiple Agent instances. Each agent creates its own LlamaContext (KV cache), so agents don’t interfere with each other.
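
The sharing pattern can be illustrated with a minimal, std-only sketch. Everything here (Engine, Backend, Model, Context, Agent) is a placeholder standing in for InferenceEngine, LlamaBackend, LlamaModel, LlamaContext, and the agent type; it shows the Arc-cloning shape described above, not the crate's actual API.

```rust
use std::sync::Arc;

// Stand-ins for the heavy llama.cpp resources held by the engine.
struct Backend;
struct Model;

// Hypothetical shape of the shared engine: both heavy resources live
// behind Arc, so cloning the engine only bumps two reference counts.
#[derive(Clone)]
struct Engine {
    backend: Arc<Backend>,
    model: Arc<Model>,
}

// Each agent owns its own context (KV cache), so agents share the
// model weights but never share mutable inference state.
struct Context;

struct Agent {
    engine: Engine,
    ctx: Context,
}

impl Agent {
    fn new(engine: Engine) -> Self {
        Self { engine, ctx: Context }
    }
}

fn main() {
    // Load once...
    let engine = Engine {
        backend: Arc::new(Backend),
        model: Arc::new(Model),
    };

    // ...share many times: each clone reuses the same weights in memory.
    let a = Agent::new(engine.clone());
    let b = Agent::new(engine.clone());
    let _ = (a, b);
}
```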

§Concurrency

  • Without a scheduler: Agents run fully in parallel (safe, but GPU-heavy).
  • With InferenceScheduler: A semaphore limits how many agents can run inference concurrently; max_concurrent = 1 serializes all inference (see the sketch after this list).
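
To make the scheduler semantics concrete, here is a minimal sketch of a semaphore-backed scheduler, assuming a tokio runtime (sync and macros features). Scheduler and its acquire method are hypothetical stand-ins for InferenceScheduler and InferencePermit; the real permit also returns a context to the pool on drop, which is omitted here.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct Scheduler {
    slots: Arc<Semaphore>,
}

impl Scheduler {
    fn new(max_concurrent: usize) -> Self {
        Self { slots: Arc::new(Semaphore::new(max_concurrent)) }
    }

    // Waits until a slot is free; the returned permit is the RAII guard
    // that releases the slot when dropped.
    async fn acquire(&self) -> OwnedSemaphorePermit {
        self.slots.clone().acquire_owned().await.expect("semaphore closed")
    }
}

#[tokio::main]
async fn main() {
    // max_concurrent = 1 serializes all inference.
    let scheduler = Arc::new(Scheduler::new(1));

    let mut tasks = Vec::new();
    for agent_id in 0..3 {
        let scheduler = scheduler.clone();
        tasks.push(tokio::spawn(async move {
            let permit = scheduler.acquire().await;
            println!("agent {agent_id} running inference");
            // ... run the model here ...
            drop(permit); // slot is released on drop
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```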

Modules§

templates
Common chat templates for models that lack them.

Structs§

InferenceConfig
Configuration for loading a model.
InferenceEngine
Shared inference engine that holds the backend + model in Arcs.
InferencePermit
RAII guard — releases the scheduler slot and returns the context to the pool on drop.
InferenceScheduler
Controls how many agents can perform inference at the same time.