Module inference

Shared inference engine — load a model once, share it across agents.

This module follows the same pattern as vnai::ai::TextGeneration: the heavy resources (LlamaBackend, LlamaModel) are wrapped in Arc so they can be cloned cheaply and shared between multiple Agent instances. Each agent creates its own LlamaContext (KV cache), so agents don’t interfere with each other.
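
The sharing pattern can be illustrated with a minimal, std-only sketch. Everything here (Engine, Backend, Model, Context, Agent) is a placeholder standing in for InferenceEngine, LlamaBackend, LlamaModel, LlamaContext, and the agent type; it shows the Arc-cloning shape described above, not the crate's actual API.

```rust
use std::sync::Arc;

// Stand-ins for the heavy llama.cpp resources held by the engine.
struct Backend;
struct Model;

// Hypothetical shape of the shared engine: both heavy resources live
// behind Arc, so cloning the engine only bumps two reference counts.
#[derive(Clone)]
struct Engine {
    backend: Arc<Backend>,
    model: Arc<Model>,
}

// Each agent owns its own context (KV cache), so agents share the
// model weights but never share mutable inference state.
struct Context;

struct Agent {
    engine: Engine,
    ctx: Context,
}

impl Agent {
    fn new(engine: Engine) -> Self {
        Self { engine, ctx: Context }
    }
}

fn main() {
    // Load once...
    let engine = Engine {
        backend: Arc::new(Backend),
        model: Arc::new(Model),
    };

    // ...share many times: each clone reuses the same weights in memory.
    let a = Agent::new(engine.clone());
    let b = Agent::new(engine.clone());
    let _ = (a, b);
}
```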

§Concurrency

  • Without a scheduler: Agents run fully in parallel (safe, but GPU-heavy).
  • With InferenceScheduler: A semaphore limits how many agents can run inference concurrently; max_concurrent = 1 serializes all inference (see the sketch after this list).
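
To make the scheduler semantics concrete, here is a minimal sketch of a semaphore-backed scheduler, assuming a tokio runtime (sync and macros features). Scheduler and its acquire method are hypothetical stand-ins for InferenceScheduler and InferencePermit; the real permit also returns a context to the pool on drop, which is omitted here.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct Scheduler {
    slots: Arc<Semaphore>,
}

impl Scheduler {
    fn new(max_concurrent: usize) -> Self {
        Self { slots: Arc::new(Semaphore::new(max_concurrent)) }
    }

    // Waits until a slot is free; the returned permit is the RAII guard
    // that releases the slot when dropped.
    async fn acquire(&self) -> OwnedSemaphorePermit {
        self.slots.clone().acquire_owned().await.expect("semaphore closed")
    }
}

#[tokio::main]
async fn main() {
    // max_concurrent = 1 serializes all inference.
    let scheduler = Arc::new(Scheduler::new(1));

    let mut tasks = Vec::new();
    for agent_id in 0..3 {
        let scheduler = scheduler.clone();
        tasks.push(tokio::spawn(async move {
            let permit = scheduler.acquire().await;
            println!("agent {agent_id} running inference");
            // ... run the model here ...
            drop(permit); // slot is released on drop
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```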

Modules§

templates
Common chat templates for models that lack them.

Structs§

InferenceConfig
Configuration for loading a model.
InferenceEngine
Shared inference engine that holds the backend + model in Arcs.
InferencePermit
RAII guard — releases the scheduler slot and returns the context to the pool on drop.
InferenceScheduler
Controls how many agents can perform inference at the same time.