pub struct InferenceScheduler { /* private fields */ }
Controls how many agents can perform inference at the same time.
This is a simple counting semaphore: an agent calls acquire() before running
its inference loop and frees the slot by dropping the returned permit when done.
If all max_concurrent slots are already in use, acquire() blocks until one is freed.
§Why?
Each agent has its own LlamaContext (KV cache), which is independent and
thread-safe, but all contexts share the same GPU for compute. Running too
many inferences in parallel can:
- Exhaust GPU VRAM (multiple KV caches)
- Thrash the GPU scheduler (context switches)
- Cause OOM errors on smaller GPUs
A scheduler with max_concurrent = 1 serializes all inference (like the
worker-thread pattern in vnai::ai), while higher values allow controlled
parallelism.
§Example
use llama_cpp_v3_agent_sdk::InferenceScheduler;
use std::sync::Arc;
// Allow at most 2 agents to infer concurrently:
let scheduler = Arc::new(InferenceScheduler::new(2));
// Use with AgentBuilder:
// AgentBuilder::new()
// .engine(engine.clone())
// .scheduler(scheduler.clone())
//     .build()?;
Implementations§
impl InferenceScheduler
pub fn new(max_concurrent: usize) -> Self
Create a new scheduler with the given concurrency limit.
- max_concurrent = 1 → fully serialized (one agent at a time)
- max_concurrent = N → up to N agents run inference in parallel
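For instance, a minimal sketch mirroring the struct-level example above (the Arc wrapping is only needed if the scheduler is shared across agents):
use llama_cpp_v3_agent_sdk::InferenceScheduler;
use std::sync::Arc;
// One agent at a time (worker-thread style):
let serialized = Arc::new(InferenceScheduler::new(1));
// Up to 4 agents in parallel, for GPUs with enough VRAM:
let parallel = Arc::new(InferenceScheduler::new(4));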
pub fn init_pool(
    &self,
    engine: &InferenceEngine,
    n_ctx: Option<u32>,
) -> Result<(), AgentError>
Pre-initialize the context pool with the given engine. This avoids lazy allocation during the first inference runs.
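For example, warming the pool once at startup might look like this (a sketch: engine is an already-constructed InferenceEngine, and the n_ctx value is purely illustrative):
// Pre-allocate contexts up front so the first acquire() does not pay
// the allocation cost; 4096 is an illustrative per-context size.
scheduler.init_pool(&engine, Some(4096))?;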
pub fn acquire(&self) -> InferencePermit<'_>
Acquire a permit and a context from the pool. Blocks if all slots are in use.
Returns an RAII guard that automatically releases the slot on drop.
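Usage sketch: hold the permit for the duration of one inference run and let it release on drop (run_inference here is a placeholder for the agent's actual inference loop):
{
    // Blocks until a slot (and pooled context) is free.
    let _permit = scheduler.acquire();
    run_inference(); // placeholder for the agent's inference loop
} // _permit dropped here, freeing the slot for the next agent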
pub fn try_acquire(&self) -> Option<InferencePermit<'_>>
Try to acquire a permit without blocking.
Returns None if all slots are in use.
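A non-blocking sketch that skips work when all slots are busy instead of queueing behind other agents (run_inference is again a placeholder):
match scheduler.try_acquire() {
    Some(_permit) => {
        // Slot obtained; run inference while the permit is held.
        run_inference(); // placeholder
    }
    None => {
        // All slots busy; defer or drop this request instead of blocking.
    }
}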
pub fn active_count(&self) -> usize
Number of currently active inferences.
pub fn max_concurrent(&self) -> usize
Maximum allowed concurrent inferences.
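The two getters are handy for lightweight monitoring, e.g. logging slot utilization:
println!(
    "inference slots in use: {}/{}",
    scheduler.active_count(),
    scheduler.max_concurrent()
);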