
Struct InferenceScheduler 

pub struct InferenceScheduler { /* private fields */ }

Controls how many agents can perform inference at the same time.

This is a simple counting semaphore: agents call acquire() before running their inference loop and release() when done. If max_concurrent slots are already in use, acquire() blocks until one is freed.

§Why?

Each agent has its own LlamaContext (KV cache), which is independent and thread-safe. However, all contexts share the same GPU for compute. Running too many inferences in parallel can:

  • Exhaust GPU VRAM (multiple KV caches)
  • Thrash the GPU scheduler (context switches)
  • Cause OOM errors on smaller GPUs

A scheduler with max_concurrent = 1 serializes all inference (like the worker-thread pattern in vnai::ai), while higher values allow controlled parallelism.
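The counting-semaphore pattern described above can be sketched with std primitives alone. This is a simplified model for illustration, not the SDK's actual implementation (the real InferenceScheduler also hands out pooled contexts with each permit):

```rust
use std::sync::{Condvar, Mutex};

/// Minimal counting semaphore in the spirit of InferenceScheduler.
pub struct Semaphore {
    max: usize,
    active: Mutex<usize>,
    cv: Condvar,
}

/// RAII guard: releasing happens automatically on drop.
pub struct Permit<'a> {
    sem: &'a Semaphore,
}

impl Semaphore {
    pub fn new(max: usize) -> Self {
        Self { max, active: Mutex::new(0), cv: Condvar::new() }
    }

    /// Block until a slot is free, then claim it.
    pub fn acquire(&self) -> Permit<'_> {
        let mut active = self.active.lock().unwrap();
        // Loop guards against spurious wakeups.
        while *active >= self.max {
            active = self.cv.wait(active).unwrap();
        }
        *active += 1;
        Permit { sem: self }
    }

    pub fn active_count(&self) -> usize {
        *self.active.lock().unwrap()
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        // Free the slot and wake one blocked acquirer.
        *self.sem.active.lock().unwrap() -= 1;
        self.sem.cv.notify_one();
    }
}

fn main() {
    let sem = Semaphore::new(2);
    let p1 = sem.acquire();
    let p2 = sem.acquire();
    assert_eq!(sem.active_count(), 2); // both slots in use
    drop(p1);
    assert_eq!(sem.active_count(), 1); // slot freed on drop
    drop(p2);
    assert_eq!(sem.active_count(), 0);
}
```

With `Semaphore::new(1)` this serializes callers exactly as the `max_concurrent = 1` configuration does; a third `acquire()` in the example above would block until one of the permits is dropped.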

§Example

use llama_cpp_v3_agent_sdk::InferenceScheduler;
use std::sync::Arc;

// Allow at most 2 agents to infer concurrently:
let scheduler = Arc::new(InferenceScheduler::new(2));

// Use with AgentBuilder:
// AgentBuilder::new()
//     .engine(engine.clone())
//     .scheduler(scheduler.clone())
//     .build()?;

Implementations§

impl InferenceScheduler

pub fn new(max_concurrent: usize) -> Self

Create a new scheduler with the given concurrency limit.

  • max_concurrent = 1 → fully serialized (one agent at a time)
  • max_concurrent = N → up to N agents run inference in parallel

pub fn init_pool(&self, engine: &InferenceEngine, n_ctx: Option<u32>) -> Result<(), AgentError>

Pre-initialize the context pool with the given engine. This avoids lazy context allocation during the first inference runs.


pub fn acquire(&self) -> InferencePermit<'_>

Acquire a permit and a context from the pool. Blocks if all slots are in use.

Returns an RAII guard that automatically releases the slot on drop.


pub fn try_acquire(&self) -> Option<InferencePermit<'_>>

Try to acquire a permit without blocking.

Returns None if all slots are in use.
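The non-blocking variant follows the same pattern. A minimal std-only sketch (again illustrative, not the SDK's code) shows the Option-returning shape:

```rust
use std::sync::Mutex;

/// Slot counter with a non-blocking claim, modeled on try_acquire.
pub struct Slots {
    max: usize,
    active: Mutex<usize>,
}

impl Slots {
    pub fn new(max: usize) -> Self {
        Self { max, active: Mutex::new(0) }
    }

    /// Claim a slot only if one is free; never blocks.
    pub fn try_acquire(&self) -> Option<()> {
        let mut active = self.active.lock().unwrap();
        if *active < self.max {
            *active += 1;
            Some(())
        } else {
            None
        }
    }

    /// Free a previously claimed slot.
    pub fn release(&self) {
        *self.active.lock().unwrap() -= 1;
    }
}

fn main() {
    let s = Slots::new(1);
    assert!(s.try_acquire().is_some()); // first claim succeeds
    assert!(s.try_acquire().is_none()); // all busy: refused, no blocking
    s.release();
    assert!(s.try_acquire().is_some()); // slot free again
}
```

A typical caller pattern is to fall back when `None` is returned, e.g. queue the request or report that the GPU is saturated, rather than blocking the calling thread.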


pub fn active_count(&self) -> usize

Number of currently active inferences.


pub fn max_concurrent(&self) -> usize

Maximum allowed concurrent inferences.
