pub struct AsyncInferenceEngine { /* private fields */ }
Async inference engine with bounded concurrency.
Wraps a synchronous InferenceEngine and provides async methods
that use spawn_blocking under the hood. A semaphore limits how
many inference requests can be in flight at once, which bounds
both memory and CPU usage.
Implementations
impl AsyncInferenceEngine
pub fn new(engine: InferenceEngine<'static>, max_concurrent: usize) -> Self
Create a new async inference engine wrapping the given engine.
max_concurrent controls how many inference requests may execute
concurrently. A value of 1 serializes all requests.
pub fn with_metrics(self, metrics: Arc<InferenceMetrics>) -> Self
Attach shared metrics for recording inference telemetry.
pub async fn generate(
    &self,
    prompt_tokens: Vec<u32>,
    max_tokens: usize,
) -> RuntimeResult<Vec<u32>>
Generate tokens asynchronously.
Waits until a semaphore permit is acquired, then dispatches the CPU-bound generation to a blocking thread.
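The permit-then-dispatch pattern described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `Semaphore` type and the `peak_concurrency` helper are hypothetical names, the semaphore here blocks an OS thread instead of suspending an async task, and `thread::yield_now()` stands in for the actual generation work.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (hypothetical, std-only stand-in for the
/// async semaphore the engine uses internally).
struct Semaphore {
    permits: Mutex<usize>,
    freed: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Wait until a permit is free, then take it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.freed.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.freed.notify_one();
    }
}

/// Runs `jobs` workers under a limit of `max_concurrent` and returns the
/// highest number of workers seen inside the guarded section at once.
fn peak_concurrency(jobs: usize, max_concurrent: usize) -> usize {
    let sem = Arc::new(Semaphore::new(max_concurrent));
    let state = Arc::new(Mutex::new((0usize, 0usize))); // (active, peak)
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (sem, state) = (Arc::clone(&sem), Arc::clone(&state));
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut s = state.lock().unwrap();
                    s.0 += 1;
                    s.1 = s.1.max(s.0);
                }
                // Placeholder for the CPU-bound generation step.
                thread::yield_now();
                state.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let peak = state.lock().unwrap().1;
    peak
}
```

With `max_concurrent` set to 1 the workers run strictly one at a time, which matches the "serializes all requests" behavior documented for `new`.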
pub async fn generate_streaming(
    &self,
    prompt_tokens: Vec<u32>,
    max_tokens: usize,
) -> RuntimeResult<UnboundedReceiver<u32>>
Generate tokens with streaming via an unbounded channel.
Returns a receiver that yields tokens as they are generated. The generation happens on a blocking thread; the receiver can be consumed asynchronously.
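The producer/consumer shape of streaming generation can be sketched as follows. This is a std-only approximation, not the engine's implementation: `stream_tokens` is a hypothetical function, `std::sync::mpsc::channel` stands in for the unbounded async channel, a plain thread stands in for the blocking task, and the "tokens" are dummy values derived from the prompt length.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical stand-in for streaming generation: spawns a producer
/// thread and returns the receiving half of an unbounded channel.
fn stream_tokens(prompt_tokens: Vec<u32>, max_tokens: usize) -> Receiver<u32> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Dummy token ids; a real engine would run the model here.
        let base = prompt_tokens.len() as u32;
        for i in 0..max_tokens as u32 {
            // Stop early if the consumer dropped the receiver.
            if tx.send(base + i).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });
    rx
}
```

The caller gets the receiver back immediately and can consume tokens as they arrive; iteration ends once the producer finishes and drops its sender.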
pub fn active_requests(&self) -> usize
Current number of active (in-flight) requests.
Computed as max_concurrent - available_permits.
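The bookkeeping behind this formula can be illustrated with a small model. `PermitCounter` is a hypothetical type used only for this sketch; the real engine keeps the permit count inside its semaphore, and treating "has capacity" as "at least one free permit" is an assumption based on the method descriptions.

```rust
/// Hypothetical model of the engine's permit bookkeeping.
struct PermitCounter {
    max_concurrent: usize,
    available_permits: usize,
}

impl PermitCounter {
    fn new(max_concurrent: usize) -> Self {
        Self { max_concurrent, available_permits: max_concurrent }
    }

    /// Take one permit if any is free; returns whether it succeeded.
    fn try_acquire(&mut self) -> bool {
        if self.available_permits == 0 {
            return false;
        }
        self.available_permits -= 1;
        true
    }

    /// Return one permit, never exceeding the configured maximum.
    fn release(&mut self) {
        if self.available_permits < self.max_concurrent {
            self.available_permits += 1;
        }
    }

    /// In-flight requests: max_concurrent - available_permits.
    fn active_requests(&self) -> usize {
        self.max_concurrent - self.available_permits
    }

    /// Assumed meaning of capacity: at least one permit is free.
    fn has_capacity(&self) -> bool {
        self.available_permits > 0
    }
}
```

Because the count is derived from the free permits, a value read this way is a snapshot and may already be stale when concurrent requests start or finish.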
pub fn max_concurrent(&self) -> usize
Maximum concurrent requests this engine allows.
pub fn has_capacity(&self) -> bool
Check if the engine has capacity for at least one more request.
pub fn engine(&self) -> &Arc<Mutex<InferenceEngine<'static>>>
Get a reference to the underlying engine (behind a mutex).
Auto Trait Implementations
impl Freeze for AsyncInferenceEngine
impl !RefUnwindSafe for AsyncInferenceEngine
impl Send for AsyncInferenceEngine
impl Sync for AsyncInferenceEngine
impl Unpin for AsyncInferenceEngine
impl UnsafeUnpin for AsyncInferenceEngine
impl !UnwindSafe for AsyncInferenceEngine
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.