pub struct InferenceEngine {
pub config: InferenceConfig,
pub unified_registry: UnifiedRegistry,
pub adaptive_router: AdaptiveRouter,
pub outcome_tracker: Arc<RwLock<OutcomeTracker>>,
pub registry: ModelRegistry,
pub router: ModelRouter,
/* private fields */
}
The main inference engine. Thread-safe, lazily loads models.
Now includes the unified registry, adaptive router, and outcome tracker for schema-driven model selection with learned performance profiles.
Fields
config: InferenceConfig
unified_registry: UnifiedRegistry
Unified model registry (local + remote).
adaptive_router: AdaptiveRouter
Adaptive router with three-phase selection.
outcome_tracker: Arc<RwLock<OutcomeTracker>>
Outcome tracker for learning from results.
registry: ModelRegistry
router: ModelRouter
Implementations
impl InferenceEngine
pub fn new(config: InferenceConfig) -> Self
pub async fn init_key_pool(&self)
Initialize key pool: register keys from all remote models and load persisted stats. Call this after construction (requires async).
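For example, a minimal construction sequence might look like the sketch below (it assumes an InferenceConfig value is obtained elsewhere; how that config is built is not shown on this page):
async fn build_engine(config: InferenceConfig) -> InferenceEngine {
    let engine = InferenceEngine::new(config);
    // Registers keys from all remote models and loads persisted key-pool stats.
    engine.init_key_pool().await;
    engine
}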
pub async fn warm_up<S: AsRef<str>>(
    &self,
    _schema_ids: &[S],
) -> Vec<Result<(), InferenceError>>
No-op on non-macOS — MLX doesn’t run here.
pub async fn route_adaptive(&self, prompt: &str) -> AdaptiveRoutingDecision
Route a prompt using the adaptive router (new). Returns full decision context.
pub fn route(&self, prompt: &str) -> RoutingDecision
Route a prompt to the best model without executing (legacy compat).
pub fn estimated_tokens(
    &self,
    req: &GenerateRequest,
    model_id: Option<&str>,
) -> (usize, usize, bool)
Estimate token count for a request against a specific model’s context window. Returns (estimated_input_tokens, context_window_tokens, fits).
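A small pre-flight helper built on this call (a sketch; the GenerateRequest is assumed to be constructed elsewhere, so no request fields are touched here):
fn fits_in_context(engine: &InferenceEngine, req: &GenerateRequest, model_id: &str) -> bool {
    // Tuple order follows the documented return: (input tokens, context window, fits).
    let (input_tokens, context_window, fits) = engine.estimated_tokens(req, Some(model_id));
    if !fits {
        eprintln!("prompt is ~{input_tokens} tokens; {model_id} has a {context_window}-token window");
    }
    fits
}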
pub async fn generate_tracked(
    &self,
    req: GenerateRequest,
) -> Result<InferenceResult, InferenceError>
Generate text from a prompt with outcome tracking.
Returns InferenceResult with trace_id for reporting outcomes.
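A hedged usage sketch (reading a trace_id from InferenceResult is an assumption based on the description above, not a confirmed field name, so it only appears in a comment):
async fn tracked_generate(
    engine: &InferenceEngine,
    req: GenerateRequest,
) -> Result<InferenceResult, InferenceError> {
    let result = engine.generate_tracked(req).await?;
    // The result's trace_id (assumed) links this call to a later outcome report
    // via the OutcomeTracker; persist learned profiles when convenient.
    let _ = engine.save_outcomes().await;
    Ok(result)
}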
pub async fn generate_tracked_stream(
    &self,
    req: GenerateRequest,
) -> Result<Receiver<StreamEvent>, InferenceError>
Stream a response with real-time token output.
Returns a channel receiver yielding StreamEvent variants:
- TextDelta(String) — partial text token
- ToolCallStart { name, index, id } — tool call begins
- ToolCallDelta { index, arguments_delta } — partial tool arguments
Use StreamAccumulator to collect events into a final result.
Local streaming behavior
Local models (both MLX and Candle backends) emit true incremental TextDelta
events as each token is generated. This enables:
- Token-by-token UI updates
- Overlapping speech synthesis with generation (for voice apps)
- Early cancellation when a stop sequence is detected
The channel capacity is 64 events, providing buffering for burst tokens without blocking the generation loop.
Example: voice app integration
let mut rx = engine.generate_tracked_stream(req).await?;
let mut text_buf = String::new();
while let Some(event) = rx.recv().await {
match event {
StreamEvent::TextDelta(delta) => {
text_buf.push_str(&delta);
// Feed text_buf to TTS when a sentence boundary is reached
}
StreamEvent::Done { text, .. } => break,
_ => {}
}
}
pub async fn route_context_snapshot(
    &self,
    prompt: &str,
    workload: RoutingWorkload,
    has_tools: bool,
    has_vision: bool,
) -> AdaptiveRoutingDecision
Route a prompt using the adaptive router without executing inference.
pub async fn generate(
    &self,
    req: GenerateRequest,
) -> Result<String, InferenceError>
Generate text from a prompt (legacy API, no outcome tracking).
When req.model is None, uses intelligent routing based on prompt complexity.
pub async fn tokenize(
    &self,
    model: &str,
    text: &str,
) -> Result<Vec<u32>, InferenceError>
Encode text via the named model’s tokenizer. Returns raw token IDs
without any chat-template wrapping or BOS-prepending — pair with
Self::detokenize for the round-trip property
detokenize(model, tokenize(model, s)) == s for any UTF-8 s.
Only local models have a tokenizer the runtime can call directly
(Candle/GGUF on Linux/Windows, MLX on Apple Silicon). For remote
models the call returns
InferenceError::UnsupportedMode — provider tokenizer endpoints
vary too widely to be portable here, and bundling tiktoken-style
tables would lock the registry to a fixed set of providers.
pub async fn detokenize(
    &self,
    model: &str,
    tokens: &[u32],
) -> Result<String, InferenceError>
Inverse of Self::tokenize: decode token IDs back to text.
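A round-trip sanity check built only from the two calls documented here (works for local models; remote models return InferenceError::UnsupportedMode as noted above):
async fn tokenizer_roundtrip(
    engine: &InferenceEngine,
    model: &str,
    text: &str,
) -> Result<bool, InferenceError> {
    let ids: Vec<u32> = engine.tokenize(model, text).await?;
    let decoded = engine.detokenize(model, &ids).await?;
    // Should hold for any UTF-8 input per the documented property.
    Ok(decoded == text)
}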
pub async fn embed(
    &self,
    req: EmbedRequest,
) -> Result<Vec<Vec<f32>>, InferenceError>
Generate embeddings for text using the dedicated embedding model. On Apple Silicon, uses the native MLX backend; on other platforms, uses Candle.
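A minimal shape-checking sketch (the EmbedRequest is constructed elsewhere; its fields are not assumed here):
async fn embedding_dims(
    engine: &InferenceEngine,
    req: EmbedRequest,
) -> Result<Vec<usize>, InferenceError> {
    // One Vec<f32> per input text, in the same order as the request.
    let vectors = engine.embed(req).await?;
    Ok(vectors.iter().map(|v| v.len()).collect())
}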
pub async fn rerank(
    &self,
    req: RerankRequest,
) -> Result<RerankResult, InferenceError>
Rerank candidate documents against a query using a cross-encoder reranker model (Qwen3-Reranker family). Returns documents sorted by descending relevance.
Scoring
Qwen3-Reranker is a Qwen3 base LM fine-tuned so that the first
assistant token is "yes" or "no" given the templated
<Instruct>/<Query>/<Document> user turn. We run a short
greedy decode (≤ 3 tokens, so a leading space, BOS artifact, or
the occasional newline don’t break us) and score
yes → 1.0, no → 0.0, anything else → 0.5 with a warning.
This is a binary score — the soft probability
softmax(logit_yes, logit_no) would give finer ordering but
requires per-token logit access on [backend::MlxBackend],
which isn’t exposed publicly yet. Tracked as a follow-up;
binary scores still produce a correct partial ordering, just
with coarser tiebreaks within the {yes} or {no} groups.
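The yes/no mapping described above, sketched as a standalone function (illustrative only; the actual decode and any whitespace/BOS trimming live inside the engine):
fn score_reranker_token(first_token: &str) -> f32 {
    match first_token.trim().to_ascii_lowercase().as_str() {
        "yes" => 1.0,
        "no" => 0.0,
        other => {
            eprintln!("unexpected reranker output {other:?}; falling back to 0.5");
            0.5
        }
    }
}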
Prompt template
We emit the upstream Qwen3-Reranker chat template verbatim:
a dedicated system prompt fixing the yes/no answer space,
then the user turn with <Instruct>/<Query>/<Document>, then
the assistant prefix with a closed empty <think> block to
suppress thinking (reranker is not a reasoner — it’s a
classifier). Deviating from this template produces sharply
degraded yes/no distributions.
pub async fn ground(
    &self,
    req: GroundRequest,
) -> Result<GroundResult, InferenceError>
Dedicated endpoint for structured visual grounding.
Runs a VL generate call under the hood and parses Qwen2.5-VL’s
inline <|object_ref_*|>...<|box_*|>(x1,y1),(x2,y2) spans into
typed BoundingBoxes. Distinct from the generic
InferenceEngine::generate + InferenceResult.bounding_boxes
path so callers can express “I want boxes” as a first-class
intent — which also lets the router prefer models that declare
the Grounding capability.
pub async fn classify(
    &self,
    req: ClassifyRequest,
) -> Result<Vec<ClassifyResult>, InferenceError>
Classify text against candidate labels.
When req.model is None, routes to the smallest available model.
pub async fn transcribe(
    &self,
    req: TranscribeRequest,
) -> Result<TranscribeResult, InferenceError>
Transcribe an audio file using the best available STT model.
pub async fn synthesize(
    &self,
    req: SynthesizeRequest,
) -> Result<SynthesizeResult, InferenceError>
Synthesize speech using the best available TTS model.
pub async fn generate_image(
    &self,
    req: GenerateImageRequest,
) -> Result<GenerateImageResult, InferenceError>
Generate an image using the best available local MLX image model.
pub async fn generate_image_batch(
    &self,
    req: GenerateImageRequest,
) -> Result<Vec<GenerateImageResult>, InferenceError>
Generate one or more variants in a single call.
Returns req.variant_count results (defaulting to 1). The
current MLX Flux backend doesn’t support native batching, so
this loops over generate_image with the seed advanced per
variant for visual diversity. A future hosted backend
(gpt-image-2, Replicate) can short-circuit this with one
network call producing N coherent images.
Per-variant errors abort the batch — there’s no partial-
success semantics today. Callers needing more lenient
behaviour should call generate_image directly in their own
loop.
Closes #110.
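A lenient per-variant loop along the lines suggested above (a sketch; it assumes GenerateImageRequest implements Clone and leaves the seed untouched, since the request's field layout isn't shown on this page):
async fn generate_variants_lenient(
    engine: &InferenceEngine,
    req: &GenerateImageRequest,
    variants: usize,
) -> Vec<Result<GenerateImageResult, InferenceError>> {
    let mut results = Vec::with_capacity(variants);
    for _ in 0..variants {
        // Each variant succeeds or fails independently, unlike generate_image_batch.
        results.push(engine.generate_image(req.clone()).await);
    }
    results
}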
pub async fn generate_video(
    &self,
    req: GenerateVideoRequest,
) -> Result<GenerateVideoResult, InferenceError>
Generate a video using the best available local MLX video model.
pub fn list_models_unified(&self) -> Vec<ModelInfo>
List all known models and their status (new registry).
pub fn available_model_upgrades(&self) -> Vec<ModelUpgrade>
Report installed models that have curated newer replacements.
pub fn list_schemas(&self) -> Vec<ModelSchema>
List all model schemas from the unified registry (full metadata).
pub fn list_models(&self) -> Vec<ModelInfo>
List all known models and their download status (legacy).
pub async fn pull_model(&self, name: &str) -> Result<PathBuf, InferenceError>
Download a model if not already present.
pub fn remove_model(&self, name: &str) -> Result<(), InferenceError>
Remove a downloaded model.
pub fn register_model(&mut self, schema: ModelSchema)
Register a model in the unified registry.
pub async fn discover_vllm_mlx_models(&mut self) -> usize
Discover generic MLX models from a running vLLM-MLX server and register them. Returns the number of discovered models added or refreshed in the registry.
pub fn outcome_tracker(&self) -> Arc<RwLock<OutcomeTracker>>
Get outcome tracker for external use (e.g., memgine integration).
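For example, taking a read lock to inspect outcomes (this assumes the RwLock is tokio::sync::RwLock, which the async API suggests but this page does not confirm):
async fn inspect_outcomes(engine: &InferenceEngine) {
    let tracker = engine.outcome_tracker();
    let guard = tracker.read().await;
    // ... query the OutcomeTracker through `guard` here
    drop(guard);
}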
pub async fn save_outcomes(&self) -> Result<(), Error>
Persist outcome profiles to disk for cross-session learning (#13).
pub async fn save_key_pool_stats(&self) -> Result<(), Error>
Persist key pool stats to disk.
pub async fn key_pool_stats(&self) -> HashMap<String, Vec<KeyStats>>
Get key pool stats for all endpoints.
pub async fn export_profiles(&self) -> Vec<ModelProfile>
Export model performance profiles for persistence.
pub async fn import_profiles(&self, profiles: Vec<ModelProfile>)
Import model performance profiles (from persistence).
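Together with export_profiles, this supports a simple hand-off of learned profiles between engine instances, for example:
async fn carry_over_profiles(old_engine: &InferenceEngine, new_engine: &InferenceEngine) {
    // Export from the outgoing engine, import into the fresh one.
    let profiles: Vec<ModelProfile> = old_engine.export_profiles().await;
    new_engine.import_profiles(profiles).await;
}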
pub async fn prepare_speech_runtime(&self) -> Result<PathBuf, InferenceError>
Ensure the managed local speech runtime exists and return its root directory. On Apple Silicon, speech uses native MLX backends and no Python runtime is needed.
pub fn set_speech_policy(&mut self, policy: SpeechPolicy)
Override speech routing preferences for the current engine instance.
pub fn set_routing_config(&mut self, config: RoutingConfig)
pub async fn install_curated_speech(
    &mut self,
) -> Result<Vec<SpeechInstallReport>, InferenceError>
Download the curated local speech model set into the shared Hugging Face cache.
pub fn speech_health(&self) -> SpeechHealthReport
Report speech runtime, model cache, and remote-provider health.
pub async fn model_health(&self) -> ModelHealthReport
Report the current model catalog, configured defaults, capability coverage, and speech runtime/provider health in one place.
pub async fn smoke_test_speech(
    &self,
    local: bool,
    remote: bool,
) -> Result<SpeechSmokeReport, InferenceError>
Run a real speech smoke test through the configured local and/or remote paths.
Trait Implementations
impl InferenceHandle for InferenceEngine
fn generate<'life0, 'async_trait>(
    &'life0 self,
    req: GenerateRequest,
) -> Pin<Box<dyn Future<Output = Result<String, InferenceError>> + Send + 'async_trait>>
where
    Self: 'async_trait,
    'life0: 'async_trait,
Delegates to InferenceEngine::generate: the caller passes a GenerateRequest (which may carry an explicit model, a routing hint, tools, or a thinking budget) and receives the final text or an InferenceError.
fn embed<'life0, 'async_trait>(
    &'life0 self,
    req: EmbedRequest,
) -> Pin<Box<dyn Future<Output = Result<Vec<Vec<f32>>, InferenceError>> + Send + 'async_trait>>
where
    Self: 'async_trait,
    'life0: 'async_trait,
Delegates to InferenceEngine::embed: returns one Vec<f32> per input text, in the same order.
Auto Trait Implementations
impl !Freeze for InferenceEngine
impl !RefUnwindSafe for InferenceEngine
impl Send for InferenceEngine
impl Sync for InferenceEngine
impl Unpin for InferenceEngine
impl UnsafeUnpin for InferenceEngine
impl !UnwindSafe for InferenceEngine
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true; otherwise converts self into a Right variant.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true; otherwise converts self into a Right variant.