pub struct InferenceEngine {
pub config: InferenceConfig,
pub unified_registry: UnifiedRegistry,
pub adaptive_router: AdaptiveRouter,
pub outcome_tracker: Arc<RwLock<OutcomeTracker>>,
pub registry: ModelRegistry,
pub router: ModelRouter,
/* private fields */
}
The main inference engine. Thread-safe, lazily loads models.
Now includes the unified registry, adaptive router, and outcome tracker for schema-driven model selection with learned performance profiles.
Fields
config: InferenceConfig
unified_registry: UnifiedRegistry
Unified model registry (local + remote).
adaptive_router: AdaptiveRouter
Adaptive router with three-phase selection.
outcome_tracker: Arc<RwLock<OutcomeTracker>>
Outcome tracker for learning from results.
registry: ModelRegistry
router: ModelRouter
Implementations
impl InferenceEngine
pub fn new(config: InferenceConfig) -> Self
pub async fn init_key_pool(&self)
Initialize key pool: register keys from all remote models and load persisted stats. Call this after construction (requires async).
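For example, a minimal construction sequence might look like the sketch below (it assumes an InferenceConfig value is obtained elsewhere; how that config is built is not shown on this page):
async fn build_engine(config: InferenceConfig) -> InferenceEngine {
    let engine = InferenceEngine::new(config);
    // Registers keys from all remote models and loads persisted key-pool stats.
    engine.init_key_pool().await;
    engine
}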
pub async fn warm_up<S: AsRef<str>>(
    &self,
    _schema_ids: &[S],
) -> Vec<Result<(), InferenceError>>
No-op on non-macOS — MLX doesn’t run here.
pub async fn route_adaptive(&self, prompt: &str) -> AdaptiveRoutingDecision
Route a prompt using the adaptive router (new). Returns full decision context.
pub fn route(&self, prompt: &str) -> RoutingDecision
Route a prompt to the best model without executing (legacy compat).
pub fn estimated_tokens(
    &self,
    req: &GenerateRequest,
    model_id: Option<&str>,
) -> (usize, usize, bool)
Estimate token count for a request against a specific model’s context window. Returns (estimated_input_tokens, context_window_tokens, fits).
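A small pre-flight helper built on this call (a sketch; the GenerateRequest is assumed to be constructed elsewhere, so no request fields are touched here):
fn fits_in_context(engine: &InferenceEngine, req: &GenerateRequest, model_id: &str) -> bool {
    // Tuple order follows the documented return: (input tokens, context window, fits).
    let (input_tokens, context_window, fits) = engine.estimated_tokens(req, Some(model_id));
    if !fits {
        eprintln!("prompt is ~{input_tokens} tokens; {model_id} has a {context_window}-token window");
    }
    fits
}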
pub async fn generate_tracked(
    &self,
    req: GenerateRequest,
) -> Result<InferenceResult, InferenceError>
Generate text from a prompt with outcome tracking.
Returns InferenceResult with trace_id for reporting outcomes.
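A hedged usage sketch (reading a trace_id from InferenceResult is an assumption based on the description above, not a confirmed field name, so it only appears in a comment):
async fn tracked_generate(
    engine: &InferenceEngine,
    req: GenerateRequest,
) -> Result<InferenceResult, InferenceError> {
    let result = engine.generate_tracked(req).await?;
    // The result's trace_id (assumed) links this call to a later outcome report
    // via the OutcomeTracker; persist learned profiles when convenient.
    let _ = engine.save_outcomes().await;
    Ok(result)
}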
pub async fn generate_tracked_stream(
    &self,
    req: GenerateRequest,
) -> Result<Receiver<StreamEvent>, InferenceError>
Stream a response with real-time token output.
Returns a channel receiver yielding StreamEvent variants:
- TextDelta(String) — partial text token
- ToolCallStart { name, index, id } — tool call begins
- ToolCallDelta { index, arguments_delta } — partial tool arguments
Use StreamAccumulator to collect events into a final result.
Local streaming behavior
Local models (both MLX and Candle backends) emit true incremental TextDelta
events as each token is generated. This enables:
- Token-by-token UI updates
- Overlapping speech synthesis with generation (for voice apps)
- Early cancellation when a stop sequence is detected
The channel capacity is 64 events, providing buffering for burst tokens without blocking the generation loop.
Example: voice app integration
let mut rx = engine.generate_tracked_stream(req).await?;
let mut text_buf = String::new();
while let Some(event) = rx.recv().await {
match event {
StreamEvent::TextDelta(delta) => {
text_buf.push_str(&delta);
// Feed text_buf to TTS when a sentence boundary is reached
}
StreamEvent::Done { text, .. } => break,
_ => {}
}
}
pub async fn route_context_snapshot(
    &self,
    prompt: &str,
    workload: RoutingWorkload,
    has_tools: bool,
    has_vision: bool,
) -> AdaptiveRoutingDecision
Route a prompt using the adaptive router without executing inference.
pub async fn generate(
    &self,
    req: GenerateRequest,
) -> Result<String, InferenceError>
Generate text from a prompt (legacy API, no outcome tracking).
When req.model is None, uses intelligent routing based on prompt complexity.
pub async fn tokenize(
    &self,
    model: &str,
    text: &str,
) -> Result<Vec<u32>, InferenceError>
Encode text via the named model’s tokenizer. Returns raw token IDs
without any chat-template wrapping or BOS-prepending — pair with
Self::detokenize for the round-trip property
detokenize(model, tokenize(model, s)) == s for any UTF-8 s.
Only local models have a tokenizer the runtime can call directly
(Candle/GGUF on Linux/Windows, MLX on Apple Silicon). For remote
models the call returns
InferenceError::UnsupportedMode — provider tokenizer endpoints
vary too widely to be portable here, and bundling tiktoken-style
tables would lock the registry to a fixed set of providers.
pub async fn detokenize(
    &self,
    model: &str,
    tokens: &[u32],
) -> Result<String, InferenceError>
Inverse of Self::tokenize: decode token IDs back to text.
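A round-trip sanity check built only from the two calls documented here (works for local models; remote models return InferenceError::UnsupportedMode as noted above):
async fn tokenizer_roundtrip(
    engine: &InferenceEngine,
    model: &str,
    text: &str,
) -> Result<bool, InferenceError> {
    let ids: Vec<u32> = engine.tokenize(model, text).await?;
    let decoded = engine.detokenize(model, &ids).await?;
    // Should hold for any UTF-8 input per the documented property.
    Ok(decoded == text)
}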
pub async fn embed(
    &self,
    req: EmbedRequest,
) -> Result<Vec<Vec<f32>>, InferenceError>
Generate embeddings for text using the dedicated embedding model. On Apple Silicon, uses the native MLX backend; on other platforms, uses Candle.
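A minimal shape-checking sketch (the EmbedRequest is constructed elsewhere; its fields are not assumed here):
async fn embedding_dims(
    engine: &InferenceEngine,
    req: EmbedRequest,
) -> Result<Vec<usize>, InferenceError> {
    // One Vec<f32> per input text, in the same order as the request.
    let vectors = engine.embed(req).await?;
    Ok(vectors.iter().map(|v| v.len()).collect())
}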
pub async fn rerank(
    &self,
    req: RerankRequest,
) -> Result<RerankResult, InferenceError>
Rerank candidate documents against a query using a cross-encoder reranker model (Qwen3-Reranker family). Returns documents sorted by descending relevance.
Scoring
Qwen3-Reranker is a Qwen3 base LM fine-tuned so that the first
assistant token is "yes" or "no" given the templated
<Instruct>/<Query>/<Document> user turn. We run a short
greedy decode (≤ 3 tokens, so a leading space, BOS artifact, or
the occasional newline don’t break us) and score
yes → 1.0, no → 0.0, anything else → 0.5 with a warning.
This is a binary score — the soft probability
softmax(logit_yes, logit_no) would give finer ordering but
requires per-token logit access on [backend::MlxBackend],
which isn’t exposed publicly yet. Tracked as a follow-up;
binary scores still produce a correct partial ordering, just
with coarser tiebreaks within the {yes} or {no} groups.
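The yes/no mapping described above, sketched as a standalone function (illustrative only; the actual decode and any whitespace/BOS trimming live inside the engine):
fn score_reranker_token(first_token: &str) -> f32 {
    match first_token.trim().to_ascii_lowercase().as_str() {
        "yes" => 1.0,
        "no" => 0.0,
        other => {
            eprintln!("unexpected reranker output {other:?}; falling back to 0.5");
            0.5
        }
    }
}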
Prompt template
We emit the upstream Qwen3-Reranker chat template verbatim:
a dedicated system prompt fixing the yes/no answer space,
then the user turn with <Instruct>/<Query>/<Document>, then
the assistant prefix with a closed empty <think> block to
suppress thinking (reranker is not a reasoner — it’s a
classifier). Deviating from this template produces sharply
degraded yes/no distributions.
pub async fn ground(
    &self,
    req: GroundRequest,
) -> Result<GroundResult, InferenceError>
Dedicated endpoint for structured visual grounding.
Runs a VL generate call under the hood and parses Qwen2.5-VL’s
inline <|object_ref_*|>...<|box_*|>(x1,y1),(x2,y2) spans into
typed BoundingBoxes. Distinct from the generic
InferenceEngine::generate + InferenceResult.bounding_boxes
path so callers can express “I want boxes” as a first-class
intent — which also lets the router prefer models that declare
the Grounding capability.
pub async fn classify(
    &self,
    req: ClassifyRequest,
) -> Result<Vec<ClassifyResult>, InferenceError>
Classify text against candidate labels.
When req.model is None, routes to the smallest available model.
pub async fn transcribe(
    &self,
    req: TranscribeRequest,
) -> Result<TranscribeResult, InferenceError>
Transcribe an audio file using the best available STT model.
pub async fn synthesize(
    &self,
    req: SynthesizeRequest,
) -> Result<SynthesizeResult, InferenceError>
Synthesize speech using the best available TTS model.
pub async fn generate_image(
    &self,
    req: GenerateImageRequest,
) -> Result<GenerateImageResult, InferenceError>
Generate an image using the best available local MLX image model.
pub async fn generate_image_batch(
    &self,
    req: GenerateImageRequest,
) -> Result<Vec<GenerateImageResult>, InferenceError>
Generate one or more variants in a single call.
Returns req.variant_count results (defaulting to 1). The
current MLX Flux backend doesn’t support native batching, so
this loops over generate_image with the seed advanced per
variant for visual diversity. A future hosted backend
(gpt-image-2, Replicate) can short-circuit this with one
network call producing N coherent images.
Per-variant errors abort the batch — there’s no partial-
success semantics today. Callers needing more lenient
behaviour should call generate_image directly in their own
loop.
Closes #110.
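A lenient per-variant loop along the lines suggested above (a sketch; it assumes GenerateImageRequest implements Clone and leaves the seed untouched, since the request's field layout isn't shown on this page):
async fn generate_variants_lenient(
    engine: &InferenceEngine,
    req: &GenerateImageRequest,
    variants: usize,
) -> Vec<Result<GenerateImageResult, InferenceError>> {
    let mut results = Vec::with_capacity(variants);
    for _ in 0..variants {
        // Each variant succeeds or fails independently, unlike generate_image_batch.
        results.push(engine.generate_image(req.clone()).await);
    }
    results
}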
pub async fn generate_video(
    &self,
    req: GenerateVideoRequest,
) -> Result<GenerateVideoResult, InferenceError>
Generate a video using the best available local MLX video model.
pub fn list_models_unified(&self) -> Vec<ModelInfo>
List all known models and their status (new registry).
pub fn available_model_upgrades(&self) -> Vec<ModelUpgrade>
Report installed models that have curated newer replacements.
pub fn list_schemas(&self) -> Vec<ModelSchema>
List all model schemas from the unified registry (full metadata).
pub fn list_models(&self) -> Vec<ModelInfo>
List all known models and their download status (legacy).
pub async fn pull_model(&self, name: &str) -> Result<PathBuf, InferenceError>
Download a model if not already present.
pub fn remove_model(&self, name: &str) -> Result<(), InferenceError>
Remove a downloaded model.
pub fn register_model(&mut self, schema: ModelSchema)
Register a model in the unified registry.
pub async fn discover_vllm_mlx_models(&mut self) -> usize
Discover generic MLX models from a running vLLM-MLX server and register them. Returns the number of discovered models added or refreshed in the registry.
pub fn outcome_tracker(&self) -> Arc<RwLock<OutcomeTracker>>
Get outcome tracker for external use (e.g., memgine integration).
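For example, taking a read lock to inspect outcomes (this assumes the RwLock is tokio::sync::RwLock, which the async API suggests but this page does not confirm):
async fn inspect_outcomes(engine: &InferenceEngine) {
    let tracker = engine.outcome_tracker();
    let guard = tracker.read().await;
    // ... query the OutcomeTracker through `guard` here
    drop(guard);
}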
pub async fn save_outcomes(&self) -> Result<(), Error>
Persist outcome profiles to disk for cross-session learning (#13).
pub async fn save_key_pool_stats(&self) -> Result<(), Error>
Persist key pool stats to disk.
pub async fn key_pool_stats(&self) -> HashMap<String, Vec<KeyStats>>
Get key pool stats for all endpoints.
pub async fn export_profiles(&self) -> Vec<ModelProfile>
Export model performance profiles for persistence.
pub async fn import_profiles(&self, profiles: Vec<ModelProfile>)
Import model performance profiles (from persistence).
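Together with export_profiles, this supports a simple hand-off of learned profiles between engine instances, for example:
async fn carry_over_profiles(old_engine: &InferenceEngine, new_engine: &InferenceEngine) {
    // Export from the outgoing engine, import into the fresh one.
    let profiles: Vec<ModelProfile> = old_engine.export_profiles().await;
    new_engine.import_profiles(profiles).await;
}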
pub async fn prepare_speech_runtime(&self) -> Result<PathBuf, InferenceError>
Ensure the managed local speech runtime exists and return its root directory. On Apple Silicon, speech uses native MLX backends and no Python runtime is needed.
pub fn set_speech_policy(&mut self, policy: SpeechPolicy)
Override speech routing preferences for the current engine instance.
pub fn set_routing_config(&mut self, config: RoutingConfig)
pub async fn install_curated_speech(
    &mut self,
) -> Result<Vec<SpeechInstallReport>, InferenceError>
Download the curated local speech model set into the shared Hugging Face cache.
pub fn speech_health(&self) -> SpeechHealthReport
Report speech runtime, model cache, and remote-provider health.
pub async fn model_health(&self) -> ModelHealthReport
Report the current model catalog, configured defaults, capability coverage, and speech runtime/provider health in one place.
pub async fn smoke_test_speech(
    &self,
    local: bool,
    remote: bool,
) -> Result<SpeechSmokeReport, InferenceError>
Run a real speech smoke test through the configured local and/or remote paths.
Trait Implementations
impl InferenceHandle for InferenceEngine
fn generate<'life0, 'async_trait>(
    &'life0 self,
    req: GenerateRequest,
) -> Pin<Box<dyn Future<Output = Result<String, InferenceError>> + Send + 'async_trait>>
where
    Self: 'async_trait,
    'life0: 'async_trait,
Delegates to InferenceEngine::generate: the caller passes a GenerateRequest (which may carry an explicit model, a routing hint, tools, or a thinking budget) and receives the final text or an InferenceError.
fn embed<'life0, 'async_trait>(
    &'life0 self,
    req: EmbedRequest,
) -> Pin<Box<dyn Future<Output = Result<Vec<Vec<f32>>, InferenceError>> + Send + 'async_trait>>
where
    Self: 'async_trait,
    'life0: 'async_trait,
Delegates to InferenceEngine::embed: returns one Vec<f32> per input text, in the same order.
Auto Trait Implementations
impl !Freeze for InferenceEngine
impl !RefUnwindSafe for InferenceEngine
impl Send for InferenceEngine
impl Sync for InferenceEngine
impl Unpin for InferenceEngine
impl UnsafeUnpin for InferenceEngine
impl !UnwindSafe for InferenceEngine
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true; otherwise converts self into a Right variant.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true; otherwise converts self into a Right variant.