# inference-runtime-openai
OpenAI Chat Completions + Azure OpenAI runtime. ~600 LOC; everything heavy is in
`inference-remote-core`.
## Highlights
- Both variants in one config. `OpenAiVariant::Direct { endpoint }` for api.openai.com (or any OpenAI-compatible proxy); `OpenAiVariant::Azure { resource, deployment, api_version }` for Azure OpenAI's URL shape.
- SSE streaming first-class. `RunHandle::streaming(...)` over `reqwest::Response::bytes_stream` → `eventsource-stream` → `TokenChunk`s.
- Provider-specific error refinement. A 400 with `error.code == "context_length_exceeded"` is upgraded to `InferenceError::ContextLengthExceeded`; a 400 of type `content_filter` is upgraded to `InferenceError::ContentFiltered` and skips retries (see the sketch after this list).
- Pricing baked in. `OpenAiPricing::published()` carries the current list rates for `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `o1-preview`, and `o1-mini`. Operators override per deployment.
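A minimal sketch of the refinement rule, assuming hypothetical error-body and `InferenceError` shapes (the real types live in `inference-remote-core`):

```rust
use serde::Deserialize;

// Assumed shape of the OpenAI error body; field names follow the wire
// format (`error.code`, `error.type`), not necessarily the crate's types.
#[derive(Deserialize)]
struct ApiErrorBody {
    code: Option<String>,
    #[serde(rename = "type")]
    kind: Option<String>,
}

enum InferenceError {
    ContextLengthExceeded,
    ContentFiltered, // terminal: skips retries
    Http(u16),
}

fn refine(status: u16, body: &ApiErrorBody) -> InferenceError {
    match (status, body.code.as_deref(), body.kind.as_deref()) {
        (400, Some("context_length_exceeded"), _) => InferenceError::ContextLengthExceeded,
        (400, _, Some("content_filter")) => InferenceError::ContentFiltered,
        (s, _, _) => InferenceError::Http(s),
    }
}
```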
## Quick start
```rust
// Paths and signatures reconstructed; the original snippet was mangled.
use inference_runtime_openai::{OpenAiConfig, OpenAiRunner};
use inference_remote_core::secrets::SecretRef;

let cfg = OpenAiConfig::defaults_for_openai(SecretRef::env("OPENAI_API_KEY")); // SecretRef::env is assumed
let runner = OpenAiRunner::new(cfg, session_snapshot)?; // session_snapshot: see below
```
`session_snapshot: Arc<ArcSwap<SessionSnapshot>>` comes from
`inference_remote_core::session::RemoteSessionActor::bootstrap(...)`.
The double indirection (`Arc<ArcSwap<…>>`) means rotating the API key
swaps the snapshot in place: in-flight requests complete with the
old credential, new ones use the rotated value, and no traffic is
dropped.
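A sketch of that rotation pattern with the `arc_swap` crate; `SessionSnapshot`'s fields here are hypothetical stand-ins:

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;

// Hypothetical stand-in; the real SessionSnapshot lives in inference-remote-core.
struct SessionSnapshot {
    api_key: String,
}

// Rotation: store() publishes a new snapshot atomically. Readers that
// already loaded the old Arc keep it alive until they drop it, so
// in-flight requests finish on the old credential.
fn rotate(slot: &Arc<ArcSwap<SessionSnapshot>>, new_key: String) {
    slot.store(Arc::new(SessionSnapshot { api_key: new_key }));
}

// Each new request pins the current snapshot for its own lifetime.
fn begin_request(slot: &Arc<ArcSwap<SessionSnapshot>>) -> Arc<SessionSnapshot> {
    slot.load_full()
}
```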
## Azure example
```rust
use inference_runtime_openai::{OpenAiConfig, OpenAiVariant}; // paths assumed

// Placeholder values; non-variant fields were lost in extraction.
let cfg = OpenAiConfig {
    variant: OpenAiVariant::Azure {
        resource: "my-resource".into(),
        deployment: "gpt-4o-prod".into(),
        api_version: "2024-06-01".into(),
    },
    ..OpenAiConfig::default()
};
```
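With those values the runtime targets Azure OpenAI's URL shape, which differs from api.openai.com: `https://{resource}.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version={api_version}`.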
## Cost estimation
```rust
use inference_runtime_openai::OpenAiPricing; // path assumed

// The lookup key is assumed to be the model name.
let p = OpenAiPricing::published().get("gpt-4o").unwrap();
// p.input_per_mtok_usd, p.output_per_mtok_usd
```
The rollup's `inference::core::cost::from_rates(...)` lifts these into
a `CostEstimate` per `ExecuteBatch`. `MetricsActor` aggregates the
real numbers reported by the API in response headers / usage fields,
so dashboards see live spend, not just predictions.
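Since the rates are USD per million tokens, the per-batch arithmetic reduces to the sketch below; the helper name is hypothetical, and the token counts stand in for the usage fields mentioned above:

```rust
/// Hypothetical helper mirroring what from_rates(...) must compute:
/// rates are USD per million tokens, split by direction.
fn estimate_usd(
    input_tokens: u64,
    output_tokens: u64,
    input_per_mtok_usd: f64,
    output_per_mtok_usd: f64,
) -> f64 {
    (input_tokens as f64 / 1_000_000.0) * input_per_mtok_usd
        + (output_tokens as f64 / 1_000_000.0) * output_per_mtok_usd
}
```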