Crate car_inference

§car-inference

Local model inference for the Common Agent Runtime.

Provides on-device inference using Candle with automatic hardware detection (see the sketch after this list):

  • macOS: Metal (Apple Silicon GPU)
  • Linux: CUDA (NVIDIA GPU) or CPU fallback
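
A rough sketch of what device selection might look like against the exported types. `HardwareInfo::detect`, the `has_*` probes, and the `Device` variant names are assumptions for illustration, not confirmed API; only the `HardwareInfo` and `Device` names come from this crate's exports.

```rust
use car_inference::{Device, HardwareInfo};

fn pick_device() -> Device {
    // Hypothetical probe of the host machine.
    let hw = HardwareInfo::detect();

    // Fallback chain mirroring the list above: Metal on Apple Silicon,
    // CUDA on NVIDIA Linux boxes, CPU otherwise. Variant names assumed.
    if hw.has_metal() {
        Device::Metal
    } else if hw.has_cuda() {
        Device::Cuda
    } else {
        Device::Cpu
    }
}
```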

Ships with Qwen3 models, downloaded from Hugging Face on first use. Remote API models (OpenAI, Anthropic, Google) are supported through the same schema.
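
For illustration, a hedged sketch of how one `ModelSchema` type might describe both a local Qwen3 model and a remote API model. The field and variant names below, and the `Default` impl, are assumptions; only `ModelSchema`, `ModelSource`, and `ApiProtocol` are taken from the re-exports.

```rust
use car_inference::{ApiProtocol, ModelSchema, ModelSource};

// Hypothetical construction; field and variant names are illustrative.
fn schemas() -> (ModelSchema, ModelSchema) {
    // A local model pulled from Hugging Face on first use.
    let local = ModelSchema {
        id: "qwen3-4b".into(),
        source: ModelSource::Local { repo: "Qwen/Qwen3-4B".into() },
        ..Default::default()
    };
    // A remote model reached over an OpenAI-style API.
    let remote = ModelSchema {
        id: "gpt-4o".into(),
        source: ModelSource::Remote { protocol: ApiProtocol::OpenAi },
        ..Default::default()
    };
    (local, remote)
}
```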

§Architecture

Models are first-class typed resources described by ModelSchema (analogous to ToolSchema). The UnifiedRegistry holds local and remote models. The AdaptiveRouter selects the best model using a three-phase strategy: filter → score → explore. The OutcomeTracker learns from results to improve routing over time.
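
A minimal sketch of that flow from the caller's side. The method names (`route`, `record_outcome`) and the `IntentHint` argument shape are assumptions; only the type names appear in this crate's exports.

```rust
use car_inference::{
    AdaptiveRouter, AdaptiveRoutingDecision, IntentHint, OutcomeTracker, UnifiedRegistry,
};

// All method calls below are hypothetical; they illustrate the
// filter -> score -> explore loop, not the crate's confirmed API.
fn route_and_learn(
    registry: &UnifiedRegistry,
    router: &AdaptiveRouter,
    tracker: &mut OutcomeTracker,
    hint: &IntentHint,
) {
    // Phase 1 (filter): drop models that cannot satisfy the intent.
    // Phase 2 (score): rank survivors with learned performance profiles.
    // Phase 3 (explore): occasionally try a non-top model to keep
    // the profiles fresh.
    let decision: AdaptiveRoutingDecision = router.route(registry, hint);

    // ... run inference with the chosen model, then feed the result
    // back so future routing improves.
    tracker.record_outcome(&decision);
}
```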

§Dual purpose

  1. Internal — powers skill learning/repair, semantic memory, policy evaluation
  2. Service — exposes infer, embed, classify as built-in CAR tools (sketched below)
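
As an illustration of the service side, a hedged sketch of building a request for the built-in `infer` tool. The `Message::user` helper and the `Default` impl are assumptions; only `GenerateRequest` and `Message` come from the re-exports.

```rust
use car_inference::{GenerateRequest, Message};

// Hypothetical request for the built-in `infer` tool; field and helper
// names are illustrative, not the crate's confirmed API.
fn infer_request() -> GenerateRequest {
    GenerateRequest {
        messages: vec![Message::user("Summarize today's build failures.")],
        ..Default::default()
    }
}
```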

Re-exports§

pub use adaptive_router::AdaptiveRouter;
pub use adaptive_router::AdaptiveRoutingDecision;
pub use adaptive_router::RoutingConfig;
pub use adaptive_router::RoutingStrategy;
pub use handle::InferenceHandle;
pub use intent::IntentHint;
pub use intent::TaskHint;
pub use key_pool::KeyPool;
pub use key_pool::KeyStats;
pub use outcome::CodeOutcome;
pub use outcome::InferenceOutcome;
pub use outcome::InferenceTask;
pub use outcome::InferredOutcome;
pub use outcome::ModelProfile;
pub use outcome::OutcomeTracker;
pub use registry::ModelFilter;
pub use registry::ModelInfo;
pub use registry::ModelRuntimeRequirement;
pub use registry::ModelUpgrade;
pub use registry::UnifiedRegistry;
pub use remote::RemoteBackend;
pub use routing_ext::CircuitBreaker;
pub use routing_ext::CircuitBreakerRegistry;
pub use routing_ext::CircuitState;
pub use routing_ext::ImplicitSignal;
pub use routing_ext::ImplicitSignalType;
pub use routing_ext::RoutingMode;
pub use routing_ext::SpendControl;
pub use routing_ext::SpendLimitExceeded;
pub use routing_ext::SpendLimits;
pub use routing_ext::SpendStatus;
pub use runner::current_inference_runner;
pub use runner::set_inference_runner;
pub use runner::EventEmitter;
pub use runner::InferenceRunner;
pub use runner::RunnerError;
pub use runner::RunnerResult;
pub use schema::ApiProtocol;
pub use schema::BenchmarkScore;
pub use schema::CostModel;
pub use schema::ModelCapability;
pub use schema::ModelSchema;
pub use schema::ModelSource;
pub use schema::PerformanceEnvelope;
pub use schema::ProprietaryAuth;
pub use adaptive_router::TaskComplexity;
pub use backend::CandleBackend;
pub use backend::EmbeddingBackend;
pub use hardware::HardwareInfo;
pub use models::ModelRegistry;
pub use models::ModelRole;
pub use router::ModelRouter;
pub use router::RoutingDecision;
pub use stream::StreamAccumulator;
pub use stream::StreamEvent;
pub use tasks::parse_boxes;
pub use tasks::BoundingBox;
pub use tasks::ClassifyRequest;
pub use tasks::ClassifyResult;
pub use tasks::ContentBlock;
pub use tasks::EmbedRequest;
pub use tasks::GenerateImageRequest;
pub use tasks::GenerateImageResult;
pub use tasks::GenerateParams;
pub use tasks::GenerateRequest;
pub use tasks::GenerateVideoRequest;
pub use tasks::GenerateVideoResult;
pub use tasks::GroundRequest;
pub use tasks::GroundResult;
pub use tasks::Message;
pub use tasks::RerankRequest;
pub use tasks::RerankResult;
pub use tasks::RerankedDocument;
pub use tasks::RoutingWorkload;
pub use tasks::SynthesizeRequest;
pub use tasks::SynthesizeResult;
pub use tasks::ThinkingMode;
pub use tasks::ToolCall;
pub use tasks::TranscribeRequest;
pub use tasks::TranscribeResult;
pub use tasks::VideoMode;

Modules§

adaptive_router
Adaptive model routing — three-phase routing with learned performance profiles.
backend
Inference backends — Candle-based text generation and embedding backends.
backend_cache
LRU-evicting cache for loaded inference backends.
handle
InferenceHandle — minimal trait surface for embedders that hold a “thing that can do inference” without committing to the concrete InferenceEngine.
hardware
Hardware detection — auto-configure models and context based on system capabilities.
intent
Caller-facing routing intent — express requirements, not model IDs.
key_pool
API key pool — load-balanced multi-key management for remote endpoints.
models
Model registry — tracks available Qwen3 models, handles download-on-first-use.
outcome
Outcome tracking — learn from inference results to improve routing.
protocol
Protocol abstraction — unified interface for remote model API providers.
registry
Unified model registry — local and remote models under one schema.
remote
Remote inference backend — HTTP client for cloud API models.
router
Intelligent model routing — select the best model based on prompt characteristics.
routing_ext
Extended routing features — routing modes, circuit breaker, implicit feedback, spend control, and benchmark quality priors.
runner
Foreign-implemented inference runner.
schema
Model schema — declarative metadata for models, analogous to ToolSchema for tools.
service
Inference service — exposes inference as built-in CAR tools.
stream
Streaming inference — SSE parsing for real-time token output (see the sketch after this module list).
tasks
Task request/response types — generate, embed, classify, rerank, ground, transcribe, synthesize, and image/video generation.
vllm_mlx
vLLM-MLX local server integration.
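
Following up on the stream module above, a minimal sketch of accumulating streamed tokens. The accumulator's methods and its `Default` impl are assumptions; only the `StreamEvent` and `StreamAccumulator` names are exported here.

```rust
use car_inference::{StreamAccumulator, StreamEvent};

// Hypothetical consumption of a parsed SSE stream; method names are
// illustrative, not the crate's confirmed API.
fn collect_text(events: impl IntoIterator<Item = StreamEvent>) -> String {
    let mut acc = StreamAccumulator::default(); // hypothetical Default
    for event in events {
        acc.push(event); // fold each token/delta event into the buffer
    }
    acc.into_text() // hypothetical finalizer returning the full output
}
```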

Structs§

InferenceConfig
Configuration for the inference engine.
InferenceEngine
The main inference engine. Thread-safe, lazily loads models. See the end-to-end sketch after this list.
InferenceResult
Result of an inference call, including trace ID for outcome tracking.
ModelBenchmarkPriorHealth
ModelCapabilityHealth
ModelDefaultHealth
ModelHealthReport
ModelProviderHealth
RoutingScenarioHealth
SpeechHealthReport
SpeechInstallReport
SpeechModelHealth
SpeechPolicy
SpeechRuntimeHealth
SpeechSmokePathReport
SpeechSmokeReport
TokenUsage
Token usage statistics from a model response.
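
Putting the pieces above together, a hedged end-to-end sketch. The constructor, the `generate` method, and the `trace_id` and `usage` fields are assumptions; only the type names come from this page.

```rust
use car_inference::{GenerateRequest, InferenceConfig, InferenceEngine};

// Hypothetical end-to-end call; method and field names are assumptions.
async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let engine = InferenceEngine::new(InferenceConfig::default())?; // hypothetical ctor
    let result = engine.generate(GenerateRequest::default()).await?; // hypothetical method

    // `InferenceResult` carries a trace ID for outcome tracking, and
    // `TokenUsage` statistics come back with the response.
    println!("trace: {}", result.trace_id);
    println!("usage: {:?}", result.usage);
    Ok(())
}
```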

Enums§

Device
Which device to run inference on.
InferenceError
Errors that can occur during inference.