car-inference 0.32.1

Local model inference for CAR — Candle backend with Qwen3 models
# car-inference

Local and remote model inference for the [Common Agent Runtime](https://github.com/Parslee-ai/car).

## What it does

Provides on-device inference using Candle (Metal on macOS, CUDA on Linux) with Qwen3 models
downloaded from HuggingFace on first use. Also supports remote APIs (OpenAI, Anthropic, Google)
via the same typed `ModelSchema` interface. The `AdaptiveRouter` selects the best model using
a filter-score-explore strategy, learning from outcomes over time via `OutcomeTracker`.

## Usage

```rust
use car_inference::{InferenceEngine, InferenceConfig, GenerateRequest, GenerateParams};

let engine = InferenceEngine::new(InferenceConfig::default());
let result = engine.generate(GenerateRequest {
    prompt: "Explain quicksort".into(),
    params: GenerateParams::default(),
    ..Default::default()
}).await?;
```

## Apple FoundationModels backend (macOS 26+)

On Apple Silicon Macs running macOS 26 or later with Apple Intelligence
provisioned, CAR can route inference to the **system LLM** through
Apple's FoundationModels framework — no model file to download, no API
key, no model weights. The OS owns everything.

The integration is a small Swift shim
(`car-inference/swift/CarFoundationModels.swift`) compiled by
`build.rs` and linked into the crate. The framework is **weak-linked**
so the produced binary still loads on pre-26 macOS; a runtime
availability check (`is_available()`, cached for 5 seconds) gates
calls.

```rust
// Routing happens automatically — request the apple/foundation:default
// model id, or let the AdaptiveRouter pick it for you.
let result = engine.generate(GenerateRequest {
    prompt: "summarize this in one sentence: ...".into(),
    model_id: Some("apple/foundation:default".into()),
    ..Default::default()
}).await?;
```

The catalog entry is tagged `["builtin", "local", "low_latency", "private"]`.
The adaptive router scores `low_latency` AND `private` together via
`SYSTEM_LLM_BONUS` (0.12) so system-owned models compete fairly with
MLX 4B for short fast-turn tasks (autocomplete, summarize, classify)
without claiming heavy reasoning workloads they can't serve.

**What's wired:**
- Single-shot text generation via `generate()`
- Token-by-token streaming via `stream()` with prefix-diffing on
  Apple's cumulative snapshots — `StreamEvent::Done.text` carries the
  full assembled output, matching Candle/MLX shape.
- Tool calling via `generate_with_tools()` — JSON-Schema tool
  definitions are converted to per-tool `DynamicGenerationSchema`s so
  the model sees real schemas. A capture-only Swift `Tool` records the
  first invocation and ends the turn; the call comes back as the
  standard `ToolCall` shape the remote backends emit and the runtime
  executes it (CAR's propose/validate/execute contract). At most one
  tool call per turn — sequential tool use, no parallel calls; the
  catalog entry claims `tool_use` but deliberately NOT
  `multi_tool_call`, the router-readable "no parallel calls" signal.
  When a call is captured, `text` is empty — pre-call assistant prose
  is discarded.
- Structured output via `generate_structured()`  `ResponseFormat::JsonSchema` maps onto FM's native constrained
  decoding (`respond(to:schema:)`). `JsonObject` has no native FM
  mode and falls back to instruction injection with a warning.
- Graceful fallthrough on vision / audio / video — returns
  `InferenceError::UnsupportedMode` (streaming included) so the router
  picks the next candidate instead of silently dropping capabilities.

**What isn't:**
- Multimodal input — the public FoundationModels API is text-only.
- Full JSON-Schema fidelity: the `DynamicGenerationSchema` conversion
  covers object/string/number/integer/boolean/array and string enums;
  typeless nodes (oneOf/anyOf/$ref), union types (`["number","null"]`),
  unrecognized types, and non-string enums degrade to permissive
  string fields (never to an empty object), and numeric enums keep the
  base type but lose the value constraint. Every degradation is
  detected on the Rust side and logged via `tracing::warn!` before
  crossing the FFI.

**Build requirements:**
- Full Xcode (not just Command Line Tools) on the build host —
  needed for `xcrun swiftc` and the FoundationModels SDK.
- macOS 15+ deployment floor (overridable via `MACOSX_DEPLOYMENT_TARGET`).

Linux and Intel-Mac builds skip the Swift compile entirely; the
`ModelSource::AppleFoundationModels` schema variant still serializes
(so registries can describe the model on any platform), but dispatch
errors out with `UnsupportedMode` before reaching the bridge.

## Crate features

- `metal` -- Apple Silicon GPU acceleration
- `cuda` -- NVIDIA GPU acceleration
- `ast` -- AST-aware code generation via `car-ast`

Part of [CAR](https://github.com/Parslee-ai/car) -- see the main repo for full documentation.