Crate inferd_engine

Backend trait and adapters for inferd.

See ADR 0005 (engine consumed via FFI), ADR 0007 (routing), and docs/ai.internals.explained.md for the architectural framing.

v0.1 ships:

mock — deterministic test double, always available.
llamacpp — FFI to vendored libllama (gated behind the llamacpp cargo feature; lands in M2a).
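The feature gating described above can be sketched roughly as follows. This is an illustration only: `compiled_backends` is a hypothetical helper and not part of the crate's documented API. Only the backend names `mock` and `llamacpp` and the `llamacpp` cargo feature come from the text above.

```rust
/// Hypothetical helper: lists the backends compiled into this build.
/// `mock` is unconditional; `llamacpp` appears only when the `llamacpp`
/// cargo feature is enabled.
pub fn compiled_backends() -> Vec<&'static str> {
    let mut names = vec!["mock"]; // always available
    if cfg!(feature = "llamacpp") {
        names.push("llamacpp"); // FFI adapter, feature-gated
    }
    names
}
```

In a default build, without the feature, the list would contain only `mock`.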

Modules

mock: Deterministic mock backend used by tests and by the daemon’s M1 echo milestone.

AcceleratorInfo: Snapshot of the active hardware-acceleration configuration.
BackendCapabilities: Per-backend capability advertisement. The daemon consults this on boot to decide whether v2 multimodal / tool-use requests can be dispatched, and reports the advertised set on the admin status surface so middleware authors can introspect what the running daemon can do without trial-and-error.
EmbedResult: Result of a successful Backend::embed() call.
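As a rough illustration of how a daemon might consult a capability advertisement before dispatching a request, consider the sketch below. The type and field names (`CapabilitiesSketch`, `multimodal`, `tool_use`, `RequestNeeds`, `can_dispatch`) are assumptions made for this example. The real fields of BackendCapabilities are not shown on this page.

```rust
/// Hypothetical stand-in for a capability advertisement.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct CapabilitiesSketch {
    pub multimodal: bool,
    pub tool_use: bool,
}

/// Hypothetical classification of what a request requires.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestNeeds {
    TextOnly,
    Multimodal,
    ToolUse,
}

/// Decide up front whether a request can be dispatched, instead of
/// discovering an unsupported feature by trial and error.
pub fn can_dispatch(caps: CapabilitiesSketch, needs: RequestNeeds) -> bool {
    match needs {
        RequestNeeds::TextOnly => true,
        RequestNeeds::Multimodal => caps.multimodal,
        RequestNeeds::ToolUse => caps.tool_use,
    }
}
```

A backend that advertises nothing beyond plain text (the `Default` value here) would still accept text-only requests and reject the rest.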

AcceleratorKind: Hardware-acceleration backend the engine adapter is built and running with. Reflects compile-time GGML feature flags. Pure CPU builds (no cuda / metal / vulkan / rocm features) report Cpu. A build with GPU support but with n_gpu_layers == 0 also effectively runs on the CPU at runtime; see AcceleratorInfo::gpu_layers.
EmbedError: Errors returned by Backend::embed().
GenerateError: Errors returned by Backend::generate() before any tokens have streamed.
TokenEvent: One event in a generation stream.
TokenEventV2: One event in a v2 generation stream — typed-content-block surface per ADR 0015.
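The distinction between the compiled-in accelerator and what is actually used at runtime, as described for AcceleratorKind, can be sketched as below. Only the `Cpu` variant and the `gpu_layers` field name come from the text above. The other variant names, the `u32` field type, and the `runs_on_cpu` method are assumptions for illustration.

```rust
/// Hypothetical mirror of the accelerator kind; only `Cpu` is confirmed
/// by the documentation, the other variant names are assumed from the
/// feature names cuda / metal / vulkan / rocm.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KindSketch {
    Cpu,
    Cuda,
    Metal,
    Vulkan,
    Rocm,
}

/// Hypothetical snapshot pairing the compiled-in kind with the number
/// of layers offloaded to the GPU (field type assumed).
#[derive(Debug, Clone, Copy)]
pub struct InfoSketch {
    pub kind: KindSketch,
    pub gpu_layers: u32,
}

impl InfoSketch {
    /// True when inference effectively runs on the CPU: either the build
    /// has no GPU backend, or it has one but offloads zero layers.
    pub fn runs_on_cpu(&self) -> bool {
        self.kind == KindSketch::Cpu || self.gpu_layers == 0
    }
}
```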

DEFAULT_V2_MAX_TOKENS: Default max_tokens for v2 requests when the consumer didn’t supply one. Lives here (rather than in inferd-proto) because v2 sampling defaults are backend-specific (per ADR 0015): the proto crate doesn’t pick them, the active backend does.
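A minimal sketch of how such a default might be applied when a request omits max_tokens. The constant's real value and type are not stated on this page, so the value `1024`, the `u32` type, and the function `effective_max_tokens` are placeholders chosen for this example.

```rust
/// Placeholder value: the actual DEFAULT_V2_MAX_TOKENS is not shown here.
pub const ASSUMED_DEFAULT_V2_MAX_TOKENS: u32 = 1024;

/// Hypothetical helper: use the consumer's limit when supplied,
/// otherwise fall back to the backend-chosen default.
pub fn effective_max_tokens(requested: Option<u32>) -> u32 {
    requested.unwrap_or(ASSUMED_DEFAULT_V2_MAX_TOKENS)
}
```

Keeping the fallback next to the backend, instead of in the protocol crate, means a different backend could supply a different default without a protocol change.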

TokenStream: Stream of TokenEvent values produced by a backend during generation.
TokenStreamV2: Stream of TokenEventV2 values produced by a backend during a v2 generation. Dropping the stream cancels the in-flight generation.
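The drop-to-cancel behaviour can be illustrated with a small, synchronous stand-in built on the standard library only. The real TokenStreamV2 is presumably an asynchronous stream, and its internals are not shown on this page. `StreamSketch` and its shared cancellation flag are assumptions for this example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stream handle. A producer (not modelled here) would poll
/// the shared flag and stop generating once it is set.
pub struct StreamSketch {
    cancelled: Arc<AtomicBool>,
    remaining: Vec<String>, // stored reversed so `pop` yields in order
}

impl StreamSketch {
    /// Returns the stream plus the flag the producer side would watch.
    pub fn new(tokens: Vec<String>) -> (Self, Arc<AtomicBool>) {
        let cancelled = Arc::new(AtomicBool::new(false));
        let mut remaining = tokens;
        remaining.reverse();
        let stream = Self {
            cancelled: Arc::clone(&cancelled),
            remaining,
        };
        (stream, cancelled)
    }
}

impl Iterator for StreamSketch {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        self.remaining.pop()
    }
}

impl Drop for StreamSketch {
    /// Dropping the handle signals cancellation. The flag is also set when
    /// the stream is dropped after finishing, which a producer can ignore.
    fn drop(&mut self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }
}
```

A consumer that loses interest simply lets the stream go out of scope, with no separate cancel call needed.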