§llama-engine
The “narrow waist” of the llama.rs stack. Defines the core LlamaEngine trait
and associated types that all other crates depend on. Implementations can swap
CPU/Metal/FFI backends without changing application code.
§Design Notes
§Interior Mutability
LlamaEngine methods take &self (not &mut self) to allow shared access across
multiple sessions and to enable concurrent inference without requiring exclusive
borrows or external synchronization at call sites. A backend that mutates internal
state (e.g., a KV cache) must therefore use interior mutability (Mutex, RwLock, etc.)
and remains responsible for the synchronization needed to keep that state thread-safe.
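A minimal sketch of the pattern described above: methods take `&self`, and the backend hides its mutable state behind a `Mutex`. The trait shape and names here (`Engine`, `decode_one`, `CpuBackend`) are illustrative assumptions, not the crate's actual API.

```rust
use std::sync::Mutex;

// Hypothetical trait shape: methods take &self, not &mut self.
trait Engine {
    fn decode_one(&self, token: i32) -> i32;
}

// The backend wraps its mutable state in a Mutex (interior mutability),
// so callers can share a &Engine across sessions or threads without
// needing exclusive borrows or external locking.
struct CpuBackend {
    // Stand-in for mutable decode state such as a KV cache.
    state: Mutex<Vec<i32>>,
}

impl Engine for CpuBackend {
    fn decode_one(&self, token: i32) -> i32 {
        // Internal synchronization happens here, invisible to callers.
        let mut state = self.state.lock().unwrap();
        state.push(token);
        // Toy result: the number of tokens decoded so far.
        state.len() as i32
    }
}

fn main() {
    let engine = CpuBackend { state: Mutex::new(Vec::new()) };
    // &self suffices even though decoding mutates internal state.
    assert_eq!(engine.decode_one(7), 1);
    assert_eq!(engine.decode_one(8), 2);
}
```

The same shape works for an `Arc<CpuBackend>` shared across threads, since `Mutex<T>` makes the state `Sync` when `T` is `Send`.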
§Token Type
TokenId is aliased as i32 for FFI compatibility, though token IDs are logically
non-negative. This will be reconsidered if a u32/usize conversion barrier emerges.
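To illustrate the alias and the non-negativity caveat, a checked conversion can reject negative ids at the boundary. `TokenId` matches the alias described above; the `to_index` helper is a hypothetical example, not part of the crate.

```rust
/// FFI-friendly token id, aliased to i32 as described above.
pub type TokenId = i32;

/// Hypothetical helper: convert a TokenId to a usize table index,
/// rejecting negative ids (which are logically invalid).
fn to_index(id: TokenId) -> Option<usize> {
    usize::try_from(id).ok()
}

fn main() {
    assert_eq!(to_index(42), Some(42));
    // A negative id never indexes a vocabulary table.
    assert_eq!(to_index(-1), None);
}
```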
Structs§
- DecodeResult - Result of a single decode step.
- ModelHandle - Opaque handle to a loaded model.
- ModelSpec - Specification for loading a model.
- PrefillResult - Result of the prefill phase (prompt processing).
- Session - Represents an active inference session with its own KV cache state.
Enums§
- LlamaError - Top-level error type for all engine operations.
Traits§
- LlamaEngine - The core engine trait; everything else plugs into this.
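Assembling the items listed above, the trait might be shaped roughly as follows. Only the type names come from this page; every field, method name, and signature is an assumption for illustration, not the crate's real API.

```rust
// Illustrative-only sketch assembled from the items listed above.
pub type TokenId = i32;

pub struct ModelSpec { pub path: String }      // Specification for loading a model
pub struct ModelHandle(pub u64);               // Opaque handle to a loaded model
pub struct Session { pub kv_len: usize }       // Per-session KV cache state
pub struct PrefillResult { pub prompt_tokens: usize }
pub struct DecodeResult { pub token: TokenId }

#[derive(Debug)]
pub enum LlamaError {
    LoadFailed(String),
    DecodeFailed(String),
}

/// Hypothetical trait shape: note &self throughout, per the design notes,
/// while per-session mutable state travels in &mut Session.
pub trait LlamaEngine {
    fn load(&self, spec: &ModelSpec) -> Result<ModelHandle, LlamaError>;
    fn new_session(&self, model: &ModelHandle) -> Result<Session, LlamaError>;
    fn prefill(&self, session: &mut Session, prompt: &[TokenId])
        -> Result<PrefillResult, LlamaError>;
    fn decode(&self, session: &mut Session) -> Result<DecodeResult, LlamaError>;
}

fn main() {
    let spec = ModelSpec { path: "model.gguf".to_string() };
    assert_eq!(spec.path, "model.gguf");
}
```

Application code written against `dyn LlamaEngine` (or a generic bound) is what lets CPU, Metal, and FFI backends swap without changes, as the description promises.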