§llama-engine
The “narrow waist” of the llama.rs stack. Defines the core LlamaEngine trait
and associated types that all other crates depend on. Implementations can swap
CPU/Metal/FFI backends without changing application code.
§Design Notes
§Interior Mutability
LlamaEngine methods take &self (not &mut self) to allow shared access across
multiple sessions and to enable concurrent inference without requiring exclusive
borrows or external synchronization at call sites. A backend that mutates internal
state (e.g., a KV cache) must therefore use interior mutability (Mutex, RwLock, etc.)
and remains responsible for the synchronization needed to keep that state thread-safe.
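A minimal sketch of the pattern described above: methods take `&self`, and the backend hides its mutable state behind a `Mutex`. The trait shape and names here (`Engine`, `decode_one`, `CpuBackend`) are illustrative assumptions, not the crate's actual API.

```rust
use std::sync::Mutex;

// Hypothetical trait shape: methods take &self, not &mut self.
trait Engine {
    fn decode_one(&self, token: i32) -> i32;
}

// The backend wraps its mutable state in a Mutex (interior mutability),
// so callers can share a &Engine across sessions or threads without
// needing exclusive borrows or external locking.
struct CpuBackend {
    // Stand-in for mutable decode state such as a KV cache.
    state: Mutex<Vec<i32>>,
}

impl Engine for CpuBackend {
    fn decode_one(&self, token: i32) -> i32 {
        // Internal synchronization happens here, invisible to callers.
        let mut state = self.state.lock().unwrap();
        state.push(token);
        // Toy result: the number of tokens decoded so far.
        state.len() as i32
    }
}

fn main() {
    let engine = CpuBackend { state: Mutex::new(Vec::new()) };
    // &self suffices even though decoding mutates internal state.
    assert_eq!(engine.decode_one(7), 1);
    assert_eq!(engine.decode_one(8), 2);
}
```

The same shape works for an `Arc<CpuBackend>` shared across threads, since `Mutex<T>` makes the state `Sync` when `T` is `Send`.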
§Token Type
TokenId is aliased as i32 for FFI compatibility, though token IDs are logically
non-negative. This will be reconsidered if a u32/usize conversion barrier emerges.
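To illustrate the alias and the non-negativity caveat, a checked conversion can reject negative ids at the boundary. `TokenId` matches the alias described above; the `to_index` helper is a hypothetical example, not part of the crate.

```rust
/// FFI-friendly token id, aliased to i32 as described above.
pub type TokenId = i32;

/// Hypothetical helper: convert a TokenId to a usize table index,
/// rejecting negative ids (which are logically invalid).
fn to_index(id: TokenId) -> Option<usize> {
    usize::try_from(id).ok()
}

fn main() {
    assert_eq!(to_index(42), Some(42));
    // A negative id never indexes a vocabulary table.
    assert_eq!(to_index(-1), None);
}
```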
Structs§
- DecodeResult - Result of a single decode step.
- ModelHandle - Opaque handle to a loaded model.
- ModelSpec - Specification for loading a model.
- PrefillResult - Result of the prefill phase (prompt processing).
- Session - Represents an active inference session with its own KV cache state.
Enums§
- LlamaError - Top-level error type for all engine operations.
Traits§
- LlamaEngine - The core engine trait; everything else plugs into this.
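Assembling the items listed above, the trait might be shaped roughly as follows. Only the type names come from this page; every field, method name, and signature is an assumption for illustration, not the crate's real API.

```rust
// Illustrative-only sketch assembled from the items listed above.
pub type TokenId = i32;

pub struct ModelSpec { pub path: String }      // Specification for loading a model
pub struct ModelHandle(pub u64);               // Opaque handle to a loaded model
pub struct Session { pub kv_len: usize }       // Per-session KV cache state
pub struct PrefillResult { pub prompt_tokens: usize }
pub struct DecodeResult { pub token: TokenId }

#[derive(Debug)]
pub enum LlamaError {
    LoadFailed(String),
    DecodeFailed(String),
}

/// Hypothetical trait shape: note &self throughout, per the design notes,
/// while per-session mutable state travels in &mut Session.
pub trait LlamaEngine {
    fn load(&self, spec: &ModelSpec) -> Result<ModelHandle, LlamaError>;
    fn new_session(&self, model: &ModelHandle) -> Result<Session, LlamaError>;
    fn prefill(&self, session: &mut Session, prompt: &[TokenId])
        -> Result<PrefillResult, LlamaError>;
    fn decode(&self, session: &mut Session) -> Result<DecodeResult, LlamaError>;
}

fn main() {
    let spec = ModelSpec { path: "model.gguf".to_string() };
    assert_eq!(spec.path, "model.gguf");
}
```

Application code written against `dyn LlamaEngine` (or a generic bound) is what lets CPU, Metal, and FFI backends swap without changes, as the description promises.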