oxillama-runtime
Full inference runtime for transformer LLMs — KV cache, sampling, tokenizer, and advanced decoding.
Part of the OxiLLaMa workspace — a Pure Rust LLM inference engine.
Status
Version: 0.1.2 — Tests: 370 passing — Completion: ~98% — Status: Alpha
What It Provides
- InferenceEngine: single-batch and continuous-batch forward pass over any architecture
- Paged KV cache: block-manager for efficient memory reuse across multiple requests
- FlashAttention: tiled CPU kernel (`BQ=BK=64`, online softmax, causal masking) dispatched above `FLASH_ATTN_THRESHOLD=512` tokens
- Continuous batching: `BatchedKvView` trait + `KvSlot` struct + `VecBatchedKvView` for true per-request KV slot isolation
- Sampling pipeline: greedy, top-K, top-P (nucleus), min-P, temperature, repetition penalty, mirostat v1/v2, grammar-constrained (GBNF)
- Tokenizer bridge: HuggingFace `tokenizers` with `onig` (native) or `unstable_wasm` (pure-Rust) regex backends
- LoRA adapters: load and hot-swap rank-decomposition adapters at runtime via `LoraStack`
- Speculative decoding: draft-model + verifier pipeline (`SpeculativeEngine`) with delta-sync KV resync
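
To make the sampling stages listed above concrete, here is a minimal, self-contained sketch of temperature scaling followed by top-K and top-P filtering. It is an illustration only and does not use the crate's API: the function names `softmax_with_temperature`, `filter_top_k_top_p`, and `greedy` are hypothetical, and the crate's `SamplerConfig` may order or implement these stages differently.

```rust
/// Convert raw logits to probabilities at the given temperature.
/// Subtracting the max logit before `exp` avoids overflow.
/// (Hypothetical helper, not part of oxillama-runtime.)
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let t = temperature.max(1e-6); // guard against division by zero
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / t).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Keep the `top_k` most probable tokens, then the shortest prefix of those
/// whose cumulative probability reaches `top_p`, and renormalize the result.
/// Returns `(token_id, probability)` pairs in descending probability order.
pub fn filter_top_k_top_p(probs: &[f32], top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    let mut ranked: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(top_k.max(1));

    // Nucleus cut: count tokens until the cumulative mass reaches top_p.
    let mut cumulative = 0.0f32;
    let mut keep = 0;
    for &(_, p) in &ranked {
        cumulative += p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    ranked.truncate(keep);

    // Renormalize so the surviving candidates sum to 1.
    let total: f32 = ranked.iter().map(|&(_, p)| p).sum();
    ranked.into_iter().map(|(id, p)| (id, p / total)).collect()
}

/// Greedy decoding: take the most probable surviving candidate, if any.
pub fn greedy(candidates: &[(usize, f32)]) -> Option<usize> {
    candidates.first().map(|&(id, _)| id)
}
```

A stochastic sampler would draw from the renormalized candidates instead of calling `greedy`. Min-P, repetition penalty, mirostat, and grammar masking are omitted here to keep the sketch short.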
Key Types
| Type | Description |
|---|---|
| `InferenceEngine` | Main engine; wraps model + cache + sampler |
| `SamplerConfig` | Builder for all sampling hyper-parameters |
| `PagedKvCache` | Block-paged KV store with eviction policy |
| `SpeculativeEngine` | Draft+target model pair for speculative decoding |
| `LoadedLora` | In-memory LoRA adapter ready to apply |
| `Grammar` / `GrammarState` | GBNF grammar parser and logit-mask state machine |
| `Scheduler` | Continuous-batching scheduler with prefill priority and chunked prefill |
| `RuntimeError` | Unified error wrapping `ArchError`, `GgufError`, `QuantError` |
| `EngineSnapshot` / `ModelFingerprint` | Session snapshot and resume via oxicode (new in v0.1.1) |
| `ToolDispatcher` / `ToolCallDetector` / `ToolCall` / `NoOpDispatcher` | Tool/function-calling trait and helpers (new in v0.1.1) |
| `SpeculativeDecoder` / `AsyncSpecConfig` / `SpecStats` | Async speculative decoding pipeline (new in v0.1.1) |
| `PrefixKvCache` / `PrefixCacheConfig` | Prompt-prefix KV cache with radix-tree lookup (new in v0.1.1) |
| `KvCachePool` | Pooled KV cache allocator for multi-request reuse (new in v0.1.1) |
| `EngineMetrics` / `MetricsSnapshot` | Prometheus-compatible lock-free counters (new in v0.1.1) |
| `SequencePool` / `SsmStatePool` | Attention and SSM sequence state pools (new in v0.1.1) |
| `KvCacheAccess` | Trait extension: `kv_dim`, `for_each_key`, `for_each_value` with contiguous defaults and `PagedKvCache` multi-page overrides (new in v0.1.2) |
| `BatchedKvView` / `KvSlot` | Moved to `oxillama-arch/traits.rs`; re-exported from `oxillama-runtime` for backwards compatibility (new in v0.1.2) |
| `ForwardPass::forward_batched` | Default impl on the `ForwardPass` trait; LLaMA proof-of-concept continuous-batch forward (new in v0.1.2) |
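
The idea behind a block-paged KV store can be sketched in a few lines: each sequence owns a table of fixed-size physical blocks, a new block is taken from a free list only when the current one fills up, and finished sequences return their blocks for reuse. The `BlockManager` type below is hypothetical and written only for illustration. It is not the crate's `PagedKvCache`, and it leaves out the eviction policy and the tensor storage itself.

```rust
use std::collections::HashMap;

/// Hypothetical block manager: hands out fixed-size KV blocks to sequences
/// and recycles them when a sequence is released.
pub struct BlockManager {
    block_size: usize,                // tokens per block
    free: Vec<usize>,                 // ids of unused physical blocks
    tables: HashMap<u64, Vec<usize>>, // per-sequence block table
    lengths: HashMap<u64, usize>,     // tokens stored per sequence
}

impl BlockManager {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self {
            block_size,
            // Reversed so that `pop` hands out block 0 first.
            free: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
            lengths: HashMap::new(),
        }
    }

    /// Reserve a slot for one more token of `seq`.
    /// Returns `(physical_block, offset_in_block)`, or `None` when a new
    /// block is needed but the pool is exhausted.
    pub fn append_token(&mut self, seq: u64) -> Option<(usize, usize)> {
        let len = *self.lengths.get(&seq).unwrap_or(&0);
        let offset = len % self.block_size;
        if offset == 0 {
            // The sequence is new or its last block is full: take a fresh block.
            let block = self.free.pop()?;
            self.tables.entry(seq).or_default().push(block);
        }
        let block = *self.tables.get(&seq)?.last()?;
        self.lengths.insert(seq, len + 1);
        Some((block, offset))
    }

    /// Return all blocks owned by `seq` to the free list.
    pub fn release(&mut self, seq: u64) {
        if let Some(blocks) = self.tables.remove(&seq) {
            self.free.extend(blocks);
        }
        self.lengths.remove(&seq);
    }

    /// Number of blocks currently available for allocation.
    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
}
```

Because allocation happens one block at a time, memory held by a short request is bounded by its rounded-up length rather than by the maximum context size, which is what makes reuse across many concurrent requests cheap.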
Usage
use ;
Feature Flags
| Feature | Default | Description |
|---|---|---|
| `llama` | yes | LLaMA 2/3/4 architecture |
| `qwen3` | yes | Qwen3 architecture |
| `mistral` | yes | Mistral / Mixtral architecture |
| `gemma` | yes | Gemma 2/3 architecture |
| `phi` | yes | Phi-3/4 architecture |
| `command-r` | yes | Command-R architecture |
| `starcoder` | yes | StarCoder 2 architecture |
| `tokenizer-wasm` | yes | HF tokenizers with pure-Rust regex (required for WASM) |
| `tokenizer-onig` | no | HF tokenizers with Oniguruma regex (native desktop alternative) |
| `parallel` | no | Multi-threaded tensor ops via rayon |
| `native-async` | no | Tokio-backed async engine API |
| `mmap` | no | Memory-mapped model file loading |
| `offload` | no | Tensor offload to secondary storage |
License
Apache-2.0 — COOLJAPAN OU (Team Kitasan)