oxillama-runtime 0.1.0

Inference engine — KV cache, sampling, tokenizer bridge

oxillama-runtime

Full inference runtime for transformer LLMs — KV cache, sampling, tokenizer, and advanced decoding.

Part of the OxiLLaMa workspace, a pure-Rust LLM inference engine.

What It Provides

  • InferenceEngine: single-batch and continuous-batch forward passes over any supported architecture (see Feature Flags)
  • Paged KV cache: a block manager that reuses cache memory across concurrent requests
  • Sampling pipeline: greedy, top-K, top-P (nucleus), min-P, temperature, repetition penalty, mirostat v1/v2, grammar-constrained (GBNF)
  • Tokenizer bridge: HuggingFace tokenizers with onig (native) or unstable_wasm (pure-Rust) regex backends
  • LoRA adapters: load and hot-swap rank-decomposition adapters at runtime
  • Speculative decoding: draft-model + verifier pipeline (SpeculativeEngine)
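
To make the draft-model + verifier idea concrete, here is a minimal sketch of the acceptance step in its simplest (greedy) form. It is an illustration only and not SpeculativeEngine's API: the function name `verify_greedy` and its slice-based signature are assumptions introduced here, and this README does not say whether the crate uses greedy matching or a sampling-based acceptance rule.

```rust
/// Greedy verification step of speculative decoding (hypothetical helper,
/// not part of oxillama-runtime).
///
/// `draft` holds the tokens proposed by the draft model.
/// `target` holds the target model's greedy choice at each of those
/// positions, plus one extra choice for the position after the last draft
/// token, so it must contain `draft.len() + 1` entries.
///
/// Returns the tokens to commit: the longest prefix of `draft` that the
/// target agrees with, followed by the target's own token at the first
/// disagreement (or its extra token if every draft token was accepted).
pub fn verify_greedy(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1);

    // Count how many leading draft tokens match the target's choices.
    let accepted = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();

    // Keep the accepted prefix, then append the target's token.
    let mut committed = draft[..accepted].to_vec();
    committed.push(target[accepted]);
    committed
}
```

In this sketch every call commits at least one token, so generation always advances even when the draft model's first guess is wrong.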

Key Types

Type               Description
InferenceEngine    Main engine; wraps model + cache + sampler
SamplerConfig      Builder for all sampling hyper-parameters
PagedKvCache       Block-paged KV store with eviction policy
SpeculativeEngine  Draft+target model pair for speculative decoding
LoadedLora         In-memory LoRA adapter ready to apply
RuntimeError       Unified error wrapping ArchError, GgufError, QuantError
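
The block-paged idea behind PagedKvCache can be sketched as a small allocator that hands out fixed-size blocks to requests and returns them to a free list when a request finishes. This is a simplified illustration under stated assumptions, not the crate's implementation: `BlockManager`, its fields, and its methods are hypothetical names, and the eviction policy mentioned in the table is not modeled.

```rust
use std::collections::HashMap;

/// Hypothetical block allocator illustrating paged KV storage.
pub struct BlockManager {
    block_size: usize,                // tokens stored per block
    free_blocks: Vec<usize>,          // physical block ids not in use
    tables: HashMap<u64, Vec<usize>>, // request id -> its block table
    lengths: HashMap<u64, usize>,     // request id -> tokens stored so far
}

impl BlockManager {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0);
        Self {
            block_size,
            // Reversed so that `pop` hands out block 0 first.
            free_blocks: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
            lengths: HashMap::new(),
        }
    }

    /// Reserves a slot for one more token of `request`.
    /// Returns (physical block id, offset within block), or None when
    /// a new block is needed and none is free.
    pub fn append_token(&mut self, request: u64) -> Option<(usize, usize)> {
        let len = *self.lengths.get(&request).unwrap_or(&0);
        let offset = len % self.block_size;
        if offset == 0 {
            // First token, or the last block is full: take a fresh block.
            let block = self.free_blocks.pop()?;
            self.tables.entry(request).or_default().push(block);
        }
        let block = *self.tables.get(&request)?.last()?;
        self.lengths.insert(request, len + 1);
        Some((block, offset))
    }

    /// Returns all blocks held by `request` to the free list.
    pub fn release(&mut self, request: u64) {
        if let Some(blocks) = self.tables.remove(&request) {
            self.free_blocks.extend(blocks);
        }
        self.lengths.remove(&request);
    }

    pub fn free_block_count(&self) -> usize {
        self.free_blocks.len()
    }
}
```

Because blocks are fixed-size and tracked per request, memory released by a finished request can be handed to another request immediately, without needing one contiguous buffer per sequence.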

Usage

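use oxillama_runtime::{InferenceEngine, SamplerConfig, RuntimeResult};

fn generate(model_path: &str, prompt: &str) -> RuntimeResult<String> {
    // Load the model from a GGUF file; `?` propagates any load error.
    let engine = InferenceEngine::from_gguf(model_path)?;

    // Sampling settings: temperature 0.8, nucleus (top-P) cutoff 0.95,
    // and at most 256 newly generated tokens.
    let sampler = SamplerConfig::builder()
        .temperature(0.8)
        .top_p(0.95)
        .max_new_tokens(256)
        .build();

    // Run generation on the prompt and return the produced text.
    let output = engine.generate(prompt, &sampler)?;
    Ok(output)
}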
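
The temperature and top_p settings above can be read as successive filters on the model's output distribution. The following is a rough, dependency-free sketch of that idea, not the crate's sampler: `filter_logits` and its parameters are hypothetical names introduced here, and the order of the filters (temperature, then top-K, then top-P) is an assumption rather than something this README specifies.

```rust
/// Hypothetical helper: applies temperature scaling, top-K, and top-P
/// (nucleus) filtering to raw logits and returns the surviving
/// (token id, probability) pairs, renormalized to sum to 1 and sorted
/// from most to least likely.
pub fn filter_logits(
    logits: &[f32],
    temperature: f32,
    top_k: usize, // 0 disables top-K in this sketch
    top_p: f32,
) -> Vec<(usize, f32)> {
    if logits.is_empty() {
        return Vec::new();
    }

    // Temperature scaling; clamp to avoid dividing by zero.
    let t = temperature.max(1e-6);
    let scaled: Vec<f32> = logits.iter().map(|&l| l / t).collect();

    // Numerically stable softmax: subtract the maximum before exponentiating.
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps
        .iter()
        .enumerate()
        .map(|(i, &e)| (i, e / sum))
        .collect();

    // Sort by probability, highest first.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));

    // Top-K: keep only the K most likely tokens.
    if top_k > 0 {
        probs.truncate(top_k);
    }

    // Top-P: keep the smallest prefix whose cumulative probability
    // (measured on the softmax above) reaches `top_p`.
    let mut cumulative = 0.0;
    let mut keep = 0;
    for &(_, p) in &probs {
        cumulative += p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    probs.truncate(keep);

    // Renormalize the survivors so they form a distribution again.
    let kept_mass: f32 = probs.iter().map(|&(_, p)| p).sum();
    for entry in probs.iter_mut() {
        entry.1 /= kept_mass;
    }
    probs
}
```

Greedy decoding corresponds to taking the first entry of the returned list; a stochastic sampler would instead draw a token at random according to the returned probabilities. Lower temperatures concentrate probability on the top token, so fewer candidates survive the top-P cutoff.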

Feature Flags

Feature         Default  Description
llama           yes      LLaMA 2/3/4 architecture
qwen3           yes      Qwen3 architecture
mistral         yes      Mistral / Mixtral architecture
gemma           yes      Gemma 2/3 architecture
phi             yes      Phi-3/4 architecture
command-r       yes      Command-R architecture
starcoder       yes      StarCoder 2 architecture
tokenizer-onig  yes      HF tokenizers with Oniguruma regex (recommended for desktop)
tokenizer-wasm  no       HF tokenizers with pure-Rust regex (required for WASM)

License

Apache-2.0 — COOLJAPAN OU (Team Kitasan)