oxillama 0.1.3

Pure Rust LLM inference engine — the sovereign alternative to llama.cpp (meta crate)
Documentation
# OxiLLaMa Recipes

Task-oriented code recipes for common OxiLLaMa use-cases.
Each recipe is a self-contained, compilable snippet (not runnable without a real GGUF file).

---

## Recipe 1 — Load a GGUF and generate 100 tokens

Stream 100 tokens from a locally-stored GGUF model to stdout.

```rust,no_run
use oxillama::runtime::{EngineConfig, InferenceEngine, SamplerConfig};

fn main() -> anyhow::Result<()> {
    let config = EngineConfig {
        model_path: "llama-3-8b-instruct.Q4_K_M.gguf".to_string(),
        num_threads: 4,
        ..EngineConfig::default()
    };

    let mut engine = InferenceEngine::new(config);
    engine.load_model()?;

    let sampler = SamplerConfig {
        temperature: 0.8,
        top_k: 40,
        ..SamplerConfig::default()
    };

    let prompt = "The Rust programming language is great because";

    engine.generate_with_config(prompt, 100, sampler, |tok| {
        use std::io::Write as _;
        print!("{tok}");
        let _ = std::io::stdout().flush();
    })?;
    println!();

    Ok(())
}
```

---

## Recipe 2 — Serve an OpenAI-compatible API on port 8080

Spin up an HTTP server that accepts `/v1/completions` and `/v1/chat/completions`
requests from any OpenAI client (requires the `server` feature, enabled by default).

```rust,no_run
#[cfg(feature = "server")]
fn main() -> anyhow::Result<()> {
    use std::sync::Arc;
    use oxillama::runtime::{EngineConfig, InferenceEngine};
    use oxillama::server::{
        build_app, spawn_inference_worker, AppState, ServerConfig,
    };

    // 1. Load the model.
    let engine_config = EngineConfig {
        model_path: "llama-3-8b-instruct.Q4_K_M.gguf".to_string(),
        num_threads: 8,
        ..EngineConfig::default()
    };
    let mut engine = InferenceEngine::new(engine_config);
    engine.load_model()?;

    // 2. Cache read-only metadata before the engine moves into the worker.
    let cached_sampler = engine.config().sampler.clone();
    let hidden_size = engine.hidden_size().unwrap_or(0);
    let vocab_bytes = engine.vocab_bytes().map(Arc::new);

    // 3. Spawn the background inference worker.
    //    Route handlers communicate with it via an mpsc channel.
    let (tx, rx) = tokio::sync::mpsc::channel(64);
    let prefix_cache = Arc::new(std::sync::Mutex::new(
        oxillama::runtime::PrefixKvCache::new(oxillama::runtime::PrefixCacheConfig::default()),
    ));
    let loras: Arc<std::sync::RwLock<std::collections::HashMap<
        String,
        Arc<oxillama::arch::lora::LoadedLora>,
    >>> = Arc::new(std::sync::RwLock::new(std::collections::HashMap::new()));
    spawn_inference_worker(engine, rx, prefix_cache, loras);

    // 4. Build shared state + axum router.
    let state = Arc::new(AppState::new(
        tx,
        "llama-3-8b-instruct".to_string(),
        cached_sampler,
        vocab_bytes,
        hidden_size,
    ));
    let app = build_app(Arc::clone(&state));

    // 5. Block on the Tokio runtime.
    let server_config = ServerConfig {
        host: "127.0.0.1".to_string(),
        port: 8080,
        ..ServerConfig::default()
    };
    let bind_addr = format!("{}:{}", server_config.host, server_config.port);

    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?;

    rt.block_on(async move {
        let listener = tokio::net::TcpListener::bind(&bind_addr).await?;
        axum::serve(listener, app)
            .with_graceful_shutdown(oxillama::server::shutdown_signal())
            .await
            .map_err(anyhow::Error::from)
    })
}

#[cfg(not(feature = "server"))]
fn main() {
    eprintln!("Enable the `server` feature: cargo run --features server");
}
```

---

## Recipe 3 — Load a LoRA adapter at runtime

Load a LoRA adapter from a GGUF file, push it onto the engine's LoRA stack,
and generate text with the adapted model. Swap to a second adapter without
reloading the base model.

```rust,no_run
use std::sync::Arc;
use oxillama::runtime::{EngineConfig, InferenceEngine};
use oxillama::arch::lora::LoadedLora;

fn main() -> anyhow::Result<()> {
    // 1. Load the base model.
    let config = EngineConfig {
        model_path: "llama-3-8b.gguf".to_string(),
        ..EngineConfig::default()
    };
    let mut engine = InferenceEngine::new(config);
    engine.load_model()?;

    // 2. Load and push the first LoRA adapter.
    let lora_a = Arc::new(LoadedLora::load("adapter-style-a.gguf")?);
    engine.push_lora(Arc::clone(&lora_a), 1.0); // scale = 1.0
    engine.apply_lora_stack()?;

    engine.generate("Describe Paris in one sentence", 60, |tok| print!("{tok}"))?;
    println!();

    // 3. Hot-swap to a second adapter — no model reload needed.
    let _popped = engine.pop_lora();
    let lora_b = Arc::new(LoadedLora::load("adapter-style-b.gguf")?);
    engine.push_lora(lora_b, 0.8); // lower scale for subtle influence
    engine.apply_lora_stack()?;

    engine.generate("Describe Tokyo in one sentence", 60, |tok| print!("{tok}"))?;
    println!();

    // 4. Return to the bare base model.
    engine.clear_loras();

    Ok(())
}
```

---

## Recipe 4 — Run speculative decoding with a smaller draft model

Use a small, fast draft model to propose candidate tokens that are verified by
the large target model, achieving higher throughput with identical output distribution.

```rust,no_run
use oxillama::runtime::{EngineConfig, SpeculativeConfig, SpeculativeEngine};

fn main() -> anyhow::Result<()> {
    let target_config = EngineConfig {
        model_path: "llama-3-70b.Q4_K_M.gguf".to_string(),
        num_threads: 8,
        ..EngineConfig::default()
    };

    let draft_config = EngineConfig {
        model_path: "llama-3-1b.Q4_K_M.gguf".to_string(),
        num_threads: 2,
        ..EngineConfig::default()
    };

    // The draft model proposes 5 tokens per round; the target verifies them.
    let mut spec_config = SpeculativeConfig::new(target_config, draft_config);
    spec_config.num_speculative = 5;

    let mut engine = SpeculativeEngine::new(spec_config)?;

    engine.generate(
        "Explain the Rust borrow checker in simple terms:",
        256,
        |tok| {
            use std::io::Write as _;
            print!("{tok}");
            let _ = std::io::stdout().flush();
        },
    )?;
    println!();

    Ok(())
}
```

---

## Recipe 5 — Snapshot a session and resume it in a new process

Capture the full KV cache + sampler state after a prefill pass, persist it to
disk, then resume from that exact position in a separate process.

```rust,no_run
use std::path::Path;
use oxillama::runtime::{EngineConfig, InferenceEngine};

// ── Process A: generate some tokens then snapshot ────────────────────────────

fn snapshot_session(model_path: &str, snapshot_path: &str) -> anyhow::Result<()> {
    let config = EngineConfig {
        model_path: model_path.to_string(),
        ..EngineConfig::default()
    };
    let mut engine = InferenceEngine::new(config);
    engine.load_model()?;

    // Prefill the prompt into the KV cache.
    let prompt_tokens = engine.tokenize("The capital of France is")?;
    engine.prefill(&prompt_tokens)?;

    // Capture the session snapshot as a byte blob.
    let snapshot_bytes = engine.snapshot()?;
    std::fs::write(snapshot_path, &snapshot_bytes)?;
    println!("Snapshot written to {snapshot_path} ({} bytes)", snapshot_bytes.len());

    Ok(())
}

// ── Process B: resume from the snapshot and continue generation ───────────────

fn resume_session(model_path: &str, snapshot_path: &str) -> anyhow::Result<()> {
    let snapshot_bytes = std::fs::read(snapshot_path)?;

    // resume() validates the model fingerprint against the on-disk file
    // so it rejects mismatched or corrupted model files.
    let mut engine = InferenceEngine::resume(&snapshot_bytes, Path::new(model_path))?;

    // Continue generating from the restored KV position.
    engine.generate("", 50, |tok| print!("{tok}"))?;
    println!();

    Ok(())
}

fn main() -> anyhow::Result<()> {
    let model = "llama-3-8b.gguf";
    let snap = "/tmp/session.oxisnap";
    snapshot_session(model, snap)?;
    resume_session(model, snap)?;
    Ok(())
}
```

---

## Recipe 6 — Build a browser chat app with oxillama-wasm

Run OxiLLaMa entirely in the browser via WebAssembly. Load the model bytes with
`fetch`, initialise the engine from the in-memory buffer, then drive a simple
streaming chat loop from JavaScript.

```js,no_run
// index.js (browser, ES module)
import init, { WasmEngine } from "./oxillama_wasm.js";

async function main() {
  // Initialise the WASM module.
  await init();

  // Fetch the GGUF model into an ArrayBuffer.
  const resp = await fetch("/models/qwen3-0.6b.Q4_K_M.gguf");
  const modelBytes = new Uint8Array(await resp.arrayBuffer());

  // Fetch the tokenizer.json.
  const tokResp = await fetch("/models/tokenizer.json");
  const tokJson = await tokResp.text();

  // Create the engine and load the model from bytes (no filesystem access needed).
  const engine = new WasmEngine();
  await engine.loadModelFromBytes(modelBytes, tokJson);

  const output = document.getElementById("output");
  const prompt = document.getElementById("prompt").value;

  // Stream tokens into the DOM.
  await engine.generate(prompt, 256, (tok) => {
    output.textContent += tok;
  });
}

main().catch(console.error);
```

The Rust side (`oxillama-wasm`) exposes `WasmEngine` via `wasm_bindgen`.
See `crates/oxillama-wasm/src/lib.rs` for the full binding.

---

## Recipe 7 — Resume an interrupted HuggingFace pull

When a large GGUF download is interrupted, use `GgufModel::resume` to validate
the partial file against its checkpoint sidecar (`.oxiresume`) and finish loading
once the download completes.

```rust,no_run
use oxillama::gguf::GgufModel;

fn main() -> anyhow::Result<()> {
    let gguf_path = "/models/llama-3-70b.Q4_K_M.gguf";

    // GgufModel::resume reads the adjacent `.oxiresume` sidecar.
    // Returns Ok(None) if no checkpoint exists (nothing to resume).
    match GgufModel::resume(gguf_path)? {
        None => {
            println!("No resume checkpoint found — starting a fresh download.");
            // ... trigger your download logic here ...
        }
        Some(handle) => {
            println!(
                "Checkpoint validated. Last valid offset: {}",
                handle.checkpoint.last_valid_offset
            );

            if handle.checkpoint.tensors_fully_loaded {
                // The download was already complete; load the model directly.
                // Remove the sidecar before calling finish() because finish()
                // consumes the handle (takes ownership of `self`).
                handle.remove_checkpoint()?;
                let model = handle.finish()?;
                println!(
                    "Model loaded: {} tensors",
                    model.file.header.tensor_count
                );
            } else {
                println!(
                    "Download incomplete (expected {} bytes). \
                     Resume your download from offset {}.",
                    handle.checkpoint.file_size_expected,
                    handle.checkpoint.last_valid_offset
                );
                // ... resume download, then call handle.finish() when done ...
            }
        }
    }

    Ok(())
}
```

---

## Recipe 8 — Load a sharded 70B model

Load a model split across multiple GGUF shards (HuggingFace multi-part format)
as a single unified logical model. Provide the path to any one shard — all
siblings are auto-discovered.

```rust,no_run
use oxillama::gguf::ShardedGgufModel;

fn main() -> anyhow::Result<()> {
    // Shards follow HuggingFace naming: <base>-NNNNN-of-MMMMM.gguf
    // Pass the path to shard 1; the rest are auto-discovered in the same dir.
    let sharded = ShardedGgufModel::load_sharded(
        "/models/llama-3-70b-00001-of-00008.gguf",
    )?;

    println!(
        "Loaded sharded model: {:?}",
        sharded
    );

    // The sharded model exposes the same architecture/tensor API as a single
    // GgufModel. Pass it to your architecture builder as usual.
    // (Use the Runtime's InferenceEngine with a model_path pointing at shard 1
    // when the runtime gains native shard support.)

    Ok(())
}
```