
# rullama

Browser-resident **Gemma 4 inference** in pure Rust → WebAssembly + WebGPU.
Loads the same GGUF blobs Ollama already has on disk, runs the forward pass on
your local GPU through hand-written WGSL, never touches a remote server.

The intent is a **PWA-pluggable inference engine**, not a port of Ollama-the-server.
Ollama has 275K LOC of Go that wraps llama.cpp via CGO plus model registry, CLI,
conversion tooling, multimodal pipelines — almost none of which apply to a
browser library. What survives the scope cut is the *core inference path* over
Ollama's storage format.

## Workspace

A two-crate Cargo workspace:

| Crate              | Path                      | Target        | Status                                |
|--------------------|---------------------------|---------------|---------------------------------------|
| `rullama`          | `crates/rullama`          | wasm + native | release-track                         |
| `rullama-finetune` | `crates/rullama-finetune` | native only   | experimental (M3 skeleton, see below) |

The iOS bench harness (`tools/ios-bench`) is a sibling crate excluded from the
workspace so `cargo build --workspace --target wasm32-unknown-unknown` doesn't
try to compile its staticlib for wasm.

## What works today

- ✅ **`gemma4:e2b` text inference on the desktop** loads end-to-end and
  generates greedy output bit-identical to Ollama. (`gemma4:e4b` is
  shape-compatible — pull and try it.)
- ✅ **`gemma4:e2b` text inference on iPhone** — full Q4_K_M model loaded
  into iPhone 16e (A18, 8 GB shared RAM) and streaming tokens at ~4.65 tok/s
  via a Dedicated Worker + sync OPFS path. *Multimodal towers stay
  Mac-only for now; mobile picks the text-only loader (`max_context=512`).*
- ✅ **Vision + audio multimodal** on the desktop. ViT (16 blocks, 768
  hidden) + Conformer (12 blocks, 1024 hidden) towers run on the same wgpu
  device as the text path; soft tokens splice into the prompt via
  `<|image>` / `<|audio>` sentinels. Validated bit-identical to Ollama on
  a fixed image and a 30-second pangram WAV.
- ✅ **Q4_K + Q6_K + F16 + F32** quants (the actual mix in `gemma4:e2b` Q4_K_M).
- ✅ **Streaming load** via HTTP byte-range requests *or* OPFS sync access
  handles — the 7 GB GGUF never enters wasm linear memory in bulk. The
  PWA writes to OPFS once via `FileSystemSyncAccessHandle.write()` in a
  worker, and reads tile-by-tile during inference, so the wasm peak stays
  in the tens of MiB regardless of model size.
- ✅ **Multi-turn chat** with system prompt, mid-generation Stop, persistent
  KV cache.
- ✅ **Encoder chained + per-layer submits** (M7 + M15) — one CommandEncoder
  spans each transformer layer, submitted incrementally so the GPU drains
  smoothly even on tight-RAM phones.
- 🧪 **Local LoRA fine-tuning (`rullama-finetune`, native-only)** — backward
  kernels for matmul Q4_K / Q6_K, rmsnorm, rope, geglu, attention,
  cross-entropy; Adam optimizer over GPU buffers; rank-r LoRA on attention
  + FFN projections. M3 acceptance: 200-step overfit-one drops loss from
  ~12.5 → 0 on the dev fixture. Per-position CE (single-forward variant)
  shipped in M3. Wasm32 trainer is **not** in scope.
- ❌ MoE `gemma4:26b` / `gemma4:31b` — out of scope.
- ❌ Other architectures (llama, mistral, qwen, phi).
- 🛠️ **Mobile multimodal** — desktop multimodal works; the iPhone loader
  currently skips the vision/audio towers to fit in shared RAM. Lazy
  upload for those is a follow-up.
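
The streaming-load bullet above describes reading the GGUF tile by tile rather than in bulk. A minimal sketch of that idea follows, assuming a hypothetical synchronous `RangeFetcher` trait; the crate's real `TensorFetcher` trait may look quite different. An in-memory implementation stands in for OPFS or HTTP range reads so the sketch runs without a browser.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a byte-range source (OPFS sync handle, HTTP Range).
/// This is NOT the crate's actual `TensorFetcher` signature.
trait RangeFetcher {
    fn read_range(&self, range: Range<u64>) -> Vec<u8>;
}

/// In-memory fetcher, used only so the sketch is runnable natively.
struct InMemory(Vec<u8>);

impl RangeFetcher for InMemory {
    fn read_range(&self, range: Range<u64>) -> Vec<u8> {
        self.0[range.start as usize..range.end as usize].to_vec()
    }
}

/// Split a tensor's byte span into tiles of at most `tile_bytes` each.
fn tile_ranges(offset: u64, len: u64, tile_bytes: u64) -> Vec<Range<u64>> {
    assert!(tile_bytes > 0, "tile size must be non-zero");
    let end = offset + len;
    let mut tiles = Vec::new();
    let mut start = offset;
    while start < end {
        let stop = (start + tile_bytes).min(end);
        tiles.push(start..stop);
        start = stop;
    }
    tiles
}

/// Visit a tensor tile by tile and return the largest tile held at once.
/// `upload` stands in for whatever consumes the bytes (a GPU queue write, say).
fn stream_tensor<F: RangeFetcher>(
    fetcher: &F,
    offset: u64,
    len: u64,
    tile_bytes: u64,
    mut upload: impl FnMut(&[u8]),
) -> usize {
    let mut peak = 0;
    for range in tile_ranges(offset, len, tile_bytes) {
        let tile = fetcher.read_range(range); // only this tile is resident
        peak = peak.max(tile.len());
        upload(&tile);
    }
    peak
}
```

Because each tile is dropped before the next one is read, peak host memory is bounded by `tile_bytes`, not by the tensor or file size.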

## Quickstart

You need:
- Rust ≥ 1.91 + `wasm-pack` (`cargo install wasm-pack --locked --version 0.13.1`)
- A WebGPU-capable browser (Chrome 113+, Edge 113+, recent Firefox; iOS
  Safari 17.4+ for phones)
- Ollama installed locally with `gemma4:e2b` pulled (`ollama pull gemma4:e2b`)

### Build the wasm bundle

```sh
# wasm-pack runs against the rullama crate inside the workspace; the
# output lands at `<repo>/pkg/` so both example PWAs share one bundle.
wasm-pack build crates/rullama --target web --release --out-dir ../../pkg
```

This emits `pkg/rullama.js` + `pkg/rullama_bg.wasm` + TypeScript typings.

### Two example PWAs

The repo ships two browser harnesses against the same wasm bundle:

| Path             | Stack                | Purpose                                                         |
|------------------|----------------------|-----------------------------------------------------------------|
| `examples/web/`  | React + Vite + Tailwind + Workbox SW | Production-quality chat PWA — service-worker-based offline shell, restart dialog on deploys, attachment UI, conversation history in OPFS + SQLite (`rsqlite-wasm`). |
| `examples/pwa/`  | Vanilla JS + bash    | Bench harness and `safaridriver`-driven scripted iPhone runs.   |

Pick `examples/web/` for hacking on the user-facing chat experience; pick
`examples/pwa/` for kernel benchmarks or hands-off iPhone perf reruns.

```sh
# React / Vite PWA — auto-runs the wasm bundle build via `pnpm dev`.
cd examples/web
pnpm install
pnpm dev                 # https://localhost:5173/

# Vanilla bench / iPhone harness — needs the wasm bundle built first.
CERT_FILE=~/.local/share/rullama/cert.pem KEY_FILE=~/.local/share/rullama/key.pem \
    ./examples/pwa/serve.sh
# Desktop:  https://localhost:8088/examples/pwa/index.html
# iPhone:   https://<mac-lan-ip>:8088/examples/pwa/index.html
```

The first load streams the ~7 GB blob from the local Ollama install (or an R2
mirror — see `scripts/upload-models-to-r2.sh`) through a Dedicated Worker that
owns a `FileSystemSyncAccessHandle` over OPFS. Bytes go network → sync handle
→ disk without ever pinning a Blob in the JS heap. Subsequent loads (within
the same Safari session) reuse the cached file.

### iPhone scripted runs

The vanilla PWA is fully drivable from the Mac via Apple's `safaridriver`:

```sh
# One-time setup on the phone:
#   Settings → Safari → Advanced → Remote Automation = on
#                                  Web Inspector       = on
#                                  Feature Flags → WebGPU = on
# Then on the Mac:
safaridriver -p 4444 &
./examples/pwa/iphone-session-keeper.sh &        # keep an OPFS scope alive
./examples/pwa/run-on-iphone.sh                  # navigate → Load → chat → log perf
./examples/pwa/clean-iphone.sh                   # wipe OPFS between trials
```

`/tmp/rullama-page.log` collects beacon traces from the page (`[chat]`,
`[pe]`, `[tg]`, `[gen]`, `[wkr]`, `[rs]`) so any regression in a phone
run leaves a server-side trail even after a WebContent crash.
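
When triaging a crashed phone run, a quick tally of which tags made it into the log can show how far the page got. The helper below is a hypothetical sketch and not part of the repo. It assumes only that each beacon line contains one of the bracketed tags listed above somewhere in its text; the actual line format is not documented here.

```rust
use std::collections::BTreeMap;

/// Tags named in the README; the surrounding line format is an assumption.
const TAGS: [&str; 6] = ["[chat]", "[pe]", "[tg]", "[gen]", "[wkr]", "[rs]"];

/// Count log lines per tag, attributing each line to the earliest tag it contains.
fn tally_tags(log: &str) -> BTreeMap<&'static str, usize> {
    let mut counts = BTreeMap::new();
    for line in log.lines() {
        let first = TAGS
            .iter()
            .filter_map(|t| line.find(*t).map(|pos| (pos, *t)))
            .min(); // smallest byte offset wins
        if let Some((_, tag)) = first {
            *counts.entry(tag).or_insert(0) += 1;
        }
    }
    counts
}

/// Last line carrying any known tag: the final beacon received before
/// the trace stops.
fn last_tagged_line(log: &str) -> Option<&str> {
    log.lines()
        .rev()
        .find(|line| TAGS.iter().any(|t| line.contains(*t)))
}
```

Feeding it the contents of `/tmp/rullama-page.log` (for example via `std::fs::read_to_string`) gives a per-tag count plus the last beacon seen.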

## Docker / deploy

`compose.yaml` packages the built PWA + a model-blob HTTP service behind
nginx, designed to sit behind Cloudflare. The Cargo workspace ships
`cargo docker:*` aliases (dispatched through the `xtask` crate) so the
deploy loop doesn't need shell aliases:

| Alias                  | Effective command                                                              |
|------------------------|--------------------------------------------------------------------------------|
| `cargo docker:build`   | `docker compose build`                                                         |
| `cargo docker:start`   | `docker compose up -d`                                                         |
| `cargo docker:stop`    | `docker compose down`                                                          |
| `cargo docker:restart` | `docker compose build --no-cache` then `docker compose up -d --force-recreate` |
| `cargo docker:logs`    | `docker compose logs -f --tail=200`                                            |
| `cargo docker:ps`      | `docker compose ps`                                                            |

First run compiles `xtask` (~1 s); subsequent invocations reuse the cached
binary. Add new tasks by appending a match arm in `xtask/src/main.rs` and
a corresponding line in `.cargo/config.toml`. The compose file's
`OLLAMA_MODELS_DIR` env var selects the host's model store and defaults to
`/usr/share/ollama/.ollama/models`.
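
For orientation, the dispatch in an `xtask` binary of this kind might look roughly like the sketch below. This is not the repo's actual `xtask/src/main.rs`: the `plan` and `run` function names are assumptions, and only the alias-to-command mapping is taken from the table above.

```rust
use std::process::Command;

/// Map an alias name to the `docker` argument lists it should run, in order.
/// The mapping mirrors the README table; the function itself is illustrative.
fn plan(task: &str) -> Option<Vec<Vec<&'static str>>> {
    let steps = match task {
        "docker:build" => vec![vec!["compose", "build"]],
        "docker:start" => vec![vec!["compose", "up", "-d"]],
        "docker:stop" => vec![vec!["compose", "down"]],
        "docker:restart" => vec![
            vec!["compose", "build", "--no-cache"],
            vec!["compose", "up", "-d", "--force-recreate"],
        ],
        "docker:logs" => vec![vec!["compose", "logs", "-f", "--tail=200"]],
        "docker:ps" => vec![vec!["compose", "ps"]],
        // A new task would be one more arm here, plus its alias line.
        _ => return None,
    };
    Some(steps)
}

/// Run each step in order, stopping at the first failure.
fn run(task: &str) -> Result<(), String> {
    let steps = plan(task).ok_or_else(|| format!("unknown task: {task}"))?;
    for args in steps {
        let status = Command::new("docker")
            .args(&args)
            .status()
            .map_err(|e| format!("failed to spawn docker: {e}"))?;
        if !status.success() {
            return Err(format!("docker {} exited with {status}", args.join(" ")));
        }
    }
    Ok(())
}
```

Keeping the plan as data separate from process spawning means the mapping can be unit-tested without Docker installed.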

## Native sanity checks

The same code paths run natively against host wgpu (Metal on macOS, Vulkan on
Linux). Useful for parity testing without a browser:

```sh
# Greedy parity vs Ollama (CPU oracle)
cargo run -p rullama --release --features cpu-reference --example greedy_parity -- \
    ~/.ollama/models/blobs/sha256-<digest> "Hi" 5

# Full-stack chat through the public Model API
cargo run -p rullama --release --features cpu-reference --example model_api -- \
    ~/.ollama/models/blobs/sha256-<digest> "Hi" --greedy --max=16

# Standalone chained forward (M7 perf path)
cargo run -p rullama --release --features cpu-reference --example chained_smoke -- \
    ~/.ollama/models/blobs/sha256-<digest> "Hi" --max=8
```

`--features cpu-reference` is now a no-op (the f32 oracle is always built); the
flag is kept so existing scripts keep working.

## Fine-tuning (preview)

`rullama-finetune` trains LoRA adapters against the live wgpu kernels, with no
Burn, no PyTorch, and no separate runtime. M3 scope: rank-r LoRAs on
`attn_q` / `attn_k` / `attn_v` / `attn_o` and the FFN projections, Adam, global
L2 grad clipping, gradient accumulation, mixed precision, gradient
checkpointing. Per-position CE is a single-forward variant with a ~C/2 speedup
vs. the multi-forward path. Wasm32 builds compile to an empty crate.
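
As a rough CPU-side illustration of two items in that list, global L2 grad clipping followed by an Adam update, here is a minimal sketch. The crate runs these steps as WGSL kernels over GPU buffers; the `Adam` struct, `clip_global_l2`, and the hyperparameter defaults below are assumptions for illustration, not the crate's API.

```rust
/// Scale all gradients so their combined L2 norm is at most `max_norm`.
/// Returns the norm measured before clipping.
fn clip_global_l2(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    let sum_sq: f32 = grads.iter().flatten().map(|g| g * g).sum();
    let norm = sum_sq.sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut().flatten() {
            *g *= scale;
        }
    }
    norm
}

/// Adam state for one flat parameter tensor.
/// Defaults are the commonly used values, not necessarily the crate's.
struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    step: i32,
    m: Vec<f32>, // first-moment estimate
    v: Vec<f32>, // second-moment estimate
}

impl Adam {
    fn new(len: usize, lr: f32) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, step: 0, m: vec![0.0; len], v: vec![0.0; len] }
    }

    /// One in-place update with bias-corrected moments.
    fn update(&mut self, params: &mut [f32], grads: &[f32]) {
        assert_eq!(params.len(), grads.len());
        assert_eq!(params.len(), self.m.len());
        self.step += 1;
        let bc1 = 1.0 - self.beta1.powi(self.step);
        let bc2 = 1.0 - self.beta2.powi(self.step);
        for i in 0..params.len() {
            let g = grads[i];
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g;
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g;
            let m_hat = self.m[i] / bc1;
            let v_hat = self.v[i] / bc2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```

Clipping is applied across all LoRA gradients at once (hence "global"), so the relative direction of the update is preserved and only its overall magnitude is bounded.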

```sh
# Overfit a single (prompt, target) pair — acceptance test that the
# backward path and Adam are wired correctly.
cargo run -p rullama-finetune --release --example overfit_one -- \
    ~/.ollama/models/blobs/sha256-<digest>

# Train on a JSONL dataset. See `crates/rullama-finetune/examples/data/echo.jsonl`
# for the format; env knobs documented in the example's docstring.
cargo run -p rullama-finetune --release --example train_jsonl -- \
    ~/.ollama/models/blobs/sha256-<digest> \
    crates/rullama-finetune/examples/data/echo.jsonl
```

Currently the trainer **mutates LoRAs in place against the loaded model** —
there is no on-disk checkpoint format yet, no adapter export, no inference
hand-off back to the wasm runtime. That is the next milestone.

## Architecture

```
PWA (host page) ──┐
                  ▼  postMessage RPC
  ┌──────────────────────────────────────────────────────────────────┐
  │ inference-worker.js (Dedicated Worker)                          │
  │   ▶ owns FileSystemSyncAccessHandle for the GGUF                │
  │   ▶ owns the wasm Model handle                                  │
  │     ┌──────────────────────────────────────────────────────┐    │
  │     │ wasm32 (Rust, the rullama crate)                     │    │
  │     │   Model.loadFromOpfs(read_fn, total)                 │    │
  │     │           │                                          │    │
  │     │           ▼                                          │    │
  │     │   GgufReader (header only, ~5 MB)                    │    │
  │     │           │                                          │    │
  │     │           │ TensorFetcher (OPFS sync read | HTTP Range)│
  │     │           ▼                                          │    │
  │     │   WeightCache  ─────────▶  Forward / VisionForward / │    │
  │     │   (lazy GPU upload,         GpuAudioForward          │    │
  │     │    per-tile range fetch     (per-layer encoder       │    │
  │     │    on big tensors)           submits, GPU-resident   │    │
  │     │                              KV cache)               │    │
  │     │                                  │                   │    │
  │     │                                  ▼                   │    │
  │     │                      wgpu (WebGPU / Metal / Vulkan)  │    │
  │     │                                  │                   │    │
  │     │                                  ▼                   │    │
  │     │      WGSL kernels: matmul Q4_K/Q6_K/F16, rmsnorm,    │    │
  │     │      rmsnorm_per_row, rope_neox, attention (incl.    │    │
  │     │      HPD-f16 + block-local + subgroup variants),     │    │
  │     │      conv2d, geglu, softcap, residual_add, scale,    │    │
  │     │      top_k, quick_gelu, plus backward kernels for    │    │
  │     │      training (cross_entropy, rmsnorm, rope, geglu,  │    │
  │     │      attention dQ / dKV, matmul Q4_K / Q6_K, Adam)   │    │
  │     └──────────────────────────────────────────────────────┘    │
  └──────────────────────────────────────────────────────────────────┘
                  │
                  ▲  postMessage replies (tokens, errors)
PWA renders tokens, manages chat history, handles attachments.
```

The Worker move (M15) is what unblocked iPhone inference: iOS Safari only
exposes `FileSystemSyncAccessHandle` in Worker contexts, and the Worker
isolates inference from main-thread page-watchdog reapers.

The reference Go implementation lives in Ollama's tree under
`model/models/gemma4/`. Every op in `crates/rullama/src/reference/forward.rs`
(CPU oracle), `forward_chained.rs` (production GPU forward),
`multimodal/vision.rs`, and `multimodal/audio.rs` corresponds 1:1.
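
To give a flavour of what an op in a CPU f32 oracle looks like, here is a minimal sketch of two of the kernels named in the diagram, `rmsnorm` and `softcap`, plus a helper for comparing oracle output against a GPU readback. These are textbook formulations written for illustration, not copies of the crate's code; whether the norm weight is applied as `w` or `1 + w`, and the values of `eps` and `cap`, are assumptions that depend on the checkpoint.

```rust
/// RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight.
fn rmsnorm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Logit soft-capping: squashes values smoothly into (-cap, cap).
fn softcap(logits: &mut [f32], cap: f32) {
    for l in logits.iter_mut() {
        *l = cap * (*l / cap).tanh();
    }
}

/// Largest element-wise difference between two outputs,
/// e.g. CPU oracle vs. a buffer read back from the GPU.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}
```

A parity check then reduces to running the same input through both paths and asserting that `max_abs_diff` stays under a chosen tolerance (or is exactly zero where bit-identical output is expected).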

## Performance

Measurements as of M15:

| Target                   | Steady-state tok/s (gen) | Notes                                  |
|--------------------------|--------------------------|----------------------------------------|
| iPhone 16e (A18, iOS 26) | **~4.65 tok/s**          | text-only, `max_context=512`           |
| AMD Radeon Pro 555 (Mac) | ~1 tok/s (M7 baseline)   | naive kernels, tiled matmul deferred   |

The architectural foundation (chained encoder, GPU-resident KV cache, per-layer
submits, per-tile range fetch from OPFS) is in place. Inference kernels are
still naive matvec; reaching ≥10 tok/s on both Mac and phone needs tiled
matmul + bind-group caching + kernel fusion (the M8 line on the roadmap).

The iPhone A18 advertises 1 GiB for both `max_buffer_size` and
`max_storage_buffer_binding_size`, which is four times the WebGPU default for
the former (256 MiB) and eight times the default for the latter (128 MiB), so
there's real headroom for fewer/larger weight buffers (currently 455 of
them resident, see M15 follow-ups).
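
To make the "fewer/larger buffers" idea concrete, here is a hypothetical sketch of greedily packing tensors, in order, into arenas no larger than the adapter's `max_buffer_size`. Nothing like this is confirmed to exist in the crate; `pack_tensors` and the alignment parameter are assumptions (256 bytes is the WebGPU default for `minStorageBufferOffsetAlignment`).

```rust
/// Greedily pack tensors (in the order given) into buffers of at most
/// `max_buffer_size` bytes, padding each tensor up to `align`.
/// Returns the byte size of every resulting buffer.
fn pack_tensors(sizes: &[u64], max_buffer_size: u64, align: u64) -> Result<Vec<u64>, String> {
    assert!(align > 0, "alignment must be non-zero");
    let mut buffers = Vec::new();
    let mut current = 0u64;
    for (i, &size) in sizes.iter().enumerate() {
        let padded = size.div_ceil(align) * align;
        if padded > max_buffer_size {
            return Err(format!(
                "tensor {i} needs {padded} bytes, above the {max_buffer_size}-byte limit"
            ));
        }
        if current + padded > max_buffer_size {
            buffers.push(current); // close the current arena
            current = 0;
        }
        current += padded;
    }
    if current > 0 {
        buffers.push(current);
    }
    Ok(buffers)
}
```

In a real engine each tensor would also need to record its arena index and offset, and any single binding would still have to respect `max_storage_buffer_binding_size`; the sketch only shows how the buffer count falls as the size limit rises.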

Other capability notes captured during iPhone validation:
- `shader-f16` ✓ — packed FP16 MADs engage on A18.
- `timestamp-query` ✓ — Pro 555 doesn't expose this; could wire GPU-side
  per-pass timing.
- `subgroups` ✗ — A18 has SIMDgroup hardware but Safari's WebGPU doesn't
  surface WGSL subgroup ops yet. Vision attention falls through to the
  no-subgroup HPD-f16 kernel automatically.

## Layout

```
crates/rullama/
├── src/
│   ├── api.rs                    # JS-facing Model: load / loadFromUrl / loadFromOpfs[TextOnly]
│   ├── backend/
│   │   ├── context.rs            # WgpuCtx (device, queue, adapter limits)
│   │   ├── dispatch.rs           # cached + chained kernel dispatchers (incl. backward + Adam)
│   │   ├── pipelines.rs          # one ComputePipeline per kernel (built once)
│   │   ├── weight_cache.rs       # lazy GPU upload, per-tile range fetch on big tensors
│   │   ├── matmul.rs / elementwise.rs / spike.rs    # one-shot dispatchers (parity tests)
│   ├── gguf/
│   │   ├── reader.rs             # GGUF v3 parser (header + tensor descriptors)
│   │   ├── fetcher.rs            # TensorFetcher trait + In-memory / HttpRange / Opfs impls
│   │   ├── tensor.rs             # dequant_tensor_to_f32 / dequant_row_to_f32 (sync + async)
│   │   ├── quant.rs / dtype.rs / value.rs
│   ├── kernels/wgsl/             # 70+ hand-written compute shaders (text + vision + audio + backward)
│   ├── model/config.rs           # Gemma4Config: parses gemma4.* metadata keys
│   ├── multimodal/
│   │   ├── vision.rs             # ViT forward (16 blocks, 768d, ClippableLinear)
│   │   ├── audio.rs              # Conformer forward (12 blocks, 1024d, block-local attention)
│   │   └── audio_features.rs     # WAV → 128-bin log-mel (realfft)
│   ├── reference/
│   │   ├── forward.rs            # CPU f32 forward (parity oracle)
│   │   ├── forward_gpu.rs        # M3-era GPU forward with per-kernel readbacks (oracle)
│   │   ├── forward_chained.rs    # M7 production GPU forward, per-layer submits (M15)
│   │   ├── ops.rs / weights.rs
│   ├── sampling.rs               # temperature, top-k, top-p, rep penalty
│   ├── template/gemma4_small.rs  # chat-template renderer (matches Ollama)
│   └── tokenizer/                # GGUF BPE tokenizer (Ollama-bit-exact)
└── examples/
    ├── greedy_parity.rs          # CPU forward greedy vs Ollama
    ├── chained_smoke.rs          # standalone Forward driver
    ├── model_api.rs              # public Model API end-to-end
    ├── vision_parity.rs          # vision tower vs Ollama (M11)
    ├── audio_parity.rs           # audio tower vs Ollama (M13)
    ├── matmul_bench.rs           # native wgpu matmul microbench
    └── inspect.rs / decode_ids.rs / encode_check.rs / list_tensors.rs / …

crates/rullama-finetune/
├── src/
│   ├── shared/                   # vendored config / error / progress types
│   ├── dataset_loader.rs         # JSONL parser + Tokenizer trait
│   ├── lr_schedule.rs            # warmup + linear / cosine / cosine-warm-restarts
│   ├── lora.rs                   # per-LoRA GPU state (A / B), grad buffers
│   ├── scratch.rs                # per-step GPU scratch buffers for backward
│   └── session.rs                # TrainingSession — forward → loss → backward → Adam
└── examples/
    ├── overfit_one.rs            # M3 acceptance test
    ├── train_jsonl.rs            # JSONL dataset trainer
    └── data/echo.jsonl

examples/
├── web/                          # React + Vite + Tailwind + Workbox SW production demo
└── pwa/                          # Vanilla JS bench harness + safaridriver scripts
    ├── index.html / bench.html
    ├── inference-worker.js       # Dedicated Worker — owns Model + sync OPFS handle
    ├── opfs-store.js             # OPFS download + read API (main-thread)
    ├── opfs-writer-worker.js     # streams GGUF → OPFS via SyncAccessHandle.write
    ├── serve.sh                  # dev HTTPS server + /api/log /api/blob endpoints
    ├── run-on-iphone.sh / iphone-session-keeper.sh / clean-iphone.sh
    └── bench-on-iphone.sh

tools/ios-bench/                  # staticlib for Xcode — C-ABI rullama_run_bench
docker/                           # nginx + R2 mirror configs
scripts/                          # ops scripts (model upload, etc.)
```

## License

Dual-licensed under either of:

- Apache License 2.0 ([LICENSE-APACHE](./LICENSE-APACHE))
- MIT License ([LICENSE-MIT](./LICENSE-MIT))

at your option.

Contributions are accepted under the same dual-license terms.