# oxillama-wasm

WebAssembly bindings for OxiLLaMa: GGUF parsing and LLM inference in the browser.

Part of the OxiLLaMa workspace, a pure-Rust LLM inference engine.
## What It Provides

- GGUF metadata and tensor catalogue parsing from a `Uint8Array` in the browser
- Typed metadata export via `parseGgufMetadata()`, which returns a `GgufMetadataJs` object (`arch`, `context_length`, `embedding_length`, etc.)
- Full text generation via `oxillama-runtime` (behind the `inference` feature), with an optional per-token `onToken` callback
- Dequantization bindings for Q4_0 and the K-quants: `dequantQ4_0`, `dequantQ4K`, `dequantQ5K`, `dequantQ6K`
- Model loading with progress callbacks: `loadModelFromBytesWithProgress(bytes, onProgress)`
- WebGPU async bridge: `initWebGpuDevice()`, `webgpuDequantQ4_0Async()`, `webgpuGemvAsync()`
- IndexedDB model cache: `cacheModel()`, `loadCachedModel()`, `listCachedModels()`, `deleteCachedModel()`
- Streaming GGUF loading via `GgufChunkLoader` for incremental byte feeds
- Web-worker message-passing API: `parseWorkerMessage()` / `workerTokenEvent()`
- Pure-Rust tokenizer backend (`fancy-regex`, no Oniguruma C library), safe for `wasm32-unknown-unknown`
- No rayon threads: execution is single-threaded and browser-compatible, while the WebAssembly SIMD128 proposal is enabled at compile time
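
Tensor catalogue parsing starts from GGUF's fixed 24-byte header. For readers who want to sanity-check a file before handing it to the wasm module, here is a plain-JavaScript sketch of that first step. It assumes GGUF v2 or later: a 4-byte magic `GGUF`, a little-endian `u32` version, then a `u64` tensor count and a `u64` metadata key/value count (v1 used 32-bit counts). `readGgufHeader` is a hypothetical helper, not an `oxillama-wasm` export.

```javascript
// Hypothetical helper (not exported by oxillama-wasm): reads the fixed
// GGUF v2+/v3 header from the start of a Uint8Array.
function readGgufHeader(bytes) {
  if (bytes.length < 24) {
    throw new Error("need at least 24 bytes for a GGUF header");
  }
  // Respect byteOffset so subarrays of a larger buffer work too.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // Magic is the ASCII string "GGUF" (0x47 0x47 0x55 0x46).
  const magic = String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3]);
  if (magic !== "GGUF") {
    throw new Error(`not a GGUF file (magic = ${JSON.stringify(magic)})`);
  }

  const version = view.getUint32(4, true); // little-endian u32
  if (version < 2) {
    throw new Error(`GGUF v${version} uses 32-bit counts; not handled here`);
  }

  return {
    version,
    // u64 fields are returned as BigInt so no precision is lost.
    tensorCount: view.getBigUint64(8, true),
    metadataKvCount: view.getBigUint64(16, true),
  };
}
```

Calling it on the `bytes` obtained from `fetch` rejects non-GGUF input early, before any wasm call is made.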
## Status

Version: 0.1.2. Tests: 51 passing.
## Feature Flags

| Feature | Default | Description |
|---|---|---|
| `inference` | yes | Include `oxillama-runtime` for full generation |
| `console_error_panic_hook` | yes | Pretty panic messages in the browser console |
## Build

```bash
# Install wasm-pack once
cargo install wasm-pack

# Build for the browser (ES module)
wasm-pack build --target web

# Output lands in pkg/:
#   oxillama_wasm.js       JS glue
#   oxillama_wasm_bg.wasm  compiled WebAssembly
```
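
Because the SIMD128 proposal is enabled at compile time, the built `oxillama_wasm_bg.wasm` is likely to fail to compile in an engine without WebAssembly SIMD support. One way to check before calling `init()` is to ask `WebAssembly.validate` about a tiny module that uses SIMD instructions. This is a sketch under that assumption; `SIMD_PROBE` and `hasWasmSimd` are hypothetical names, not part of the package.

```javascript
// Minimal module with one function of type () -> v128 whose body uses
// SIMD opcodes (i8x16.splat, i8x16.popcnt). Engines without SIMD reject it.
const SIMD_PROBE = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm", version 1
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7b, // type section: () -> v128
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x0a, 0x0a, 0x01, 0x08, 0x00, // code section: one 8-byte body, no locals
  0x41, 0x00, // i32.const 0
  0xfd, 0x0f, // i8x16.splat
  0xfd, 0x62, // i8x16.popcnt
  0x0b, // end
]);

// Returns true when the engine accepts SIMD instructions.
function hasWasmSimd() {
  return typeof WebAssembly === "object" && WebAssembly.validate(SIMD_PROBE);
}
```

`WebAssembly.validate` only checks the bytes and does not instantiate anything, so the probe is cheap; a page could show a fallback message when it returns `false` instead of attempting to load the module.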
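## Usage (JavaScript)

```javascript
import init, {
  parseGgufMetadata,
  loadModelFromBytesWithProgress,
  cacheModel,
  loadCachedModel,
  initWebGpuDevice,
  webgpuGemvAsync,
} from "./pkg/oxillama_wasm.js";

// Initialise the wasm module before calling any export
await init();

// Fetch and parse GGUF metadata (no weights loaded)
const resp = await fetch("model.gguf"); // placeholder URL
const bytes = new Uint8Array(await resp.arrayBuffer());
const meta = parseGgufMetadata(bytes);
console.log(meta.arch, meta.context_length);

// Load model with progress callback (requires `inference` feature)
const engine = await loadModelFromBytesWithProgress(bytes, (progress) =>
  console.log("loading:", progress),
);

// Generate text with per-token streaming
// (method name and options shape are assumptions; see pkg/oxillama_wasm.d.ts)
const output = engine.generate("Once upon a time", {
  onToken: (token) => console.log(token),
});
console.log(output);

// IndexedDB cache: persist across reloads (argument shapes assumed)
await cacheModel("model.gguf", bytes);
const cached = await loadCachedModel("model.gguf");

// WebGPU acceleration (where supported; arguments elided)
await initWebGpuDevice();
const result = await webgpuGemvAsync(/* … */);
```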
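
The `dequantQ4_0` and `webgpuDequantQ4_0Async` bindings work on Q4_0 tensor data, and a plain-JavaScript reference decoder can help when spot-checking their output. The sketch below assumes the standard GGML Q4_0 layout stored in GGUF files: blocks of 32 weights, each block being a little-endian fp16 scale `d` followed by 16 bytes of packed 4-bit values, with weight = (nibble - 8) * d. Byte `j` holds weight `j` in its low nibble and weight `j + 16` in its high nibble. `halfToFloat` and `dequantQ4_0Reference` are hypothetical names, not package exports.

```javascript
// Hypothetical CPU reference decoder for GGML Q4_0 blocks (not an export
// of oxillama-wasm). Assumed block layout: fp16 scale + 16 packed bytes.
const QK4_0 = 32; // weights per block
const Q4_0_BLOCK_BYTES = 2 + QK4_0 / 2; // 18 bytes per block

// Convert an IEEE 754 half-precision bit pattern to a JS number.
function halfToFloat(h) {
  const sign = h & 0x8000 ? -1 : 1;
  const exp = (h >> 10) & 0x1f;
  const frac = h & 0x3ff;
  if (exp === 0) return sign * frac * 2 ** -24; // zero or subnormal
  if (exp === 0x1f) return frac ? NaN : sign * Infinity;
  return sign * (1 + frac / 1024) * 2 ** (exp - 15);
}

// Dequantize a whole number of Q4_0 blocks into a Float32Array.
function dequantQ4_0Reference(bytes) {
  if (bytes.length % Q4_0_BLOCK_BYTES !== 0) {
    throw new Error("input length is not a whole number of Q4_0 blocks");
  }
  const nBlocks = bytes.length / Q4_0_BLOCK_BYTES;
  const out = new Float32Array(nBlocks * QK4_0);
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  for (let b = 0; b < nBlocks; b++) {
    const base = b * Q4_0_BLOCK_BYTES;
    const d = halfToFloat(view.getUint16(base, true)); // per-block scale
    for (let j = 0; j < QK4_0 / 2; j++) {
      const q = bytes[base + 2 + j];
      // Low nibble -> first half of the block, high nibble -> second half;
      // subtracting 8 recentres the unsigned 4-bit value on zero.
      out[b * QK4_0 + j] = ((q & 0x0f) - 8) * d;
      out[b * QK4_0 + j + QK4_0 / 2] = ((q >> 4) - 8) * d;
    }
  }
  return out;
}
```

Comparing this decoder's output element-wise with whatever the wasm or WebGPU binding returns (allowing a small tolerance on the GPU path) is one way to catch block-layout mismatches.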
## License
Apache-2.0 — COOLJAPAN OU (Team Kitasan)