Crate oxillama_wasm

WebAssembly bindings for OxiLLaMa.

Exposes GGUF header and metadata parsing, block dequantization (Q4_0, Q4_K, Q5_K, Q6_K), and full text generation to JavaScript/TypeScript via wasm-bindgen.
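For orientation, the fixed-size part of a GGUF header can be read by hand with a `DataView`. The sketch below assumes the public GGUF layout for version 2 and later (little-endian, 64-bit counts). `readGgufHeader` and its field names are hypothetical and are not the crate's API; the shape of the object returned by `parse_gguf_header` is not documented here.

```javascript
// Fixed-size GGUF header (version 2 and later), little-endian:
//   bytes  0..4   magic "GGUF"
//   bytes  4..8   version (u32)
//   bytes  8..16  tensor count (u64)
//   bytes 16..24  metadata key/value count (u64)
// readGgufHeader and its field names are illustrative assumptions.
function readGgufHeader(bytes) {
  if (bytes.byteLength < 24) {
    throw new Error('buffer too small for a GGUF header');
  }
  const magic = String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3]);
  if (magic !== 'GGUF') {
    throw new Error('bad magic: ' + JSON.stringify(magic));
  }
  // Respect byteOffset so the function also works on subarray views.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: view.getUint32(4, true),
    tensorCount: view.getBigUint64(8, true),      // BigInt
    metadataKvCount: view.getBigUint64(16, true), // BigInt
  };
}

// Build a synthetic 24-byte header to exercise the reader.
const demo = new Uint8Array(24);
demo.set([0x47, 0x47, 0x55, 0x46], 0); // "GGUF"
const demoView = new DataView(demo.buffer);
demoView.setUint32(4, 3, true);
demoView.setBigUint64(8, 291n, true);
demoView.setBigUint64(16, 24n, true);

const header = readGgufHeader(demo);
console.log(header.version, header.tensorCount, header.metadataKvCount);
```

The counts come back as `BigInt` because they are 64-bit on disk; convert with `Number(...)` only when the value is known to be small.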

§Feature flags

Feature	Default	Description
`inference`	yes	Enables `generate()` via `oxillama-runtime` with the pure-Rust `unstable_wasm` tokenizer backend.

§Usage (generate)

import init, { generate } from './oxillama_wasm.js';
await init();

const modelResp = await fetch('model.gguf');
const modelBytes = new Uint8Array(await modelResp.arrayBuffer());
const tokenizerResp = await fetch('tokenizer.json');
const tokenizerJson = await tokenizerResp.text();

// Streaming: pass a callback to receive each token as it is generated.
// Note: process.stdout is a Node.js global; in a browser, append the token to the page instead.
const text = generate(modelBytes, tokenizerJson, "Hello, world!", 128,
    (token) => process.stdout.write(token));
console.log(text);
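The streaming callback can also be kept independent of the host environment by collecting tokens into a buffer. The following is a minimal sketch, assuming the callback receives one string per token as in the example above. `makeTokenCollector` and `fakeGenerate` are hypothetical names; `fakeGenerate` only stands in for the real `generate()` so the snippet runs without a model.

```javascript
// Hypothetical helper: builds a callback that records every streamed token
// and optionally forwards it to a display function (DOM update, console, ...).
function makeTokenCollector(onToken) {
  const tokens = [];
  return {
    callback(token) {
      tokens.push(token);
      if (onToken) onToken(token);
    },
    text() { return tokens.join(''); },
    count() { return tokens.length; },
  };
}

// Stand-in for generate(): emits up to maxTokens canned tokens through the
// callback and returns the concatenated text. Not part of the crate.
function fakeGenerate(prompt, maxTokens, onToken) {
  const canned = [' How', ' can', ' I', ' help', '?'];
  const emitted = canned.slice(0, maxTokens);
  for (const token of emitted) onToken(token);
  return emitted.join('');
}

const collector = makeTokenCollector();
const result = fakeGenerate('Hello, world!', 3, collector.callback);
console.log(collector.count(), JSON.stringify(collector.text()));
```

With the real module, `collector.callback` would be passed as the last argument of `generate(...)` in place of the inline arrow function.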

Re-exports§

pub use service_worker::get_service_worker_script;
pub use service_worker::register_service_worker;
pub use service_worker::ServiceWorkerOptions;
pub use simd_check::get_simd128_status;
pub use streaming_loader::StreamingGgufLoader;
pub use streaming_loader::StreamingLoadOptions;

Modules§

gpu_bridge: Async WebGPU bridge for OxiLLaMa WASM.
idb_cache: IndexedDB model cache for persisting GGUF model bytes across page reloads.
service_worker: Service-worker registration helper for the OxiLLaMa WASM browser build.
simd_check: SIMD128 capability detection for the OxiLLaMa WASM build.
streaming_load: Streaming / chunked GGUF loader.
streaming_loader: Streaming GGUF loader with LRU tensor cache and on-demand byte-range fetching.
webgpu: WebGPU compute bridge for accelerating dequantization in the browser.
worker: Web-worker message-passing helpers for offloaded inference.

Structs§

GgufMetadataJs: Typed GGUF metadata returned by parse_gguf_metadata.
WasmEngine: Opaque handle wrapping a loaded InferenceEngine for use from JS.

Functions§

dequant_q4_0: Dequantize a buffer of Q4_0 blocks to an array of f32 values.
dequant_q4_k: Dequantize a buffer of Q4_K blocks to an array of f32 values.
dequant_q5_k: Dequantize a buffer of Q5_K blocks to an array of f32 values.
dequant_q6_k: Dequantize a buffer of Q6_K blocks to an array of f32 values.
generate: Run full text generation from an in-memory GGUF model.
init: Initialize the WASM module (sets up panic hook for better error messages).
list_tensor_names: Return all tensor names stored in a GGUF file as a JS array of strings.
load_model_from_bytes_with_progress: Load a GGUF model from raw bytes, reporting progress via an optional JS callback.
parse_gguf_header: Parse a GGUF file header from raw bytes and return key metadata as a JS object.
parse_gguf_metadata: Parse a GGUF file and return typed metadata as a JS object.
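As a reference for what Q4_0 dequantization computes, here is a plain-JavaScript sketch assuming the standard GGML Q4_0 block layout: 18 bytes per block, made of a little-endian f16 scale followed by 16 bytes holding 32 packed 4-bit values. `dequantQ4_0Reference` and `halfToFloat` are hypothetical helpers for checking results, not the crate's implementation of `dequant_q4_0`.

```javascript
// Convert an IEEE 754 half-precision bit pattern to a JS number.
function halfToFloat(h) {
  const sign = (h & 0x8000) ? -1 : 1;
  const exp = (h >> 10) & 0x1f;
  const frac = h & 0x03ff;
  if (exp === 0) return sign * Math.pow(2, -14) * (frac / 1024); // zero, subnormal
  if (exp === 0x1f) return frac ? NaN : sign * Infinity;
  return sign * Math.pow(2, exp - 15) * (1 + frac / 1024);
}

const Q4_0_BLOCK_BYTES = 18;  // 2-byte f16 scale + 16 bytes of nibbles
const Q4_0_BLOCK_VALUES = 32; // weights per block

// Assumed layout: byte j of the nibble area holds value j in its low nibble
// and value j + 16 in its high nibble; each is stored with an offset of 8.
function dequantQ4_0Reference(bytes) {
  if (bytes.byteLength % Q4_0_BLOCK_BYTES !== 0) {
    throw new Error('input length is not a multiple of 18 bytes');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const nBlocks = bytes.byteLength / Q4_0_BLOCK_BYTES;
  const out = new Float32Array(nBlocks * Q4_0_BLOCK_VALUES);
  for (let b = 0; b < nBlocks; b++) {
    const base = b * Q4_0_BLOCK_BYTES;
    const scale = halfToFloat(view.getUint16(base, true));
    for (let j = 0; j < 16; j++) {
      const q = bytes[base + 2 + j];
      out[b * Q4_0_BLOCK_VALUES + j] = ((q & 0x0f) - 8) * scale;
      out[b * Q4_0_BLOCK_VALUES + j + 16] = ((q >> 4) - 8) * scale;
    }
  }
  return out;
}

// One synthetic block: scale 0.5 (f16 bits 0x3800), every data byte 0x1F,
// so each low nibble is 15 and each high nibble is 1.
const block = new Uint8Array(Q4_0_BLOCK_BYTES);
new DataView(block.buffer).setUint16(0, 0x3800, true);
block.fill(0x1f, 2);

const values = dequantQ4_0Reference(block);
console.log(values[0], values[16]); // 3.5 -3.5
```

The K-quant formats (Q4_K, Q5_K, Q6_K) use larger super-blocks with per-sub-block scales and are not covered by this sketch.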
