wasmicro
Tiny transformer inference for the web. One file. No build step.
A 199 KB WebAssembly bundle that runs BERT, GPT-2, and T5 inference in any
JavaScript environment — browser, Node, Cloudflare Workers, Electron — or
natively from Rust. Model type is auto-detected from config.json; no
hardcoded parameters required.
BERT outputs match HuggingFace transformers to within f32 round-off on every
input tested; GPT-2 and T5 are currently covered by smoke tests (see Verification).
What works today — verified against real HuggingFace checkpoints
| Component | Checkpoint | Result |
|---|---|---|
| BERT encoder + pooling | bert-base-uncased | cosine 1.000000, max\|Δ\| 8.3 × 10⁻⁷ |
| End-to-end semantic search | bert-base-uncased | 3 / 3 queries rank expected doc at top-1 |
| GPT-2 generation | openai-community/gpt2 | loads & generates, model_type detection OK |
| T5 encoder + decoder | google-t5/t5-small | encoder shape [seq, 512], all values finite |
| WASM bundle | wasm-opt -Oz on release | 199 KB |
All claims are reproducible from wasmicro-verify — a sibling
project that downloads the real models and compares numbers against PyTorch.
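The two BERT metrics in the table, cosine similarity and max|Δ|, can be computed as in the sketch below. The function names are illustrative and are not taken from wasmicro-verify; the sketch only assumes two equal-length f32 vectors, one from wasmicro and one from the PyTorch reference.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    // Accumulate in f64 so the metric itself adds little round-off.
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as f64 * y as f64;
        na += x as f64 * x as f64;
        nb += y as f64 * y as f64;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    (dot / (na.sqrt() * nb.sqrt())) as f32
}

/// Largest element-wise absolute difference, max|Δ|.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max)
}
```

A cosine near 1 only says the two vectors point the same way; max|Δ| is the stricter check because it bounds the error of every individual element.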
Install
Rust
[dependencies]
wasmicro = "0.3.0"
JavaScript / npm
Quick start
JavaScript — unified WasmPipeline API
import init, { WasmPipeline } from "wasmicro";
await init();
// Read four files from disk / fetch from CDN.
const model = ;
const tokenizer = ; // or vocab.json for GPT-2
const config = await .;
// merges.txt is required for GPT-2/T5, pass null for BERT.
const merges = null;
const pipeline = ;
// ── BERT (embedding / semantic search) ────────────────────────────────────────
const emb = pipeline.; // Float32Array [768]
const batch = pipeline.; // Float32Array [2×768]
// ── GPT-2 / T5 (text generation) ──────────────────────────────────────────────
const text = pipeline.;
console.log(text);
// Detect model type without loading:
console.log; // "bert" | "gpt2" | "t5" | …
WasmPipeline.fromBytes auto-detects model type from config.json and
selects the right tokenizer and architecture automatically.
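How detection from config.json might work without a JSON library is sketched below. This is not wasmicro's implementation: `detect_model_type`, `family_for`, and `Family` are hypothetical names, and the scan is deliberately naive (it takes the first `"model_type"` key it finds and does not handle escaped quotes). The grouping of model types follows the Supported models table.

```rust
/// Naive scan for the first `"model_type": "<value>"` pair in a config.json string.
/// Assumption: a real implementation would use a proper JSON tokenizer.
fn detect_model_type(config_json: &str) -> Option<&str> {
    let key = "\"model_type\"";
    let start = config_json.find(key)? + key.len();
    // Skip whitespace, expect a colon, skip whitespace, expect an opening quote.
    let rest = config_json[start..]
        .trim_start()
        .strip_prefix(':')?
        .trim_start()
        .strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// Architecture families (hypothetical enum for this sketch).
#[derive(Debug, PartialEq)]
enum Family {
    Encoder,
    Decoder,
    EncoderDecoder,
}

/// Map a model_type string to the architecture family that handles it.
fn family_for(model_type: &str) -> Option<Family> {
    match model_type {
        "bert" | "roberta" | "distilbert" | "electra" => Some(Family::Encoder),
        "gpt2" | "gpt_neo" | "gpt_neox" => Some(Family::Decoder),
        "t5" | "mt5" | "longt5" => Some(Family::EncoderDecoder),
        _ => None,
    }
}
```

Returning `None` for an unknown model type lets the caller report an unsupported model instead of guessing an architecture.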
Rust — Pipeline::from_bytes
use std::fs;
use wasmicro::Pipeline;
// ── BERT embedding ─────────────────────────────────────────────────────────────
let model_bytes = read?;
let vocab_bytes = read?;
let config_json = read_to_string?;
let pipeline = from_bytes?;
let embedding = pipeline.embed?;
println!; // 768
// ── GPT-2 generation ───────────────────────────────────────────────────────────
let vocab_json = read?;
let merges_bytes = read?;
let config_json = read_to_string?;
let pipeline = from_bytes?;
let text = pipeline.generate?;
println!("{}", text);
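For context on the 768-element embedding: a BERT encoder produces one hidden vector per token, and pooling collapses them into a single vector. The sketch below shows masked mean pooling under the assumption of a row-major [seq, hidden] layout. `mean_pool` is an illustrative name, and this README does not say whether `embed` uses mean or CLS pooling by default.

```rust
/// Mean-pool a row-major [seq, hidden] matrix into one [hidden] vector,
/// averaging only the positions whose attention-mask entry is non-zero.
fn mean_pool(hidden_states: &[f32], mask: &[u8], hidden: usize) -> Vec<f32> {
    assert!(hidden > 0, "hidden size must be non-zero");
    assert_eq!(hidden_states.len(), mask.len() * hidden);
    let mut out = vec![0.0f32; hidden];
    let mut count = 0usize;
    for (row, &m) in hidden_states.chunks_exact(hidden).zip(mask) {
        if m == 0 {
            continue; // padding token: excluded from the average
        }
        count += 1;
        for (o, &v) in out.iter_mut().zip(row) {
            *o += v;
        }
    }
    if count > 0 {
        let inv = 1.0 / count as f32;
        out.iter_mut().for_each(|o| *o *= inv);
    }
    out
}
```

CLS pooling would instead return the first row unchanged, i.e. `hidden_states[..hidden].to_vec()`.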
Lower-level API
The full model APIs are also public for advanced use:
use ;
let file = parse?;
let config = from_config_json?;
let model = from_safetensors?;
let logits = model.logits; // [seq, vocab]
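Given logits of shape [seq, vocab], greedy decoding picks the highest-scoring token from the last row. The sketch below assumes a flat, row-major f32 buffer; `greedy_next_token` is a hypothetical helper, not part of the wasmicro API.

```rust
/// Index of the largest logit in the last row of a row-major [seq, vocab] matrix.
/// Ties resolve to the lowest index.
fn greedy_next_token(logits: &[f32], seq: usize, vocab: usize) -> usize {
    assert!(seq > 0 && vocab > 0, "logits must be non-empty");
    assert_eq!(logits.len(), seq * vocab);
    // Only the final position predicts the next token.
    let last_row = &logits[(seq - 1) * vocab..];
    let mut best = 0;
    for (i, &v) in last_row.iter().enumerate() {
        if v > last_row[best] {
            best = i;
        }
    }
    best
}
```

A generation loop would append the returned ID to the input and call the model again until an end-of-sequence token or a length limit is reached.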
Supported models
| model_type | Architecture | Tokenizer | Methods |
|---|---|---|---|
| bert, roberta, distilbert, electra | Encoder | WordPiece (vocab.txt) | embed, embed_batch |
| gpt2, gpt_neo, gpt_neox | Decoder | Byte-level BPE (vocab.json + merges.txt) | generate |
| t5, mt5, longt5 | Encoder-decoder | Byte-level BPE (vocab.json + merges.txt) | generate, encode_t5 |
Note on T5 tokenization: T5's original tokenizer uses SentencePiece, which
is not yet built into wasmicro. Passing BPE-tokenized IDs works for encoder
shape / value checks; for real T5 generation quality, pre-tokenize with
SentencePiece externally and pass raw input_ids via the lower-level API.
Bundle size vs alternatives
| Runtime | WASM/JS payload |
|---|---|
| wasmicro | 199 KB |
| Candle WASM | 1.5 – 5 MB |
| transformers.js | ~10 MB |
| ONNX Runtime Web | 8 – 20 MB |
By the figures above, wasmicro is roughly 8× to 25× smaller than the next-smallest option (Candle WASM) and roughly 50× smaller than transformers.js, for the same three model families.
Get a model
# Download any HuggingFace model to a local directory.
# Optional: weight-only int8 quantization (reduces model.safetensors size ~4×).
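The ~4× figure follows from storing each weight in one byte instead of four. A minimal sketch of weight-only int8 quantization is shown below, assuming a symmetric per-tensor scale. The scheme the converter actually uses is not specified here, and `quantize_i8` and `dequantize_i8` are hypothetical names.

```rust
/// Symmetric per-tensor int8 quantization: w ≈ scale * q, with q in [-127, 127].
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // An all-zero tensor gets scale 1.0 to avoid dividing by zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recover approximate f32 weights from the quantized values.
fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

The round-trip error per weight is at most about half the scale, which is why per-tensor int8 tends to work for weights whose values are not dominated by a few outliers.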
Building from source
# Native: tests and benchmarks.
# WASM bundle.
# Serve the demo locally.
&&
.cargo/config.toml sets target-feature=+simd128 for wasm32-unknown-unknown.
To target older browsers (<2022), pass RUSTFLAGS="-C target-feature=-simd128".
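The SIMD128 flag matters mostly for inner loops such as matmul. As an illustration, the i-k-j loop order named under Limitations can be sketched as below. This is a scalar sketch with an assumed row-major layout and a hypothetical function name, not the crate's kernel.

```rust
/// C = A × B for row-major A [m, k], B [k, n], C [m, n], using the i-k-j loop order.
fn matmul_ikj(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        let c_row = &mut c[i * n..(i + 1) * n];
        for p in 0..k {
            let a_ip = a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            // Innermost loop walks two contiguous rows, which is the
            // natural place for a SIMD kernel.
            for (c_ij, &b_pj) in c_row.iter_mut().zip(b_row) {
                *c_ij += a_ip * b_pj;
            }
        }
    }
    c
}
```

Compared with the textbook i-j-k order, the innermost loop here never strides down a column of B, so memory access stays sequential.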
Verification
The wasmicro-verify sibling project is the source of
truth for every numeric claim in this README.
# Optionally generate Python/PyTorch reference outputs first:
# Run all verifiers:
All four verifiers exit 0. The BERT verifier compares hidden-state values numerically;
GPT-2 and T5 perform smoke tests (correct shapes, finite values, non-empty output).
Full numerical comparison is enabled when the corresponding Python reference file
(expected_gpt2.json / expected_t5.json) is present.
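A smoke test of the kind described above (correct shape, finite values) could look like the following sketch. `smoke_check` is an illustrative name and is not taken from wasmicro-verify; the output is assumed to be a flat, row-major [seq, hidden] buffer.

```rust
/// Smoke test for a row-major [seq, hidden] output: right size, all values finite.
fn smoke_check(values: &[f32], seq: usize, hidden: usize) -> Result<(), String> {
    if values.len() != seq * hidden {
        return Err(format!(
            "expected {} values, got {}",
            seq * hidden,
            values.len()
        ));
    }
    // NaN or infinity anywhere usually indicates a broken op or bad weights.
    if let Some(i) = values.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite value at index {}", i));
    }
    Ok(())
}
```

A check like this catches crashes, shape mismatches, and NaN propagation, but it cannot detect values that are finite and wrong, which is why the numerical comparison against a reference file is the stronger test.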
Project layout
wasmicro/
├── src/
│ ├── lib.rs # public re-exports
│ ├── tensor.rs # owned f32 tensor with inline shape
│ ├── tokenizer.rs # Unicode WordPiece tokenizer
│ ├── tokenizer/bpe.rs # byte-level BPE tokenizer (GPT-2/RoBERTa)
│ ├── quant.rs # i8, u8 affine, q4 packed quantized tensors
│ ├── loader.rs # zero-copy safetensors parser (no serde)
│ ├── error.rs
│ ├── pipeline.rs # Pipeline::from_bytes — unified entry point
│ ├── ops/ # free-function ops: matmul, attention, layernorm …
│ ├── models/
│ │ ├── bert.rs # BERT encoder + mean/CLS pooling
│ │ ├── gpt2.rs # GPT-2 decoder + greedy generation
│ │ └── t5.rs # T5 encoder-decoder + greedy generation
│ └── wasm.rs # wasm-bindgen surface (feature = "wasm")
├── tools/
│ └── wasmicro-convert/ # CLI to download + quantize HF models
├── demo/ # static demo site (GitHub Pages)
└── ../wasmicro-verify/ # numeric verification harness
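The loader can avoid copying because of how a safetensors file is laid out: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. The sketch below splits a buffer along those boundaries. `split_safetensors` is a hypothetical helper rather than the API of loader.rs, and it leaves parsing the JSON header to the caller.

```rust
use std::convert::{TryFrom, TryInto};

/// Split a safetensors buffer into its JSON header and raw tensor-data region
/// without copying. Returns None if the buffer is truncated or the header is
/// not valid UTF-8.
fn split_safetensors(bytes: &[u8]) -> Option<(&str, &[u8])> {
    // The first 8 bytes hold the header length as a little-endian u64.
    let len_bytes: [u8; 8] = bytes.get(..8)?.try_into().ok()?;
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes)).ok()?;
    let header_end = 8usize.checked_add(header_len)?;
    let header = std::str::from_utf8(bytes.get(8..header_end)?).ok()?;
    // Everything after the header is tensor data, addressed by byte offsets
    // that the header records for each tensor.
    Some((header, &bytes[header_end..]))
}
```

Both returned slices borrow from the input, so the host's buffer is the only copy of the model weights in memory.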
Design rules
These are non-negotiable. Code that violates them gets reverted.
- Tiny WASM bundle. Current: 199 KB. Hard cap: 250 KB after wasm-opt -Oz.
- Forward only. No autograd, no optimizers, no training state.
- Owned tensors. Vec<f32>. No Rc, no RefCell, no Arc, no Mutex.
- Minimal dependencies. Default build pulls in only bytemuck. No ndarray, no candle, no rayon, no serde_json, no chrono.
- The host owns bytes. Pipeline::from_bytes(&[u8], …): same code path for disk files, HTTP fetches, mmap, or JS ArrayBuffer.
- Ops are free functions. Layers are functions, not objects. No dynamic dispatch.
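To illustrate the "owned tensors" and "ops are free functions" rules, here is a minimal sketch of what such code can look like. The `Tensor` struct and `layer_norm` signature are assumptions made for this example (fixed rank 2, for brevity) and do not mirror the actual types in tensor.rs or ops/.

```rust
/// Owned f32 tensor: the data lives in a Vec, the shape is stored inline.
struct Tensor {
    data: Vec<f32>,
    shape: [usize; 2], // [rows, cols]; fixed rank keeps the sketch short
}

/// Layer normalization over the last dimension, written as a free function.
fn layer_norm(x: &Tensor, gamma: &[f32], beta: &[f32], eps: f32) -> Tensor {
    let [rows, cols] = x.shape;
    assert!(cols > 0, "last dimension must be non-zero");
    assert_eq!(x.data.len(), rows * cols);
    assert!(gamma.len() == cols && beta.len() == cols);
    let mut out = Vec::with_capacity(x.data.len());
    for row in x.data.chunks_exact(cols) {
        let mean = row.iter().sum::<f32>() / cols as f32;
        let var = row.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / cols as f32;
        let inv_std = 1.0 / (var + eps).sqrt();
        for (i, v) in row.iter().enumerate() {
            out.push((v - mean) * inv_std * gamma[i] + beta[i]);
        }
    }
    Tensor { data: out, shape: x.shape }
}
```

The op borrows its inputs and returns a new owned tensor, so no shared ownership or interior mutability is needed, and every call resolves statically.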
Limitations
- No KV-cache. GPT-2 generation re-runs the full forward pass for each new token, which is O(n²) in sequence length. Fast enough for short prompts; add a cache if you need long continuations.
- T5 tokenizer. T5's native SentencePiece tokenizer is not yet built in. BPE-tokenized IDs work for encoder shape/value tests; real task-prefix generation requires external SentencePiece tokenization.
- No accent stripping. NFD + combining-mark removal is not implemented. Use *-cased multilingual BERT vocabularies for accented inputs.
- CPU only. No WebGPU backend. Matmul uses a naive ikj loop with optional WASM SIMD128 inner kernels. Not designed for production throughput.
- No streaming. generate() returns the full string only after all tokens are produced.
If these matter for your use case, prefer transformers.js or Candle — they are more feature-complete.
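The cost of the missing KV-cache can be made concrete by counting how many token positions pass through the model during generation. The sketch below is a back-of-the-envelope model with hypothetical function names. It counts positions only, not FLOPs, so attention cost adds to the gap rather than shrinking it.

```rust
/// Token positions run through the model when generating `new_tokens` tokens
/// from a `prompt_len`-token prompt, re-running the full sequence at each step.
fn positions_without_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    // Step 0 sees the prompt, step 1 sees prompt + 1 token, and so on.
    (0..new_tokens).map(|step| prompt_len + step).sum()
}

/// Same count when past keys and values are cached: the prompt is processed
/// once, then each later step processes a single new position.
fn positions_with_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    if new_tokens == 0 {
        0
    } else {
        prompt_len + (new_tokens - 1)
    }
}
```

For a 16-token prompt and 256 generated tokens this gives 36,736 positions without a cache versus 271 with one, which is why the gap grows quickly for long continuations.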
Roadmap
- Tensor engine + safetensors loader (no serde)
- WordPiece tokenizer (Unicode-aware: CJK, Cyrillic, accents)
- Byte-level BPE tokenizer (GPT-2 / RoBERTa compatible)
- BERT encoder forward + from_config_json auto-detection
- Numerical parity with HuggingFace BERT (1e-6 max abs error)
- Embed batch + end-to-end semantic search verifier
- Weight-only quantization: i8, affine u8, packed q4
- WASM SIMD128 kernels for matmul
- GPT-2 decoder + greedy generation (verified on openai-community/gpt2)
- T5 encoder-decoder + greedy generation (verified on google-t5/t5-small)
- Unified Pipeline::from_bytes API (auto-detects model type)
- WasmPipeline JS class: single entry point for all model families
- Published to crates.io and npm
- KV-cache for GPT-2 / GPT-Neo (5–10× generation speedup)
- SentencePiece tokenizer (for T5 task-prefix generation)
- SIMD128 matmul tiling (fill the vector units)
- WebGPU backend
- Zero-config import: wasmicro::embed("text") with HF asset auto-fetch
- Live demo with downloadable model bundle on GitHub Pages
- Browser benchmark: tokens/s on M-series, x86, Android
License
MIT OR Apache-2.0