wasmicro
Tiny multilingual transformer inference for the web.
A 93 KB WebAssembly bundle that runs WordPiece + BERT inference in any
JavaScript environment (browser, Node, Cloudflare Workers, Electron) or
natively from Rust. WordPiece token ids match HuggingFace transformers
exactly, and BERT forward outputs match to within f32 round-off (max abs
error 1e-6), on every input we have tested, including Russian, Chinese, and Spanish.
What works today — and what we verified against
| Component | Verified against | Result |
|---|---|---|
| BERT encoder forward | sentence-transformers/all-MiniLM-L6-v2 via HuggingFace transformers | max abs error 1e-6, cosine 1.000000 |
| WordPiece tokenizer | bert-base-multilingual-cased on 8 RU / ZH / ES / EN / mixed cases | 8 / 8 exact id match |
| End-to-end semantic search | 3 queries × 6 documents | 3 / 3 queries rank expected document at top-1 |
| WASM bundle | wasm-opt -Oz on release build | 93 KB |
Reproducible from the wasmicro-verify sub-project — every
claim in this README is backed by a binary that downloads the real model and
compares numbers.
Install
Rust
[]
= "0.2.2"
JavaScript
import init from "wasmicro";
await ;
// Fetch model.safetensors + vocab.txt from your CDN of choice.
const modelBytes = ;
const vocabBytes = ;
const tokenizer = ;
const model = ;
const embedding = model.;
console.log; // 384
The shipped .wasm is 93 KB. Going by the payload sizes below, the engine is
roughly 16× to 215× smaller than common alternatives; the model file is unchanged.
| Runtime | WASM/JS payload |
|---|---|
| wasmicro | 93 KB |
| Candle WASM | 1.5–5 MB |
| transformers.js | ~10 MB |
| ONNX Runtime Web | 8–20 MB |
Quick start (Rust)
use fs;
use ;
Multilingual
The WordPiece tokenizer is Unicode-aware:
- Splits each CJK ideograph into its own token (matches HuggingFace).
- Lowercases via Unicode (char::to_lowercase), which handles Cyrillic, Greek, Latin Extended, etc.
- Recognises Unicode whitespace (NBSP, ideographic space, …).
- Treats Unicode punctuation as its own token (CJK comma, Spanish ¿/¡, French guillemets, …).
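The four rules above can be illustrated with a small pre-tokenization pass, the step that runs before WordPiece splits words into sub-word pieces. This is a minimal sketch and not wasmicro's actual code: `is_cjk`, `is_punct`, and `pre_tokenize` are hypothetical names, the CJK ranges follow the reference BERT tokenizer, and the punctuation check is a deliberately small approximation of the Unicode punctuation categories.

```rust
/// True for code points in the CJK ideograph blocks that the reference BERT
/// tokenizer isolates as single-character tokens.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x4E00..=0x9FFF
            | 0x3400..=0x4DBF
            | 0x20000..=0x2A6DF
            | 0x2A700..=0x2B73F
            | 0x2B740..=0x2B81F
            | 0x2B820..=0x2CEAF
            | 0xF900..=0xFAFF
            | 0x2F800..=0x2FA1F
    )
}

/// Approximate punctuation test (assumption): ASCII punctuation plus a few
/// Unicode ranges. A full implementation would use the Unicode general category.
fn is_punct(c: char) -> bool {
    c.is_ascii_punctuation()
        || matches!(
            c as u32,
            0x00A1 | 0x00AB | 0x00BB | 0x00BF // ¡ « » ¿
                | 0x2010..=0x2027 // dashes, curly quotes, bullets, ellipsis
                | 0x3001..=0x3003 // ideographic comma and full stop, ditto mark
                | 0x3008..=0x3011 // CJK brackets
                | 0xFF01..=0xFF0F // fullwidth ! through /
                | 0xFF1A..=0xFF20 // fullwidth : through @
        )
}

/// Splits text into words: Unicode whitespace separates words, while CJK
/// ideographs and punctuation marks each become a token of their own.
fn pre_tokenize(text: &str, lowercase: bool) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_whitespace() {
            // Covers NBSP (U+00A0) and ideographic space (U+3000) as well.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
        } else if is_cjk(c) || is_punct(c) {
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            words.push(c.to_string());
        } else if lowercase {
            // char::to_lowercase may yield more than one char.
            current.extend(c.to_lowercase());
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

With `lowercase` set to `false`, as for a cased vocabulary, the input `Привет, мир!` yields the words `Привет`, `,`, `мир`, `!`; WordPiece then breaks each word into vocabulary pieces.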
To work with non-English text, use a multilingual vocabulary. Example:
use ;
let vocab = read?;
let tokenizer = from_vocab_bytes_with_options?;
let encoded = tokenizer.encode?;
// -> [CLS] При ##вет , мир ! [SEP]
Accent stripping (NFD + combining-mark removal) is not implemented; pick
a *-cased multilingual vocabulary if your inputs contain accents.
Get a model
# Build the converter once.
# Download all-MiniLM-L6-v2 from the HuggingFace Hub.
# Optional: also write model.i8.safetensors with weight-only int8 quantization.
Resulting directory:
models/mini-lm/
├── model.safetensors (~87 MB, ready for ModelFile::parse)
├── model.i8.safetensors (optional, --quantize i8)
├── config.json
├── vocab.txt (pass to WordPieceTokenizer::from_vocab_bytes)
└── tokenizer.json
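For readers unfamiliar with weight-only int8 quantization, the idea behind the optional model.i8.safetensors file can be sketched as follows. This assumes symmetric per-tensor scaling; the README does not say whether the converter scales per tensor, per row, or per block, so the functions below (`quantize_i8`, `dequantize_i8`) are illustrative names rather than the converter's implementation.

```rust
/// Symmetric per-tensor int8 quantization (assumed scheme):
/// each weight w is stored as q = round(w / scale), with q in [-127, 127].
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // An all-zero tensor would give scale 0; fall back to 1.0 to avoid dividing by zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recovers approximate f32 weights; the error per weight is about scale / 2 at most.
fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Storing one byte per weight instead of four is what shrinks the file; activations stay in f32, which is why the scheme is called weight-only.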
Building from source
# Native: tests, benchmarks, examples.
# WASM bundle (SIMD128 is enabled automatically by .cargo/config.toml).
# Optional: repeatable size report.
# Serve the demo locally.
&&
.cargo/config.toml sets target-feature=+simd128 for wasm32-unknown-unknown,
so every wasm-pack build ships SIMD128 kernels. To target very old browsers
(<2022), pass RUSTFLAGS="-C target-feature=-simd128".
Verification
The wasmicro-verify sibling project is the source of
truth for every numeric claim above.
# 1. Generate HuggingFace reference outputs (Python + transformers).
# 2. Run wasmicro on the same inputs and compare.
Expected outcome — all three exit 0 with detailed per-case reports. CI will
gate releases on these in a future revision.
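The two parity metrics quoted in this README, max abs error and cosine similarity, are simple to compute. The sketch below shows one way to do it; it is not taken from the verifier, and the function names and the choice to accumulate in f64 are assumptions.

```rust
/// Largest element-wise absolute difference between two equally sized vectors.
fn max_abs_error(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// Cosine similarity, accumulated in f64 so the metric itself adds little rounding error.
/// Returns NaN if either vector is all zeros.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let dot: f64 = a.iter().zip(b).map(|(x, y)| *x as f64 * *y as f64).sum();
    let norm_a = a.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| (*y as f64).powi(2)).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}
```

A parity check in this style would pass when `max_abs_error` stays at or below the chosen tolerance (1e-6 in the table above) and the cosine rounds to 1.000000.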
Project layout
wasmicro/
├── src/ # the library (default deps: bytemuck only)
│ ├── lib.rs
│ ├── tensor.rs # owned f32 tensor + inline shape
│ ├── tokenizer.rs # Unicode WordPiece tokenizer
│ ├── quant.rs # i8, u8 affine, q4 packed quantized tensors
│ ├── loader.rs # safetensors parser (no serde)
│ ├── error.rs
│ ├── ops/ # matmul (+SIMD128), attention, layernorm, …
│ ├── models/
│ │ └── bert.rs # BertModel + forward + from_safetensors
│ └── wasm.rs # wasm-bindgen surface (feature = "wasm")
├── tools/
│ ├── wasmicro-convert/ # CLI to download, validate, quantize HF models
│ └── measure-size.ps1 # WASM/npm size report
├── tests/ # integration tests via the public API
├── examples/ # runnable demos
├── demo/ # static site deployed to GitHub Pages
├── .cargo/config.toml # enables SIMD128 by default for wasm32
└── .github/workflows/ # CI + Pages deploy
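The layout lists a safetensors parser written without serde. That is feasible partly because the container format is simple: an 8-byte little-endian header length, a UTF-8 JSON header, then the raw tensor bytes. The sketch below shows only that first split, using a hypothetical `split_safetensors` function; it is not the code in loader.rs and it does not parse the JSON itself.

```rust
/// Splits a safetensors file into its JSON header text and its tensor data section.
/// Borrows from the input slice, so nothing is copied except the 8-byte length prefix.
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), &'static str> {
    if bytes.len() < 8 {
        return Err("file is shorter than the 8-byte length prefix");
    }
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(prefix);

    let rest = &bytes[8..];
    if header_len > rest.len() as u64 {
        return Err("header extends past the end of the file");
    }
    // The cast cannot truncate: header_len was just checked against a usize length.
    let (header, data) = rest.split_at(header_len as usize);
    let header = std::str::from_utf8(header).map_err(|_| "header is not valid UTF-8")?;
    Ok((header, data))
}
```

The header maps tensor names to dtype, shape, and byte offsets into the data section, so a loader can hand out tensor bytes by slicing `data` directly.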
Design rules
These are non-negotiable. Code that breaks them gets reverted.
- Tiny WASM bundle. Current: 93 KB. Cap: 250 KB after wasm-opt -Oz.
- Forward only. No autograd, no optimizers, no training state.
- Owned tensors. Vec<f32>. No Rc, no RefCell.
- Minimal dependencies. The library's default build pulls in only bytemuck. No ndarray, no candle, no rayon, no serde_json, no chrono. The wasmicro-convert CLI is a separate crate with its own deps (hf-hub, etc.) and never ships in the WASM.
- The host owns bytes. ModelFile::parse(&[u8]): same code path for files, fetches, mmap, or ArrayBuffer.
- Ops are free functions. Layers are functions, not objects.
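To make the "owned tensors" and "ops are free functions" rules concrete, here is a minimal sketch of what that style looks like. The `Tensor` struct and the `add_bias`, `relu`, and `bias_relu` functions are hypothetical and written for illustration; they are not wasmicro's real types or signatures.

```rust
/// Hypothetical owned tensor: a flat Vec<f32> plus a fixed-size inline shape.
/// No Rc, no RefCell: whoever holds the value owns the data.
#[derive(Debug, Clone, PartialEq)]
struct Tensor {
    data: Vec<f32>,
    shape: [usize; 4], // unused trailing dimensions are set to 1
    rank: usize,
}

impl Tensor {
    fn from_vec(data: Vec<f32>, dims: &[usize]) -> Tensor {
        assert!(dims.len() <= 4, "at most 4 dimensions");
        assert_eq!(data.len(), dims.iter().product::<usize>(), "shape/data mismatch");
        let mut shape = [1; 4];
        shape[..dims.len()].copy_from_slice(dims);
        Tensor { data, shape, rank: dims.len() }
    }
}

/// Op as a free function: borrows its input and returns a new owned tensor.
fn add_bias(x: &Tensor, bias: &[f32]) -> Tensor {
    assert!(x.rank >= 1, "need at least one dimension");
    let last = x.shape[x.rank - 1];
    assert_eq!(bias.len(), last, "bias must match the last dimension");
    let data = x.data.iter().enumerate().map(|(i, v)| v + bias[i % last]).collect();
    Tensor { data, shape: x.shape, rank: x.rank }
}

fn relu(x: &Tensor) -> Tensor {
    let data = x.data.iter().map(|v| v.max(0.0)).collect();
    Tensor { data, shape: x.shape, rank: x.rank }
}

/// A "layer" is just a function that composes ops; parameters are passed in, not stored.
fn bias_relu(x: &Tensor, bias: &[f32]) -> Tensor {
    relu(&add_bias(x, bias))
}
```

Because every function borrows its inputs and returns a fresh value, there is no hidden shared state to track, at the cost of one allocation per op in this naive form.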
Honest limitations
- Only the BERT encoder architecture is supported. No GPT, T5, Whisper, ViT, CLIP, or any decoder/encoder-decoder model yet.
- No accent stripping (NFD + mark removal). Use *-cased multilingual vocabularies if your inputs include accents.
- No batching. Encoding multiple sentences runs them sequentially.
- CPU only. No WebGPU backend; matmul uses naive ikj with WASM SIMD128 inner kernels. Production-scale throughput is not the target.
- No zero-config import. You must download the model, copy vocab.txt, and pass the config fields explicitly. Higher-level pipelines (à la pipeline('feature-extraction', '...')) are not provided.
If any of these matter for your use case, prefer transformers.js or Candle — they are far more feature-complete.
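For context on the "naive ikj" matmul mentioned in the limitations, the loop order can be sketched in plain scalar Rust. This is an illustration of the technique only: `matmul_ikj` is a hypothetical name, the row-major layout is an assumption, and the SIMD128 inner kernel is left out.

```rust
/// C (m×n) = A (m×k) · B (k×n), all row-major, using the i-k-j loop order.
fn matmul_ikj(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must hold m*k elements");
    assert_eq!(b.len(), k * n, "B must hold k*n elements");
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        let c_row = &mut c[i * n..(i + 1) * n];
        for p in 0..k {
            let a_ip = a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            // The innermost loop walks one row of B and one row of C contiguously.
            // This is the loop a SIMD kernel would process several lanes at a time.
            for (c_ij, b_pj) in c_row.iter_mut().zip(b_row) {
                *c_ij += a_ip * b_pj;
            }
        }
    }
    c
}
```

Compared with the textbook i-j-k order, i-k-j avoids striding down a column of B in the inner loop, which is friendlier to caches and to vectorization, but it still does no blocking or tiling.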
Roadmap
- Project skeleton
- Plain tensor + inline shape
- Forward ops: matmul, linear, embedding, softmax, layernorm, GELU/SiLU/ReLU
- safetensors loader with no serde
- Multi-head attention + mean pooling
- BERT encoder forward + from_safetensors
- Numerical parity with HuggingFace on all-MiniLM-L6-v2 (1e-6)
- HuggingFace → wasmicro converter CLI
- WordPiece tokenizer with Unicode awareness (CJK split, Unicode case)
- Multilingual parity test against bert-base-multilingual-cased (8/8)
- Weight-only quantized linear ops: i8, affine u8, packed q4
- Quantized BERT linear loading (i8, u8/q8)
- Converter quantization pipeline (--quantize i8)
- WASM SIMD128 kernels for matmul and matmul_t_b
- End-to-end semantic-search verifier (text → embedding → ranking)
- CI + GitHub Pages deploy workflow
- WASM demo page
- Published to crates.io and npm
- Live demo with a downloadable model bundle on GitHub Pages
- NFD accent-stripping path for uncased multilingual vocabularies
- Zero-config import: wasmicro::embed("text") with auto-fetch of HF assets
- Browser benchmark numbers (tokens/s on M-series, mid-tier x86, Android)
- GPT-2 + KV-cache
- WebGPU backend
License
MIT OR Apache-2.0