wasmicro

Tiny multilingual transformer inference for the web.

A 93 KB WebAssembly bundle that runs WordPiece + BERT inference in any JavaScript environment (browser, Node, Cloudflare Workers, Electron) or natively from Rust. On every input we have tested, including Russian, Chinese, and Spanish, WordPiece token ids match HuggingFace transformers exactly and BERT forward outputs match to within f32 round-off (max abs error 1e-6).

What works today — and what we verified against

| Component | Verified against | Result |
| --- | --- | --- |
| BERT encoder forward | `sentence-transformers/all-MiniLM-L6-v2` via HuggingFace `transformers` | max abs error 1e-6, cosine 1.000000 |
| WordPiece tokenizer | `bert-base-multilingual-cased` on 8 RU / ZH / ES / EN / mixed cases | 8 / 8 exact id match |
| End-to-end semantic search | 3 queries × 6 documents | 3 / 3 queries rank expected document at top-1 |
| WASM bundle | `wasm-opt -Oz` on release build | 93 KB |

All results are reproducible from the wasmicro-verify sibling project: every claim in this README is backed by a binary that downloads the real model and compares numbers.
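For readers unfamiliar with the two parity metrics in the table, here is a minimal sketch of how a maximum absolute error and a cosine similarity can be computed between a reference vector and a candidate vector. The function names and the sample values are hypothetical; this is not the code used by wasmicro-verify.

```rust
/// Largest element-wise absolute difference between two equal-length vectors.
fn max_abs_error(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0_f32, f32::max)
}

/// Cosine similarity, accumulated in f64 so the metric itself adds little error.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

fn main() {
    // Hypothetical values standing in for a reference and a candidate output.
    let reference = [0.12_f32, -0.50, 0.33, 0.90];
    let candidate = [0.12_f32, -0.50, 0.33, 0.90 + 5e-7];
    println!("max abs error = {:e}", max_abs_error(&reference, &candidate));
    println!("cosine        = {:.6}", cosine_similarity(&reference, &candidate));
}
```

Reporting both numbers is useful because max abs error catches a single bad element, while cosine similarity reflects whether the embedding direction as a whole is preserved.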

Install

Rust

[dependencies]
wasmicro = "0.2.2"

JavaScript

npm install wasmicro

import init, { WasmBertModel, WasmWordPieceTokenizer } from "wasmicro";

await init();
// Fetch model.safetensors + vocab.txt from your CDN of choice.
const modelBytes = new Uint8Array(await (await fetch("/model.safetensors")).arrayBuffer());
const vocabBytes = new Uint8Array(await (await fetch("/vocab.txt")).arrayBuffer());

const tokenizer = new WasmWordPieceTokenizer(vocabBytes, /*lowercase=*/ true);
const model = new WasmBertModel(
  modelBytes,
  /* hidden_size */ 384, /* num_layers */ 6, /* num_heads */ 12,
  /* intermediate */ 1536, /* vocab */ 30522, /* max_pos */ 512, /* type_vocab */ 2,
  /* prefix */ "",
);
const embedding = model.embed_text(tokenizer, "hello world", 128);
console.log(`dim=${embedding.length}`); // 384

The shipped .wasm is 93 KB. Going by the payload sizes below, the engine is roughly 16× to 215× smaller than common alternatives; the model file is unchanged.

| Runtime | WASM/JS payload |
| --- | --- |
| wasmicro | 93 KB |
| Candle WASM | 1.5–5 MB |
| transformers.js | ~10 MB |
| ONNX Runtime Web | 8–20 MB |

Quick start (Rust)

use std::fs;
use wasmicro::{
    models::bert::{BertConfig, BertModel},
    ModelFile, WordPieceTokenizer,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_bytes = fs::read("models/mini-lm/model.safetensors")?;
    let vocab_bytes = fs::read("models/mini-lm/vocab.txt")?;

    let file = ModelFile::parse(&model_bytes)?;
    let tokenizer = WordPieceTokenizer::from_vocab_bytes(&vocab_bytes)?;
    let model = BertModel::from_safetensors(&file, BertConfig::mini_lm_l6_v2(), "")?;

    let embedding = model.embed_text(&tokenizer, "hello world", 128)?;
    println!("embedding dim: {:?}", embedding.shape().as_slice());
    Ok(())
}
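`ModelFile::parse` takes a plain byte slice. As background, the safetensors container starts with an 8-byte little-endian header length, followed by a UTF-8 JSON header and then the raw tensor bytes. The sketch below only splits a buffer into those two parts; `split_safetensors` is a hypothetical helper written for illustration and is not wasmicro's loader, which also has to interpret the JSON.

```rust
/// Splits a safetensors buffer into its JSON header text and raw tensor data.
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }
    // The first 8 bytes hold the header length as a little-endian u64.
    let (prefix, rest) = bytes.split_at(8);
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(prefix);
    let header_len = u64::from_le_bytes(len_bytes);
    if header_len > rest.len() as u64 {
        return Err("header extends past the end of the buffer".to_string());
    }
    let (header, data) = rest.split_at(header_len as usize);
    let header = std::str::from_utf8(header).map_err(|e| e.to_string())?;
    Ok((header, data))
}

fn main() {
    // Build a tiny synthetic file: one f32 tensor named "w" with shape [1].
    let header = r#"{"w":{"dtype":"F32","shape":[1],"data_offsets":[0,4]}}"#;
    let mut file = Vec::new();
    file.extend_from_slice(&(header.len() as u64).to_le_bytes());
    file.extend_from_slice(header.as_bytes());
    file.extend_from_slice(&1.5_f32.to_le_bytes());

    let (parsed_header, data) = split_safetensors(&file).expect("well-formed buffer");
    println!("header: {}", parsed_header);
    println!("tensor bytes: {}", data.len());
}
```

Because the function borrows from the input slice and allocates nothing, the same code works whether the bytes came from `fs::read`, a memory map, or a fetched buffer.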

Multilingual

The WordPiece tokenizer is Unicode-aware:

Splits each CJK ideograph into its own token (matches HuggingFace).
Lowercases via Unicode (char::to_lowercase) — handles Cyrillic, Greek, Latin Extended, etc.
Recognises Unicode whitespace (NBSP, ideographic space, …).
Treats Unicode punctuation as its own token (CJK comma, Spanish ¿/¡, French guillemets, …).
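The four behaviours above can be illustrated with a small pre-tokenization sketch. It is an approximation written for this README, not wasmicro's implementation: the CJK ranges are the ones used by the original BERT tokenizer, while `is_punctuation` covers only ASCII plus a few hand-picked Unicode ranges (a full version would test the Unicode punctuation categories). All function names are hypothetical.

```rust
/// CJK ideograph ranges used by the original BERT tokenizer.
fn is_cjk_ideograph(c: char) -> bool {
    matches!(c as u32,
        0x4E00..=0x9FFF | 0x3400..=0x4DBF | 0xF900..=0xFAFF
        | 0x20000..=0x2A6DF | 0x2A700..=0x2B73F | 0x2B740..=0x2B81F
        | 0x2B820..=0x2CEAF | 0x2F800..=0x2FA1F)
}

/// Approximate punctuation test (assumption: a hand-picked subset only).
fn is_punctuation(c: char) -> bool {
    c.is_ascii_punctuation()
        || matches!(c, '¡' | '¿' | '«' | '»')
        || matches!(c as u32, 0x2010..=0x2027 | 0x3001..=0x3003 | 0x3008..=0x3011)
        || matches!(c, '，' | '！' | '？' | '：' | '；')
}

/// Splits on Unicode whitespace and isolates punctuation and CJK ideographs.
fn pre_tokenize(text: &str, lowercase: bool) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_whitespace() {
            // Whitespace (including NBSP and U+3000) ends the current word.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else if is_punctuation(c) || is_cjk_ideograph(c) {
            // Punctuation and ideographs always stand alone.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else if lowercase {
            // char::to_lowercase may yield more than one char.
            current.extend(c.to_lowercase());
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    println!("{:?}", pre_tokenize("¿Hola, мир?\u{00A0}你好", true));
}
```

The words produced by this step are what a WordPiece tokenizer then splits further into vocabulary pieces.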

To work with non-English text, use a multilingual vocabulary. Example:

use wasmicro::tokenizer::{WordPieceOptions, WordPieceTokenizer};

let vocab = std::fs::read("models/multilingual/vocab.txt")?;
let tokenizer = WordPieceTokenizer::from_vocab_bytes_with_options(
    &vocab,
    // bert-base-multilingual-cased: keep case, no accent stripping.
    WordPieceOptions { lowercase: false, max_input_chars_per_word: 100 },
)?;
let encoded = tokenizer.encode("Привет, мир!", 32)?;
// -> [CLS] При ##вет , мир ! [SEP]

Accent stripping (NFD + combining-mark removal) is not implemented; pick a *-cased multilingual vocabulary if your inputs contain accents.
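The `При ##вет` split in the example comes from WordPiece's greedy longest-match-first rule. The sketch below shows that rule on a single word with a toy three-entry vocabulary; it assumes `max_input_chars_per_word` counts characters and that an unmatched word maps to `[UNK]`, as in the reference BERT tokenizer. The `wordpiece` function is hypothetical and returns strings rather than ids.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first WordPiece split of one pre-tokenized word.
fn wordpiece(word: &str, vocab: &HashSet<&str>, max_chars: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    if chars.len() > max_chars {
        return vec!["[UNK]".to_string()];
    }
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while start < end {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate.insert_str(0, "##"); // continuation marker
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => pieces.push(piece),
            // No prefix matched: the whole word becomes unknown.
            None => return vec!["[UNK]".to_string()],
        }
        start = end;
    }
    pieces
}

fn main() {
    // Toy vocabulary for illustration only, not the real multilingual vocab.
    let vocab: HashSet<&str> = ["При", "##вет", "мир"].iter().copied().collect();
    println!("{:?}", wordpiece("Привет", &vocab, 100));
    println!("{:?}", wordpiece("мир", &vocab, 100));
}
```

Working on `char`s rather than bytes keeps the slicing valid for Cyrillic and CJK input, where one character spans several UTF-8 bytes.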

Get a model

# Build the converter once.
cargo build --release -p wasmicro-convert

# Download all-MiniLM-L6-v2 from the HuggingFace Hub.
./target/release/wasmicro-convert \
    sentence-transformers/all-MiniLM-L6-v2 \
    ./models/mini-lm

# Optional: also write model.i8.safetensors with weight-only int8 quantization.
./target/release/wasmicro-convert \
    sentence-transformers/all-MiniLM-L6-v2 \
    ./models/mini-lm \
    --quantize i8

Resulting directory:

models/mini-lm/
├── model.safetensors      (~87 MB, ready for ModelFile::parse)
├── model.i8.safetensors   (optional, --quantize i8)
├── config.json
├── vocab.txt              (pass to WordPieceTokenizer::from_vocab_bytes)
└── tokenizer.json
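To make "weight-only int8 quantization" concrete, here is a minimal sketch of one common scheme: symmetric quantization with a single scale per tensor. This is an assumption for illustration; the README does not say whether `wasmicro-convert` uses per-tensor or per-row scales, and the function names below are hypothetical.

```rust
/// Symmetric int8 quantization with one scale for the whole tensor.
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0_f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero for an all-zero tensor.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q: Vec<i8> = weights
        .iter()
        .map(|w| (w / scale).round().max(-127.0).min(127.0) as i8)
        .collect();
    (q, scale)
}

/// Recovers approximate f32 weights from the int8 values and the scale.
fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| f32::from(v) * scale).collect()
}

fn main() {
    // Hypothetical weights, not taken from a real model.
    let weights = [-1.0_f32, 0.0, 0.25, 1.0];
    let (q, scale) = quantize_i8(&weights);
    let restored = dequantize_i8(&q, scale);
    println!("scale = {}", scale);
    println!("q = {:?}", q);
    println!("restored = {:?}", restored);
}
```

"Weight-only" means just the stored weights are int8; activations stay in f32, so each weight is multiplied by its scale during or before the matmul. Each quantized weight takes one byte instead of four.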

Building from source

# Native: tests, benchmarks, examples.
cargo test --workspace
cargo run --example load_safetensors

# WASM bundle (SIMD128 is enabled automatically by .cargo/config.toml).
wasm-pack build --release --target web --no-opt \
    --out-dir demo/pkg --out-name wasmicro \
    . -- --features wasm

wasm-opt --enable-bulk-memory --enable-nontrapping-float-to-int --enable-simd \
    -Oz demo/pkg/wasmicro_bg.wasm -o demo/pkg/wasmicro_bg.wasm

# Optional: repeatable size report.
powershell -ExecutionPolicy Bypass -File tools/measure-size.ps1

# Serve the demo locally.
cd demo && python -m http.server 8080

.cargo/config.toml sets target-feature=+simd128 for wasm32-unknown-unknown, so every wasm-pack build ships SIMD128 kernels. To target browsers without WebAssembly SIMD support (for example Safari before 16.4), override the flag by passing RUSTFLAGS="-C target-feature=-simd128".

Verification

The wasmicro-verify sibling project is the source of truth for every numeric claim above.

cd ../wasmicro-verify

# 1. Generate HuggingFace reference outputs (Python + transformers).
python python/reference.py
python python/multilingual_tokens.py

# 2. Run wasmicro on the same inputs and compare.
cargo run --release --bin wasmicro-verify     # BERT forward vs HF
cargo run --release --bin e2e_search          # text -> tokenize -> embed -> rank
cargo run --release --bin multilingual_tokens # WordPiece vs HF on RU/ZH/ES

Expected outcome — all three exit 0 with detailed per-case reports. CI will gate releases on these in a future revision.
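The "embed -> rank" part of the `e2e_search` pipeline can be sketched as follows. The sketch assumes mean pooling over unmasked token vectors (the roadmap lists mean pooling), then L2 normalization and dot-product scoring; the verifier's actual scoring code is not shown in this README, and every function name here is hypothetical.

```rust
/// Mean-pools token vectors, skipping positions whose mask entry is 0.
fn mean_pool(tokens: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let dim = tokens.first().map_or(0, |t| t.len());
    let mut sum = vec![0.0_f32; dim];
    let mut count = 0.0_f32;
    for (tok, &m) in tokens.iter().zip(mask) {
        if m != 0 {
            for (s, v) in sum.iter_mut().zip(tok) {
                *s += v;
            }
            count += 1.0;
        }
    }
    if count > 0.0 {
        for s in sum.iter_mut() {
            *s /= count;
        }
    }
    sum
}

/// Scales a vector to unit L2 norm so a dot product equals cosine similarity.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        v.to_vec()
    } else {
        v.iter().map(|x| x / norm).collect()
    }
}

/// Index of the document with the highest dot product against the query.
fn top1(query: &[f32], docs: &[Vec<f32>]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, d) in docs.iter().enumerate() {
        let score: f32 = query.iter().zip(d).map(|(q, x)| q * x).sum();
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}

fn main() {
    // Hypothetical 2-dimensional vectors standing in for real embeddings.
    let query = l2_normalize(&mean_pool(&[vec![1.0, 3.0], vec![3.0, 5.0]], &[1, 1]));
    let docs = vec![l2_normalize(&[1.0, 0.0]), l2_normalize(&[1.0, 2.0])];
    println!("top-1 document index: {:?}", top1(&query, &docs));
}
```

A "top-1" check like the one in the results table then reduces to comparing the returned index with the expected document for each query.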

Project layout

wasmicro/
├── src/                          # the library (default deps: bytemuck only)
│   ├── lib.rs
│   ├── tensor.rs                 # owned f32 tensor + inline shape
│   ├── tokenizer.rs              # Unicode WordPiece tokenizer
│   ├── quant.rs                  # i8, u8 affine, q4 packed quantized tensors
│   ├── loader.rs                 # safetensors parser (no serde)
│   ├── error.rs
│   ├── ops/                      # matmul (+SIMD128), attention, layernorm, …
│   ├── models/
│   │   └── bert.rs               # BertModel + forward + from_safetensors
│   └── wasm.rs                   # wasm-bindgen surface (feature = "wasm")
├── tools/
│   ├── wasmicro-convert/         # CLI to download, validate, quantize HF models
│   └── measure-size.ps1          # WASM/npm size report
├── tests/                        # integration tests via the public API
├── examples/                     # runnable demos
├── demo/                         # static site deployed to GitHub Pages
├── .cargo/config.toml            # enables SIMD128 by default for wasm32
└── .github/workflows/            # CI + Pages deploy

Design rules

These are non-negotiable. Code that breaks them gets reverted.

Tiny WASM bundle. Current: 93 KB. Cap: 250 KB after wasm-opt -Oz.
Forward only. No autograd, no optimizers, no training state.
Owned tensors. Vec<f32>. No Rc, no RefCell.
Minimal dependencies. The library's default build pulls in only bytemuck. No ndarray, no candle, no rayon, no serde_json, no chrono. The wasmicro-convert CLI is a separate crate with its own deps (hf-hub, etc.) and never ships in the WASM.
The host owns bytes. ModelFile::parse(&[u8]) — same code path for files, fetches, mmap, or ArrayBuffer.
Ops are free functions. Layers are functions, not objects.
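As an illustration of the "ops are free functions" and "owned tensors" rules, here is what a layer normalization op might look like in that style: a plain function over slices that returns an owned `Vec<f32>`, with no layer object holding state. This is a sketch under those assumptions, not the signature of wasmicro's own layernorm.

```rust
/// Layer normalization over one row, written as a plain function on slices.
fn layer_norm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    assert!(x.len() == gamma.len() && x.len() == beta.len());
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    // Biased variance (divide by n), as is conventional for layer norm.
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
        .collect()
}

fn main() {
    // Hypothetical row and parameters; real values come from the model file.
    let row = [1.0_f32, 2.0, 3.0, 4.0];
    let gamma = [1.0_f32; 4];
    let beta = [0.0_f32; 4];
    println!("{:?}", layer_norm(&row, &gamma, &beta, 1e-12));
}
```

Passing the weights in as arguments means the caller decides where they live, which keeps the op free of shared ownership types such as `Rc` or `RefCell`.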

Honest limitations

Only the BERT encoder architecture is supported. No GPT, T5, Whisper, ViT, CLIP, or any decoder/encoder-decoder model yet.
No accent stripping (NFD + mark removal). Use *-cased multilingual vocabularies if your inputs include accents.
No batching. Encoding multiple sentences runs them sequentially.
CPU only. No WebGPU backend; matmul uses naive ikj with WASM SIMD128 inner kernels. Production-scale throughput is not the target.
No zero-config import. You must download the model, copy vocab.txt, and pass the config fields explicitly. Higher-level pipelines (à la pipeline('feature-extraction', '...')) are not provided.

If any of these matter for your use case, prefer transformers.js or Candle — they are far more feature-complete.
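For context on the "naive ikj" matmul mentioned in the limitations, the following is a minimal scalar sketch of that loop order on row-major matrices. It is written for illustration and is not wasmicro's kernel; the SIMD128 version would replace the innermost loop with vector instructions.

```rust
/// C = A (m x k) * B (k x n), all row-major, using the i-k-j loop order.
fn matmul_ikj(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0_f32; m * n];
    for i in 0..m {
        let c_row = &mut c[i * n..(i + 1) * n];
        for p in 0..k {
            let a_ip = a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            // The innermost loop walks contiguous memory in both B and C,
            // which is the part a SIMD kernel would vectorize.
            for (c_ij, b_pj) in c_row.iter_mut().zip(b_row) {
                *c_ij += a_ip * b_pj;
            }
        }
    }
    c
}

fn main() {
    // 2x3 times 3x2 example with small integer values.
    let a = [1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [7.0_f32, 8.0, 9.0, 10.0, 11.0, 12.0];
    println!("{:?}", matmul_ikj(&a, &b, 2, 3, 2));
}
```

Compared with the textbook i-j-k order, i-k-j avoids striding down a column of B in the inner loop, which is friendlier to caches even without blocking or multithreading.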

Roadmap

Project skeleton
Plain tensor + inline shape
Forward ops: matmul, linear, embedding, softmax, layernorm, GELU/SiLU/ReLU
safetensors loader with no serde
Multi-head attention + mean pooling
BERT encoder forward + from_safetensors
Numerical parity with HuggingFace on all-MiniLM-L6-v2 (1e-6)
HuggingFace → wasmicro converter CLI
WordPiece tokenizer with Unicode awareness (CJK split, Unicode case)
Multilingual parity test against bert-base-multilingual-cased (8/8)
Weight-only quantized linear ops: i8, affine u8, packed q4
Quantized BERT linear loading (i8, u8/q8)
Converter quantization pipeline (--quantize i8)
WASM SIMD128 kernels for matmul and matmul_t_b
End-to-end semantic-search verifier (text → embedding → ranking)
CI + GitHub Pages deploy workflow
WASM demo page
Published to crates.io and npm
Live demo with a downloadable model bundle on GitHub Pages
NFD accent-stripping path for uncased multilingual vocabularies
Zero-config import: wasmicro::embed("text") with auto-fetch of HF assets
Browser benchmark numbers (tokens/s on M-series, mid-tier x86, Android)
GPT-2 + KV-cache
WebGPU backend

License

MIT OR Apache-2.0

wasmicro 0.2.2