wasmicro
Tiny transformer inference for the web. One file. No build step.
wasmicro runs transformer models (embeddings, classifiers, small LLMs) in
any JavaScript environment — browser, Node.js, Cloudflare Workers, Electron —
with a single small .wasm file. The same crate also runs natively, so the
same code powers your tests, your benchmarks, and your production website.
Status
Pre-alpha. Working today:
- Tensor core. Owned Tensor with inline shape. No Rc<RefCell>, no autograd, no training state.
- Forward ops. matmul, matmul_t_b, linear, embedding, softmax, layer_norm, relu, silu, gelu_tanh, gelu_erf, elementwise math, multi-head self-attention, mean pooling, and weight-only quantized linear paths for i8, affine u8, and packed q4 weights.
- BERT encoder. Full forward pass against the HuggingFace BERT weight layout (bert-base-uncased, sentence-transformers/*, etc.). Linear weights may be F32, I8, or affine U8/q8 with companion scale tensors.
- WordPiece tokenizer. WordPieceTokenizer::from_vocab_bytes(&[u8]) loads external vocab.txt bytes and produces input_ids, token_type_ids, and attention_mask.
- Model loader. ModelFile::parse(&[u8]) reads safetensors with a hand-rolled JSON parser. No serde, no serde_json in the library.
- Converter CLI. wasmicro-convert <hf-model-id> <out-dir> downloads a model from the HuggingFace Hub, validates it, and can write an i8 or u8/q8 weight-only quantized BERT file.
- WASM build + demo. GitHub Actions builds the WASM bundle and deploys a live demo page on every push to main.
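To make "weight-only quantized linear" concrete, here is a minimal sketch of an i8 path with one scale per output row. It is an illustration under stated assumptions, not wasmicro's actual code: the function names quantize_rows_i8 and linear_i8, the symmetric per-row scaling, and the row-major [out, in] layout are all hypothetical choices made for this example.

```rust
/// Hypothetical helper (not wasmicro's API): quantize a row-major
/// [out_features, in_features] f32 matrix to i8 with one scale per row,
/// so that w[o][i] is approximately q[o][i] as f32 * scale[o].
pub fn quantize_rows_i8(w: &[f32], out_features: usize, in_features: usize) -> (Vec<i8>, Vec<f32>) {
    assert!(in_features > 0);
    assert_eq!(w.len(), out_features * in_features);
    let mut q = Vec::with_capacity(w.len());
    let mut scales = Vec::with_capacity(out_features);
    for row in w.chunks(in_features) {
        // Symmetric quantization: map the largest magnitude in the row to 127.
        let max_abs = row.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
        scales.push(scale);
        q.extend(row.iter().map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8));
    }
    (q, scales)
}

/// Hypothetical weight-only quantized linear: activations stay f32,
/// weights are dequantized on the fly by applying the row scale once
/// per output element.
pub fn linear_i8(
    x: &[f32],
    w_q: &[i8],
    scales: &[f32],
    bias: Option<&[f32]>,
    in_features: usize,
) -> Vec<f32> {
    let out_features = scales.len();
    assert_eq!(x.len(), in_features);
    assert_eq!(w_q.len(), out_features * in_features);
    (0..out_features)
        .map(|o| {
            let row = &w_q[o * in_features..(o + 1) * in_features];
            let acc: f32 = row.iter().zip(x).map(|(&w, &xi)| w as f32 * xi).sum();
            acc * scales[o] + bias.map_or(0.0, |b| b[o])
        })
        .collect()
}
```

Applying the scale after the dot product keeps the inner loop to one multiply-add per weight, which is the usual reason to store a companion scale tensor next to the quantized weights.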
Quick start (using wasmicro in another project)
The most convenient way is a path dependency while iterating locally:
[dependencies]
wasmicro = { path = "../wasmicro" }
A git dependency is just as easy:
[dependencies]
wasmicro = { git = "https://github.com/Xzdes/wasmicro" }
Once it is published, crates.io will be the recommended path:
[dependencies]
wasmicro = "0.1.0"
Use it:
use fs;
use ;
Get a model
# Build the converter (one-time)
# Download all-MiniLM-L6-v2 from the HuggingFace Hub
# Optional: also write model.i8.safetensors with quantized BERT linear weights
Output:
models/mini-lm/
├── model.safetensors (~ 87 MB, ready to pass to ModelFile::parse)
├── model.i8.safetensors (optional, when --quantize i8 is used)
├── config.json
├── vocab.txt
└── tokenizer.json
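For orientation, the safetensors container that model.safetensors uses is simple: an 8-byte little-endian u64 header length, then that many bytes of UTF-8 JSON metadata, then the raw tensor bytes. The sketch below splits a buffer along those boundaries. The function name split_safetensors is hypothetical and this is not how ModelFile::parse is implemented; it only illustrates the layout a parser has to handle.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not wasmicro's API): split a safetensors buffer
/// into its JSON header text and the raw tensor data that follows it.
pub fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".into());
    }
    // The first 8 bytes hold the header length as a little-endian u64.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes))
        .map_err(|_| "header length does not fit in usize".to_string())?;
    let end = 8usize
        .checked_add(header_len)
        .ok_or("header length overflows usize")?;
    // `get` returns None instead of panicking on a truncated file.
    let header = bytes
        .get(8..end)
        .ok_or("header runs past the end of the buffer")?;
    let json = std::str::from_utf8(header).map_err(|e| e.to_string())?;
    // Everything after the header is tensor data; the data_offsets in the
    // JSON are relative to the start of this slice.
    Ok((json, &bytes[end..]))
}
```

Because the function only borrows from the input slice, it works the same way whether the bytes came from a file read, a network fetch, or a memory map.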
Building
# Native — tests, benchmarks, examples.
# WASM bundle (browser, ES modules).
# Repeatable size report for the WASM bundle and npm dry-run package.
# Serve the demo locally
&&
Demo
A live demo is built and deployed automatically by GitHub Actions on every
push to main. The workflow is at .github/workflows/pages.yml.
To enable Pages on your fork:
- Settings → Pages → Build and deployment → Source: GitHub Actions.
- Push to main. The pages workflow builds the WASM bundle, runs wasm-opt -Oz, and publishes demo/ to Pages.
Project layout
wasmicro/
├── src/ # the library
│ ├── lib.rs
│ ├── tensor.rs # owned f32 tensor + inline shape
│ ├── tokenizer.rs # minimal WordPiece tokenizer
│ ├── quant.rs # weight-only quantized storage types
│ ├── loader.rs # safetensors parser (no serde)
│ ├── error.rs
│ ├── ops/ # forward ops: matmul, attention, layernorm, ...
│ ├── models/
│ │ └── bert.rs # BertModel + forward + from_safetensors
│ └── wasm.rs # wasm-bindgen surface (feature = "wasm")
├── tools/
│ ├── wasmicro-convert/ # CLI to download & validate HF models
│ └── measure-size.ps1 # WASM/npm size report
├── tests/ # integration tests via the public API
├── examples/ # runnable demos
├── demo/ # static site for GitHub Pages
└── .github/workflows/ # CI + Pages deploy
Design rules
These are non-negotiable. Code that breaks them gets reverted.
- Tiny WASM bundle. Target: < 250 KB after wasm-opt -Oz.
- Forward only. No autograd, no optimizers, no training.
- Owned tensors. Vec<f32>, no Rc, no RefCell.
- No heavy dependencies. The library's default build pulls in only bytemuck. No ndarray, candle, rayon, serde_json, chrono. (The wasmicro-convert CLI is a separate crate, so it can have any deps it likes.)
- The host owns bytes. ModelFile::parse(&[u8]) works for files, fetches, mmap, ArrayBuffer: all the same to us.
- Ops are free functions. Layers are functions, not objects.
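As a rough illustration of the "ops are free functions, layers are functions" rule, the sketch below builds a two-layer feed-forward block from plain functions over slices and owned Vec<f32> values. The signatures, the MlpWeights struct, and the single-row input are assumptions made for this example; wasmicro's real ops may look different.

```rust
/// Hypothetical op: y = x * W^T + b for one input row.
/// `w` is row-major [out, in]; the output size is taken from `b`.
pub fn linear(x: &[f32], w: &[f32], b: &[f32]) -> Vec<f32> {
    let (n_in, n_out) = (x.len(), b.len());
    assert_eq!(w.len(), n_in * n_out);
    (0..n_out)
        .map(|o| {
            let row = &w[o * n_in..(o + 1) * n_in];
            row.iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f32>() + b[o]
        })
        .collect()
}

/// Hypothetical op: ReLU that reuses the owned buffer it is given.
pub fn relu(mut x: Vec<f32>) -> Vec<f32> {
    for v in &mut x {
        *v = v.max(0.0);
    }
    x
}

/// Borrowed weights for the block. The struct holds no state of its own;
/// it is just a named bundle of slices owned by the caller.
pub struct MlpWeights<'a> {
    pub w1: &'a [f32],
    pub b1: &'a [f32],
    pub w2: &'a [f32],
    pub b2: &'a [f32],
}

/// The "layer" is a function that composes ops; there is no object
/// with a forward method and no shared mutable state.
pub fn mlp(x: &[f32], p: &MlpWeights<'_>) -> Vec<f32> {
    linear(&relu(linear(x, p.w1, p.b1)), p.w2, p.b2)
}
```

With this shape, every intermediate is an owned Vec<f32> that is dropped as soon as the next op has consumed it, so no Rc or RefCell is needed to share tensors between layers.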
Roadmap
- Project skeleton
- Plain tensor + shape
- Forward ops: matmul, linear, embedding, softmax, layernorm, GELU/SiLU/ReLU
- safetensors loader with no serde
- Multi-head attention + mean pooling
- BERT encoder forward + from_safetensors
- HuggingFace → wasmicro converter CLI
- CI + GitHub Pages deploy workflow
- WASM demo page
- WordPiece tokenizer from external vocab.txt
- End-to-end semantic-search demo: text -> WordPiece -> BERT embeddings -> cosine ranking
- Weight-only quantized linear ops: i8, affine u8, packed q4
- Quantized BERT linear loading for i8 and affine u8/q8
- Repeatable WASM/npm size measurement script
- Real all-MiniLM-L6-v2 semantic-search demo
- Converter quantization pipeline for BERT linear weights
- WASM SIMD128 paths
- GPT-2 + KV-cache
- WebGPU backend
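The final step of the semantic-search pipeline named in the roadmap (cosine ranking over embeddings) can be sketched as follows. This is a minimal illustration, not the demo's code: the names cosine and rank are hypothetical, and the embeddings are assumed to be plain f32 vectors of equal length.

```rust
/// Hypothetical helper: cosine similarity of two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Hypothetical helper: score every document embedding against the query
/// and return (document index, score) pairs, best match first.
pub fn rank(query: &[f32], docs: &[Vec<f32>]) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine(query, d)))
        .collect();
    // total_cmp gives a total order on f32, so sorting never panics on NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

If the embeddings are L2-normalized once after pooling, the division in cosine becomes unnecessary and ranking reduces to a dot product per document.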
License
MIT OR Apache-2.0