wasmicro

Tiny transformer inference for the web. One file. No build step.

wasmicro runs transformer models (embeddings, classifiers, small LLMs) in any JavaScript environment — browser, Node.js, Cloudflare Workers, Electron — with a single small .wasm file. The same crate also runs natively, so the same code powers your tests, your benchmarks, and your production website.

Status

Pre-alpha. Working today:

Tensor core. Owned Tensor with inline shape. No Rc<RefCell>, no autograd, no training state.
Forward ops. matmul, matmul_t_b, linear, embedding, softmax, layer_norm, relu, silu, gelu_tanh, gelu_erf, elementwise math, multi-head self-attention, mean pooling, and weight-only quantized linear paths for i8, affine u8, and packed q4 weights.
BERT encoder. Full forward pass against the HuggingFace BERT weight layout (bert-base-uncased, sentence-transformers/*, etc.). Linear weights may be F32, I8, or affine U8/q8 with companion scale tensors.
WordPiece tokenizer. WordPieceTokenizer::from_vocab_bytes(&[u8]) loads external vocab.txt bytes and produces input_ids, token_type_ids, and attention_mask.
Model loader. ModelFile::parse(&[u8]) reads safetensors with a hand-rolled JSON parser. No serde, no serde_json in the library.
Converter CLI. wasmicro-convert <hf-model-id> <out-dir> downloads a model from the HuggingFace Hub, validates it, and can write an i8 or u8/q8 weight-only quantized BERT file.
WASM build + demo. GitHub Actions builds the WASM bundle and deploys a live demo page on every push to main.
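
To make the weight-only quantized linear path concrete, here is a minimal sketch of the i8 case. It assumes symmetric quantization with one f32 scale per output row and a row-major [out, in] weight layout. The name `linear_i8` and this layout are illustrative assumptions, not wasmicro's actual API or on-disk format.

```rust
/// Hypothetical helper, not part of wasmicro's public API.
/// Computes y = x * W^T + b, where W is stored as i8 with one f32 scale per output row.
/// Assumption: symmetric quantization, so w[o][i] is approximately q[o][i] as f32 * scales[o].
pub fn linear_i8(
    x: &[f32],            // input vector, length in_features
    q: &[i8],             // quantized weights, row-major [out_features, in_features]
    scales: &[f32],       // one scale per output row
    bias: Option<&[f32]>, // optional bias, length out_features
) -> Vec<f32> {
    let in_features = x.len();
    let out_features = scales.len();
    assert_eq!(q.len(), in_features * out_features, "weight shape mismatch");

    let mut y = Vec::with_capacity(out_features);
    for o in 0..out_features {
        let row = &q[o * in_features..(o + 1) * in_features];
        // Accumulate against the raw i8 values, then apply the row scale once.
        let mut acc = 0.0f32;
        for (&xi, &qi) in x.iter().zip(row) {
            acc += xi * (qi as f32);
        }
        let b = bias.map_or(0.0, |b| b[o]);
        y.push(acc * scales[o] + b);
    }
    y
}
```

Only the weights are quantized here; activations and the accumulator stay in f32, which is what "weight-only" means in this context.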

Quick start (using wasmicro in another project)

While iterating locally, the most convenient option is a path dependency:

[dependencies]
wasmicro = { path = "../wasmicro" }

A git dependency is just as easy:

[dependencies]
wasmicro = { git = "https://github.com/Xzdes/wasmicro" }

Once the crate is published, crates.io will be the recommended source:

[dependencies]
wasmicro = "0.2.0"

Use it:

use std::fs;
use wasmicro::{
    models::bert::{BertConfig, BertModel},
    ModelFile, WordPieceTokenizer,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_bytes = fs::read("model.safetensors")?;
    let vocab_bytes = fs::read("vocab.txt")?;

    let file = ModelFile::parse(&model_bytes)?;
    let tokenizer = WordPieceTokenizer::from_vocab_bytes(&vocab_bytes)?;
    let config = BertConfig::mini_lm_l6_v2();
    let model = BertModel::from_safetensors(&file, config, "")?;

    let embedding = model.embed_text(&tokenizer, "hello world", 128)?;

    println!("embedding shape: {:?}", embedding.shape().as_slice());
    Ok(())
}
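
Embeddings like the one produced above are usually compared with cosine similarity. The sketch below works on plain f32 slices. Both `cosine_similarity` and `rank_by_cosine` are hypothetical helpers, and getting a flat `&[f32]` out of the returned tensor is assumed rather than shown, since that accessor is not documented here.

```rust
/// Hypothetical helper: cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let mut dot = 0.0f32;
    let mut norm_a = 0.0f32;
    let mut norm_b = 0.0f32;
    for (&x, &y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = norm_a.sqrt() * norm_b.sqrt();
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}

/// Ranks document embeddings by similarity to a query, best match first.
/// Returns (document index, score) pairs.
pub fn rank_by_cosine(query: &[f32], docs: &[Vec<f32>]) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // total_cmp gives a total order on f32, so a NaN score cannot panic the sort.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```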

Get a model

# Build the converter (one-time)
cargo build --release -p wasmicro-convert

# Download all-MiniLM-L6-v2 from the HuggingFace Hub
./target/release/wasmicro-convert \
    sentence-transformers/all-MiniLM-L6-v2 \
    ./models/mini-lm

# Optional: also write model.i8.safetensors with quantized BERT linear weights
./target/release/wasmicro-convert \
    sentence-transformers/all-MiniLM-L6-v2 \
    ./models/mini-lm \
    --quantize i8

Output:

models/mini-lm/
├── model.safetensors    (~87 MB, ready to pass to ModelFile::parse)
├── model.i8.safetensors (optional, when --quantize i8 is used)
├── config.json
├── vocab.txt
└── tokenizer.json
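
For orientation, a safetensors file starts with an 8-byte little-endian header length, followed by a UTF-8 JSON header of that length, followed by the raw tensor bytes. The sketch below only splits a buffer into those parts. `split_safetensors` is a hypothetical helper for illustration; in wasmicro the real entry point is ModelFile::parse, whose internals may differ.

```rust
/// Hypothetical helper: splits a safetensors buffer into (JSON header text, tensor data).
/// It does not parse the JSON or validate tensor offsets.
pub fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }
    // The first 8 bytes hold the header length as a little-endian u64.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(len_bytes);

    // Reject a length prefix that would run past the end of the buffer.
    if header_len > (bytes.len() - 8) as u64 {
        return Err("header length exceeds buffer size".to_string());
    }
    let header_end = 8 + header_len as usize;

    let header = std::str::from_utf8(&bytes[8..header_end])
        .map_err(|e| format!("header is not valid UTF-8: {}", e))?;
    Ok((header, &bytes[header_end..]))
}
```

Because the function borrows from a `&[u8]`, it does not care whether the bytes came from a file read, a network fetch, or a memory map.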

Building

# Native — tests, benchmarks, examples.
cargo test --workspace
cargo run --example load_safetensors

# WASM bundle (browser, ES modules).
wasm-pack build --release --target web --no-opt \
    --out-dir demo/pkg --out-name wasmicro --features wasm
wasm-opt --enable-bulk-memory --enable-nontrapping-float-to-int -Oz \
    demo/pkg/wasmicro_bg.wasm -o demo/pkg/wasmicro_bg.wasm

# Repeatable size report for the WASM bundle and npm dry-run package.
powershell -ExecutionPolicy Bypass -File tools/measure-size.ps1

# Serve the demo locally
cd demo && python -m http.server 8080

Demo

A live demo is built and deployed automatically by GitHub Actions on every push to main. The workflow is at .github/workflows/pages.yml.

To enable Pages on your fork:

In the fork, open Settings → Pages → Build and deployment and set Source to GitHub Actions.
Push to main. The pages workflow builds the WASM bundle, runs wasm-opt -Oz, and publishes demo/ to Pages.

Project layout

wasmicro/
├── src/                       # the library
│   ├── lib.rs
│   ├── tensor.rs              # owned f32 tensor + inline shape
│   ├── tokenizer.rs           # minimal WordPiece tokenizer
│   ├── quant.rs               # weight-only quantized storage types
│   ├── loader.rs              # safetensors parser (no serde)
│   ├── error.rs
│   ├── ops/                   # forward ops: matmul, attention, layernorm, ...
│   ├── models/
│   │   └── bert.rs            # BertModel + forward + from_safetensors
│   └── wasm.rs                # wasm-bindgen surface (feature = "wasm")
├── tools/
│   ├── wasmicro-convert/      # CLI to download, validate & quantize HF models
│   └── measure-size.ps1       # WASM/npm size report
├── tests/                     # integration tests via the public API
├── examples/                  # runnable demos
├── demo/                      # static site for GitHub Pages
└── .github/workflows/         # CI + Pages deploy
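
As a rough picture of what "owned f32 tensor + inline shape" in tensor.rs can mean, the sketch below stores the shape in a fixed-size array next to a plain Vec<f32>. The type layout, the `MAX_RANK` limit of 4, and the method names other than `shape()` and `as_slice()` are assumptions for illustration and may not match the crate.

```rust
/// Assumed maximum rank; the crate's real limit is not stated in this README.
pub const MAX_RANK: usize = 4;

/// Shape stored inline: a fixed array plus a rank, so the shape needs no heap allocation.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Shape {
    dims: [usize; MAX_RANK],
    rank: usize,
}

impl Shape {
    /// Returns None if `dims` has more than MAX_RANK entries.
    pub fn new(dims: &[usize]) -> Option<Shape> {
        if dims.len() > MAX_RANK {
            return None;
        }
        let mut inline = [0usize; MAX_RANK];
        inline[..dims.len()].copy_from_slice(dims);
        Some(Shape { dims: inline, rank: dims.len() })
    }

    /// The used dimensions only.
    pub fn as_slice(&self) -> &[usize] {
        &self.dims[..self.rank]
    }

    /// Total number of elements (1 for a rank-0 shape).
    pub fn numel(&self) -> usize {
        self.as_slice().iter().product()
    }
}

/// Owned tensor: the data lives in a plain Vec<f32>, with no Rc or RefCell.
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor {
    data: Vec<f32>,
    shape: Shape,
}

impl Tensor {
    /// Returns None if the shape is too large or does not match the data length.
    pub fn from_vec(data: Vec<f32>, dims: &[usize]) -> Option<Tensor> {
        let shape = Shape::new(dims)?;
        if shape.numel() != data.len() {
            return None;
        }
        Some(Tensor { data, shape })
    }

    pub fn shape(&self) -> &Shape {
        &self.shape
    }

    pub fn data(&self) -> &[f32] {
        &self.data
    }
}
```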

Design rules

These are non-negotiable. Code that breaks them gets reverted.

Tiny WASM bundle. Target: < 250 KB after wasm-opt -Oz.
Forward only. No autograd, no optimizers, no training.
Owned tensors. Vec<f32>, no Rc, no RefCell.
No heavy dependencies. The library's default build pulls in only bytemuck. No ndarray, candle, rayon, serde_json, or chrono. (The wasmicro-convert CLI is a separate crate, so it can have any deps it likes.)
The host owns bytes. ModelFile::parse(&[u8]) works for files, fetches, mmap, ArrayBuffer — all the same to us.
Ops are free functions. Layers are functions, not objects.
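
To illustrate the "ops are free functions" rule, here is a sketch of layer normalization written as a plain function over slices, with parameters passed in rather than held by a layer object. The signature is a hypothetical one chosen for this example; wasmicro's own layer_norm may take different arguments.

```rust
/// Layer normalization over each row of a row-major [rows, hidden] buffer.
/// Hypothetical signature for illustration only.
pub fn layer_norm(x: &[f32], hidden: usize, gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    assert!(hidden > 0 && x.len() % hidden == 0, "input must consist of whole rows");
    assert_eq!(gamma.len(), hidden);
    assert_eq!(beta.len(), hidden);

    let mut out = Vec::with_capacity(x.len());
    for row in x.chunks_exact(hidden) {
        // Per-row mean and (biased) variance.
        let mean = row.iter().sum::<f32>() / hidden as f32;
        let var = row.iter().map(|&v| (v - mean) * (v - mean)).sum::<f32>() / hidden as f32;
        let inv_std = 1.0 / (var + eps).sqrt();
        // Normalize, then apply the learned scale and shift.
        for ((&v, &g), &b) in row.iter().zip(gamma).zip(beta) {
            out.push((v - mean) * inv_std * g + b);
        }
    }
    out
}
```

The function owns nothing between calls: weights come in as borrowed slices and the result goes out as a fresh Vec<f32>, which keeps it in line with the owned-tensor and forward-only rules.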

Roadmap

Project skeleton
Plain tensor + shape
Forward ops: matmul, linear, embedding, softmax, layernorm, GELU/SiLU/ReLU
safetensors loader with no serde
Multi-head attention + mean pooling
BERT encoder forward + from_safetensors
HuggingFace → wasmicro converter CLI
CI + GitHub Pages deploy workflow
WASM demo page
WordPiece tokenizer from external vocab.txt
End-to-end semantic-search demo: text -> WordPiece -> BERT embeddings -> cosine ranking
Weight-only quantized linear ops: i8, affine u8, packed q4
Quantized BERT linear loading for i8 and affine u8/q8
Repeatable WASM/npm size measurement script
Real all-MiniLM-L6-v2 semantic-search demo
Converter quantization pipeline for BERT linear weights
WASM SIMD128 paths
GPT-2 + KV-cache
WebGPU backend

License

MIT OR Apache-2.0

wasmicro 0.2.0