# wasmicro

**Tiny transformer inference for the web. One file. No build step.**

`wasmicro` runs transformer models in any JavaScript environment (browser,
Node.js, Cloudflare Workers, Electron) from a single small `.wasm` file. It
targets embedding models, classifiers, and small LLMs; the BERT encoder is
what works today (see Status). The same crate also compiles natively, so one
codebase powers your tests, your benchmarks, and your production website.

## Status

Pre-alpha. Working today:

- **Tensor core.** Owned `Tensor` with inline shape. No `Rc<RefCell>`, no
  autograd, no training state.
- **Forward ops.** `matmul`, `matmul_t_b`, `linear`, `embedding`, `softmax`,
  `layer_norm`, `relu`, `silu`, `gelu_tanh`, `gelu_erf`, elementwise math,
  multi-head self-attention, mean pooling, and weight-only quantized linear
  paths for `i8`, affine `u8`, and packed `q4` weights.
- **BERT encoder.** Full forward pass against the HuggingFace BERT weight
  layout (`bert-base-uncased`, `sentence-transformers/*`, etc.). Linear
  weights may be `F32`, `I8`, or affine `U8/q8` with companion scale tensors.
- **WordPiece tokenizer.** `WordPieceTokenizer::from_vocab_bytes(&[u8])`
  loads external `vocab.txt` bytes and produces `input_ids`,
  `token_type_ids`, and `attention_mask`.
- **Model loader.** `ModelFile::parse(&[u8])` reads safetensors with a
  hand-rolled JSON parser. No `serde`, no `serde_json` in the library.
- **Converter CLI.** `wasmicro-convert <hf-model-id> <out-dir>` downloads
  a model from the HuggingFace Hub, validates it, and can write an `i8` or
  `u8/q8` weight-only quantized BERT file.
- **WASM build + demo.** GitHub Actions builds the WASM bundle and deploys
  a live demo page on every push to `main`.
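
WordPiece tokenization is commonly implemented as a greedy longest-match-first lookup against the vocabulary. The block below is a minimal standalone sketch of that idea, not wasmicro's actual `WordPieceTokenizer`. The names `vocab_from_bytes` and `wordpiece_word` are hypothetical, and the sketch assumes a `vocab.txt` with one token per line (id = line index), continuation pieces prefixed with `##`, and input that is already lowercased and split into words.

```rust
use std::collections::HashMap;

/// Build a vocabulary from `vocab.txt` bytes: one token per line, id = line index.
/// (Hypothetical helper for this sketch.)
fn vocab_from_bytes(bytes: &[u8]) -> HashMap<String, u32> {
    String::from_utf8_lossy(bytes)
        .lines()
        .enumerate()
        .map(|(i, tok)| (tok.trim_end().to_string(), i as u32))
        .collect()
}

/// Greedy longest-match-first WordPiece over a single word.
/// Returns the piece ids, or just `unk_id` if some part of the word has no match.
fn wordpiece_word(vocab: &HashMap<String, u32>, word: &str, unk_id: u32) -> Vec<u32> {
    let chars: Vec<char> = word.chars().collect();
    let mut ids = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest remaining substring first, then shrink from the right.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                // Pieces that continue a word carry the "##" prefix.
                piece.insert_str(0, "##");
            }
            if let Some(&id) = vocab.get(&piece) {
                found = Some(id);
                break;
            }
            end -= 1;
        }
        match found {
            Some(id) => {
                ids.push(id);
                start = end;
            }
            None => return vec![unk_id],
        }
    }
    ids
}
```

With a vocabulary that contains `wor` and `##ld`, the word `world` splits into those two pieces, while a word with no matching pieces maps to the unknown id.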

## Quick start (using wasmicro in another project)

While iterating locally, the most convenient option is a **path dependency**:

```toml
[dependencies]
wasmicro = { path = "../wasmicro" }
```

A **git dependency** is just as easy:

```toml
[dependencies]
wasmicro = { git = "https://github.com/Xzdes/wasmicro" }
```

Once it is published, **crates.io** will be the recommended path:

```toml
[dependencies]
wasmicro = "0.2.0"
```

Use it:

```rust
use std::fs;
use wasmicro::{
    models::bert::{BertConfig, BertModel},
    ModelFile, WordPieceTokenizer,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_bytes = fs::read("model.safetensors")?;
    let vocab_bytes = fs::read("vocab.txt")?;

    let file = ModelFile::parse(&model_bytes)?;
    let tokenizer = WordPieceTokenizer::from_vocab_bytes(&vocab_bytes)?;
    let config = BertConfig::mini_lm_l6_v2();
    let model = BertModel::from_safetensors(&file, config, "")?;

    let embedding = model.embed_text(&tokenizer, "hello world", 128)?;

    println!("embedding shape: {:?}", embedding.shape().as_slice());
    Ok(())
}
```
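
Once you have embeddings, semantic search comes down to ranking candidates by cosine similarity, the last step of the text -> WordPiece -> BERT embeddings -> cosine ranking pipeline listed in the roadmap. The sketch below works on plain `&[f32]` slices. It assumes the embedding values have already been copied out of the tensor (this README does not show the accessor for that), and `cosine` and `rank` are illustrative names, not part of wasmicro.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let dot: f32 = a.iter().zip(b).map(|(&x, &y)| x * y).sum();
    let norm_a = a.iter().map(|&x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|&x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Indices of `docs`, ordered from most to least similar to `query`.
fn rank(query: &[f32], docs: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine(query, d)))
        .collect();
    // Sort by descending score; total_cmp gives a total order on f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}
```

In a search demo, `query` would be the embedding of the user's text and `docs` the precomputed embeddings of the corpus.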

## Get a model

```bash
# Build the converter (one-time)
cargo build --release -p wasmicro-convert

# Download all-MiniLM-L6-v2 from the HuggingFace Hub
./target/release/wasmicro-convert \
    sentence-transformers/all-MiniLM-L6-v2 \
    ./models/mini-lm

# Optional: also write model.i8.safetensors with quantized BERT linear weights
./target/release/wasmicro-convert \
    sentence-transformers/all-MiniLM-L6-v2 \
    ./models/mini-lm \
    --quantize i8
```

Output:

```
models/mini-lm/
├── model.safetensors    (~ 87 MB, ready to pass to ModelFile::parse)
├── model.i8.safetensors (optional, when --quantize i8 is used)
├── config.json
├── vocab.txt
└── tokenizer.json
```
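
To give a feel for what a weight-only `i8` file stores, here is a minimal sketch of symmetric quantization with one `f32` scale per output row. This README does not state which scheme `wasmicro-convert` actually uses, so the per-row scaling, the `QuantLinearI8` struct, and the function names are assumptions made for illustration. Activations stay in `f32`; only the weights are stored as `i8`.

```rust
/// Row-major i8 weights of shape [out, inp] plus one f32 scale per output row.
/// (Hypothetical layout for this sketch.)
struct QuantLinearI8 {
    out: usize,
    inp: usize,
    weights: Vec<i8>,
    scales: Vec<f32>,
}

/// Symmetric per-row quantization: scale = max|w| / 127, q = round(w / scale).
fn quantize_i8(w: &[f32], out: usize, inp: usize) -> QuantLinearI8 {
    assert!(inp > 0 && w.len() == out * inp, "weight length must be out * inp");
    let mut weights = Vec::with_capacity(w.len());
    let mut scales = Vec::with_capacity(out);
    for row in w.chunks(inp) {
        let max_abs = row.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
        // An all-zero row gets scale 1.0 to avoid dividing by zero.
        let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
        scales.push(scale);
        weights.extend(
            row.iter()
                .map(|&v| (v / scale).round().clamp(-127.0, 127.0) as i8),
        );
    }
    QuantLinearI8 { out, inp, weights, scales }
}

/// y[o] = scale[o] * sum_i q[o][i] * x[i], with f32 activations (no bias term).
fn linear_i8(layer: &QuantLinearI8, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), layer.inp, "input length must match layer width");
    debug_assert_eq!(layer.scales.len(), layer.out);
    layer
        .weights
        .chunks(layer.inp)
        .zip(&layer.scales)
        .map(|(row, &s)| {
            let acc: f32 = row.iter().zip(x).map(|(&q, &xi)| q as f32 * xi).sum();
            s * acc
        })
        .collect()
}
```

Each weight shrinks from four bytes to one, at the cost of a rounding error of at most half a scale step per weight.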

## Building

```bash
# Native — tests, benchmarks, examples.
cargo test --workspace
cargo run --example load_safetensors

# WASM bundle (browser, ES modules).
wasm-pack build --release --target web --no-opt \
    --out-dir demo/pkg --out-name wasmicro --features wasm
wasm-opt --enable-bulk-memory --enable-nontrapping-float-to-int -Oz \
    demo/pkg/wasmicro_bg.wasm -o demo/pkg/wasmicro_bg.wasm

# Repeatable size report for the WASM bundle and npm dry-run package.
powershell -ExecutionPolicy Bypass -File tools/measure-size.ps1

# Serve the demo locally
cd demo && python -m http.server 8080
```

## Demo

A live demo is built and deployed automatically by GitHub Actions on every
push to `main`. The workflow is at `.github/workflows/pages.yml`.

To enable Pages on your fork:

1. **Settings → Pages → Build and deployment → Source: GitHub Actions.**
2. Push to `main`. The `pages` workflow builds the WASM bundle, runs
   `wasm-opt -Oz`, and publishes `demo/` to Pages.

## Project layout

```
wasmicro/
├── src/                       # the library
│   ├── lib.rs
│   ├── tensor.rs              # owned f32 tensor + inline shape
│   ├── tokenizer.rs           # minimal WordPiece tokenizer
│   ├── quant.rs               # weight-only quantized storage types
│   ├── loader.rs              # safetensors parser (no serde)
│   ├── error.rs
│   ├── ops/                   # forward ops: matmul, attention, layernorm, ...
│   ├── models/
│   │   └── bert.rs            # BertModel + forward + from_safetensors
│   └── wasm.rs                # wasm-bindgen surface (feature = "wasm")
├── tools/
│   ├── wasmicro-convert/      # CLI to download, validate & quantize HF models
│   └── measure-size.ps1       # WASM/npm size report
├── tests/                     # integration tests via the public API
├── examples/                  # runnable demos
├── demo/                      # static site for GitHub Pages
└── .github/workflows/         # CI + Pages deploy
```
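
As a rough picture of what "owned f32 tensor + inline shape" and "forward ops" can look like together, here is a minimal sketch. It is not the code in `tensor.rs` or `ops/`: the `Tensor` layout, the `MAX_RANK` limit, and the function bodies are assumptions made for illustration.

```rust
/// Maximum rank stored inline; a hypothetical limit chosen for this sketch.
const MAX_RANK: usize = 4;

/// Owned f32 tensor: data lives in a Vec<f32>, the shape in a fixed-size array,
/// so the shape needs no separate heap allocation.
struct Tensor {
    data: Vec<f32>,
    dims: [usize; MAX_RANK],
    rank: usize,
}

impl Tensor {
    fn new(data: Vec<f32>, shape: &[usize]) -> Tensor {
        assert!(shape.len() <= MAX_RANK, "rank too large for inline shape");
        assert_eq!(data.len(), shape.iter().product::<usize>(), "shape/data mismatch");
        let mut dims = [1; MAX_RANK];
        dims[..shape.len()].copy_from_slice(shape);
        Tensor { data, dims, rank: shape.len() }
    }

    fn shape(&self) -> &[usize] {
        &self.dims[..self.rank]
    }
}

/// ReLU as a free function: borrows its input, returns a new owned tensor.
fn relu(x: &Tensor) -> Tensor {
    Tensor::new(x.data.iter().map(|&v| v.max(0.0)).collect(), x.shape())
}

/// Numerically stable softmax over the last dimension.
fn softmax(x: &Tensor) -> Tensor {
    let last = *x.shape().last().expect("softmax needs rank >= 1");
    assert!(last > 0, "last dimension must be non-empty");
    let mut out = x.data.clone();
    for row in out.chunks_mut(last) {
        // Subtract the row maximum before exponentiating to avoid overflow.
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0;
        for v in row.iter_mut() {
            *v = (*v - max).exp();
            sum += *v;
        }
        for v in row.iter_mut() {
            *v /= sum;
        }
    }
    Tensor::new(out, x.shape())
}
```

Because every op borrows its input and returns a fresh owned tensor, there is no shared mutable state and nothing for `Rc` or `RefCell` to manage.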

## Design rules

These are non-negotiable. Code that breaks them gets reverted.

1. **Tiny WASM bundle.** Target: < 250 KB after `wasm-opt -Oz`.
2. **Forward only.** No autograd, no optimizers, no training.
3. **Owned tensors.** `Vec<f32>`, no `Rc`, no `RefCell`.
4. **No heavy dependencies.** The library's default build pulls in only
   `bytemuck`. No `ndarray`, `candle`, `rayon`, `serde_json`, `chrono`.
   (The `wasmicro-convert` CLI is a separate crate — it can have any deps
   it likes.)
5. **The host owns bytes.** `ModelFile::parse(&[u8])` works for files,
   fetches, `mmap`, `ArrayBuffer` — all the same to us.
6. **Ops are free functions.** Layers are functions, not objects.
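
Rule 5 is practical because the safetensors container is simple: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes. The sketch below splits a borrowed buffer along those lines without copying. It is not wasmicro's `ModelFile::parse`; `split_safetensors` and `f32s_le` are hypothetical names, and parsing the JSON header itself is left out.

```rust
/// Split a safetensors buffer into its JSON header text and the raw tensor bytes.
/// Both results borrow from `bytes`; nothing is copied.
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(prefix);
    // Compare as u64 so a huge length cannot wrap on 32-bit targets such as wasm32.
    if header_len > (bytes.len() - 8) as u64 {
        return Err("header length exceeds buffer size".to_string());
    }
    let end = 8 + header_len as usize;
    let header = std::str::from_utf8(&bytes[8..end]).map_err(|e| e.to_string())?;
    Ok((header, &bytes[end..]))
}

/// Decode little-endian f32 values from a byte slice (trailing bytes are ignored).
fn f32s_le(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

The same function accepts bytes read with `fs::read`, a memory-mapped file, or a buffer handed over from JavaScript, since all of them arrive as `&[u8]`.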

## Roadmap

- [x] Project skeleton
- [x] Plain tensor + shape
- [x] Forward ops: matmul, linear, embedding, softmax, layernorm, GELU/SiLU/ReLU
- [x] safetensors loader with no `serde`
- [x] Multi-head attention + mean pooling
- [x] BERT encoder forward + `from_safetensors`
- [x] HuggingFace → wasmicro converter CLI
- [x] CI + GitHub Pages deploy workflow
- [x] WASM demo page
- [x] WordPiece tokenizer from external `vocab.txt`
- [x] End-to-end semantic-search demo: text -> WordPiece -> BERT embeddings -> cosine ranking
- [x] Weight-only quantized linear ops: `i8`, affine `u8`, packed `q4`
- [x] Quantized BERT linear loading for `i8` and affine `u8/q8`
- [x] Repeatable WASM/npm size measurement script
- [x] Converter quantization pipeline for BERT linear weights
- [ ] Real `all-MiniLM-L6-v2` semantic-search demo
- [ ] WASM SIMD128 paths
- [ ] GPT-2 + KV-cache
- [ ] WebGPU backend

## License

MIT OR Apache-2.0