ruvllm-wasm
WASM bindings for browser-based LLM inference with WebGPU acceleration, SIMD optimizations, and intelligent routing.
Features
- WebGPU Acceleration - 10-50x faster inference with GPU compute shaders
- SIMD Optimizations - Vectorized operations for CPU fallback
- Web Workers - Parallel inference without blocking the main thread
- GGUF Support - Load quantized models (Q4, Q5, Q8) for efficient browser inference
- Streaming Tokens - Real-time token generation for responsive UX
- Intelligent Routing - HNSW Router, MicroLoRA, SONA for optimized inference
Installation
Add to your Cargo.toml:
[dependencies]
ruvllm-wasm = "2.0"
Or build for WASM:
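A typical wasm-pack invocation for a browser target looks like this (wasm-pack and the web target are assumptions; adjust the flags to your pipeline):

wasm-pack build --target web --release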
Quick Start
use ruvllm_wasm::{GenerationConfig, RuvLLMWasm};

// Initialize with WebGPU (if available)
let llm = RuvLLMWasm::new().await?;

// Load a GGUF model (URL is a placeholder)
llm.load_model_from_url("https://example.com/tinyllama-1.1b-q4.gguf").await?;

// Generate text (argument order is illustrative)
let config = GenerationConfig::default();
let result = llm.generate("Explain WebAssembly in one sentence.", &config).await?;
println!("{result}");
JavaScript Usage
import init, { RuvLLMWasm } from 'ruvllm-wasm';

await init();

// Create instance with WebGPU (names mirror the Rust API above; check the generated bindings)
const llm = await RuvLLMWasm.new();

// Load model (URL is a placeholder)
await llm.load_model_from_url('https://example.com/tinyllama-1.1b-q4.gguf');

// Generate with streaming (streaming method name and callback shape are illustrative)
await llm.generate_stream('Explain WebAssembly in one sentence.', (token) => {
  console.log(token);
});
Feature Flags
WebGPU Acceleration
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["webgpu"] }
Enables GPU-accelerated inference using WebGPU compute shaders:
- Matrix multiplication kernels
- Attention computation
- 10-50x speedup on supported browsers
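Because WebGPU is not yet available in every browser, it is worth feature-detecting before enabling the GPU backend. The sketch below uses only the standard navigator.gpu API; how the result is wired into this crate's constructor is left to its options:

// Detect WebGPU support; navigator.gpu is the standard WebGPU entry point
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

// Decide on a backend before creating the instance; SIMD/CPU remains the fallback
console.log((await hasWebGPU()) ? 'Using WebGPU backend' : 'Falling back to SIMD/CPU');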
Parallel Inference
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["parallel"] }
Run inference in Web Workers:
- Non-blocking main thread
- Multiple concurrent requests
- Automatic worker pool management
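A minimal sketch of issuing concurrent requests from the main thread, assuming a promise-returning generate(prompt) method analogous to the Rust generate above; the worker pool itself is managed by the crate:

// Fire several prompts at once; with the parallel feature they run in Web Workers
const prompts = ['Summarize WebGPU.', 'What is GGUF?', 'Define SIMD.'];
const answers = await Promise.all(prompts.map((p) => llm.generate(p)));
answers.forEach((a, i) => console.log(prompts[i], '→', a));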
SIMD Optimizations
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["simd"] }
Requires building with the WASM SIMD target feature enabled:
RUSTFLAGS="-C target-feature=+simd128"
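For example, combined with a wasm-pack build (both the flag and the command are standard tooling; the exact invocation for this crate may differ):

RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web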
Intelligent Features
[dependencies]
ruvllm-wasm = { version = "2.0", features = ["intelligent"] }
Enables advanced AI features:
- HNSW Router - Semantic routing for multi-model deployments
- MicroLoRA - Lightweight adapter injection
- SONA Instant - Self-optimizing neural adaptation
Browser Requirements
| Feature | Required | Benefit |
|---|---|---|
| WebAssembly | Yes | Core execution |
| WebGPU | No (recommended) | 10-50x faster |
| SharedArrayBuffer | No | Multi-threading |
| SIMD | No | 2-4x faster math |
Enable SharedArrayBuffer
Add these headers to your server:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
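For local development, a small Node.js static server that sets both headers is enough to get cross-origin isolation (a sketch only; any server or framework works as long as the two headers are present):

import http from 'node:http';
import { readFile } from 'node:fs/promises';
import { extname } from 'node:path';

const MIME = { '.html': 'text/html', '.js': 'text/javascript', '.wasm': 'application/wasm', '.gguf': 'application/octet-stream' };

http.createServer(async (req, res) => {
  // Cross-origin isolation headers required for SharedArrayBuffer
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  // Dev-only file serving; no path sanitization
  const path = req.url === '/' ? '/index.html' : req.url;
  try {
    const body = await readFile(`.${path}`);
    res.setHeader('Content-Type', MIME[extname(path)] ?? 'application/octet-stream');
    res.end(body);
  } catch {
    res.writeHead(404).end('not found');
  }
}).listen(8080);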
Recommended Models
| Model | Size | Use Case |
|---|---|---|
| TinyLlama-1.1B-Q4 | ~700 MB | General chat |
| Phi-2-Q4 | ~1.6 GB | Code, reasoning |
| Qwen2-0.5B-Q4 | ~400 MB | Fast responses |
| StableLM-Zephyr-3B-Q4 | ~2 GB | Quality chat |
API Reference
- RuvLLMWasm - the main entry point: RuvLLMWasm::new(), load_model_from_url(), and generate() (see Quick Start above)
Related Packages
- ruvllm - Core LLM runtime
- ruvllm-cli - CLI for model inference
- @ruvector/ruvllm-wasm - npm package
License
MIT OR Apache-2.0
Part of the RuVector ecosystem - High-performance vector database with self-learning capabilities.