ruvllm-wasm 2.0.0

WASM bindings for browser-based LLM inference with WebGPU acceleration, SIMD optimizations, and intelligent routing.

Features

  • WebGPU Acceleration - 10-50x faster inference with GPU compute shaders
  • SIMD Optimizations - Vectorized operations for CPU fallback
  • Web Workers - Parallel inference without blocking the main thread
  • GGUF Support - Load quantized models (Q4, Q5, Q8) for efficient browser inference
  • Streaming Tokens - Real-time token generation for responsive UX
  • Intelligent Routing - HNSW Router, MicroLoRA, SONA for optimized inference

Installation

Add to your Cargo.toml:

[dependencies]
ruvllm-wasm = "2.0"

Or build for WASM:

wasm-pack build --target web --release
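
If wasm-pack isn't installed yet:

cargo install wasm-pack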

Quick Start

use ruvllm_wasm::{RuvLLMWasm, GenerationConfig};

// Initialize with WebGPU (if available)
let llm = RuvLLMWasm::new(true).await?;

// Load a GGUF model
llm.load_model_from_url("https://example.com/model.gguf").await?;

// Generate text
let config = GenerationConfig {
    max_tokens: 100,
    temperature: 0.7,
    top_p: 0.9,
    ..Default::default()
};

let result = llm.generate("What is the capital of France?", &config).await?;
println!("{}", result.text);

JavaScript Usage

import init, { RuvLLMWasm } from 'ruvllm-wasm';

await init();

// Create instance with WebGPU
const llm = await RuvLLMWasm.new(true);

// Load model
await llm.load_model_from_url('https://example.com/model.gguf', (loaded, total) => {
  console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
});

// Generate with streaming
let story = '';
await llm.generate_stream('Tell me a story', {
  max_tokens: 200,
  temperature: 0.8,
}, (token) => {
  story += token;  // append each streamed token (browser-safe, unlike process.stdout)
});
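
On the Rust side, generate_stream takes the callback as a js_sys::Function; here is a sketch of wrapping a Rust closure for it (assumes the js-sys and web-sys crates, plus the llm and config values from the Quick Start):

use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;

// Wrap a Rust closure as the js_sys::Function that generate_stream expects.
// Keep `on_token` alive until the awaited call completes.
let on_token = Closure::<dyn Fn(JsValue)>::new(|token: JsValue| {
    web_sys::console::log_1(&token); // handle each streamed token
});
let callback = on_token.as_ref().unchecked_ref::<js_sys::Function>().clone();
let result = llm.generate_stream("Tell me a story", &config, callback).await?;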

Cargo Features

WebGPU Acceleration

[dependencies]
ruvllm-wasm = { version = "2.0", features = ["webgpu"] }

Enables GPU-accelerated inference using WebGPU compute shaders:

  • Matrix multiplication kernels
  • Attention computation
  • 10-50x speedup on supported browsers
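
If WebGPU initialization fails (for example, no adapter is found), falling back to the CPU path is straightforward; a sketch, assuming new(true) returns an error in that case:

use ruvllm_wasm::RuvLLMWasm;

// Prefer the GPU backend, but fall back to CPU (SIMD) if it is unavailable.
let llm = match RuvLLMWasm::new(true).await {
    Ok(llm) => llm,
    Err(_) => RuvLLMWasm::new(false).await?,
};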

Parallel Inference

[dependencies]
ruvllm-wasm = { version = "2.0", features = ["parallel"] }

Run inference in Web Workers:

  • Non-blocking main thread
  • Multiple concurrent requests
  • Automatic worker pool management
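
A sketch of two concurrent requests (assumes the futures crate, and that a loaded instance accepts overlapping generate calls):

use ruvllm_wasm::GenerationConfig;

// Poll both generations together; workers keep the main thread responsive.
let config = GenerationConfig { max_tokens: 50, ..Default::default() };
let (a, b) = futures::future::join(
    llm.generate("Summarize WebGPU in one line", &config),
    llm.generate("Summarize SIMD in one line", &config),
).await;
let (a, b) = (a?, b?);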

SIMD Optimizations

[dependencies]
ruvllm-wasm = { version = "2.0", features = ["simd"] }

Requires building with the simd128 target feature enabled:

RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web
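
Equivalently, the flag can be set once in .cargo/config.toml so every build picks it up:

[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]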

Intelligent Features

[dependencies]
ruvllm-wasm = { version = "2.0", features = ["intelligent"] }

Enables advanced AI features:

  • HNSW Router - Semantic routing for multi-model deployments
  • MicroLoRA - Lightweight adapter injection
  • SONA Instant - Self-optimizing neural adaptation

Browser Requirements

Feature             Required           Benefit
WebAssembly         Yes                Core execution
WebGPU              No (recommended)   10-50x faster
SharedArrayBuffer   No                 Multi-threading
SIMD                No                 2-4x faster math
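
These can also be probed at runtime through the documented helpers; a sketch that just logs the results (assumes the web-sys crate, and that BrowserCapabilities and WebGPUStatus implement Debug):

use ruvllm_wasm::RuvLLMWasm;

// Inspect the environment before deciding how to initialize.
let caps = RuvLLMWasm::get_capabilities().await;
let gpu = RuvLLMWasm::check_webgpu().await;
web_sys::console::log_1(&format!("{:?} / {:?}", caps, gpu).into());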

Enable SharedArrayBuffer

SharedArrayBuffer is only exposed on cross-origin isolated pages, so serve your app with these headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Recommended Models

Model                   Size      Use Case
TinyLlama-1.1B-Q4       ~700 MB   General chat
Phi-2-Q4                ~1.6 GB   Code, reasoning
Qwen2-0.5B-Q4           ~400 MB   Fast responses
StableLM-Zephyr-3B-Q4   ~2 GB     Quality chat

API Reference

RuvLLMWasm

impl RuvLLMWasm {
    /// Create a new instance
    pub async fn new(use_webgpu: bool) -> Result<Self, JsValue>;

    /// Load model from URL
    pub async fn load_model_from_url(&self, url: &str) -> Result<(), JsValue>;

    /// Load model from bytes
    pub async fn load_model_from_bytes(&self, bytes: &[u8]) -> Result<(), JsValue>;

    /// Generate text completion
    pub async fn generate(&self, prompt: &str, config: &GenerationConfig) -> Result<GenerationResult, JsValue>;

    /// Generate with streaming callback
    pub async fn generate_stream(&self, prompt: &str, config: &GenerationConfig, callback: js_sys::Function) -> Result<GenerationResult, JsValue>;

    /// Check WebGPU availability
    pub async fn check_webgpu() -> WebGPUStatus;

    /// Get browser capabilities
    pub async fn get_capabilities() -> BrowserCapabilities;

    /// Unload model and free memory
    pub fn unload(&self);
}
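
unload makes it possible to swap models without recreating the instance; a sketch (URLs are illustrative, and it assumes a new model can be loaded after unload):

// Generate with one model, then release it and load another.
llm.load_model_from_url("https://example.com/tinyllama-1.1b-q4.gguf").await?;
let first = llm.generate("Hi", &config).await?;

llm.unload(); // free the first model's memory
llm.load_model_from_url("https://example.com/phi-2-q4.gguf").await?;
let second = llm.generate("Hi", &config).await?;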

Related Packages

Part of the RuVector ecosystem - a high-performance vector database with self-learning capabilities.

License

MIT OR Apache-2.0

