Overview
whisper.apr is a pure Rust implementation of OpenAI's Whisper speech recognition model, engineered from the ground up for WebAssembly (WASM) deployment. It features a custom .apr model format optimized for browser streaming, SIMD acceleration via trueno, and int4/int8 quantization for efficient edge inference. It also supports Moonshine ASR models and direct GGUF model loading.
Key Differentiators
| Feature | whisper.apr | whisper.cpp | whisper-web |
|---|---|---|---|
| Pure Rust | Yes | C++ | JavaScript |
| WASM-First | Yes | Ported | Native |
| Int4 Quantization | Yes | Int8 only | No |
| Streaming Inference | Yes | Batch only | Limited |
| Zero-Copy Loading | Yes | No | No |
| Custom Format (.apr) | Yes | GGML | ONNX |
| GGUF Loading | Yes | Native | No |
| Moonshine Support | Yes | No | No |
| Browser-Native | Yes | Emscripten | Yes |
Table of Contents
- Features
- Usage
- Installation
- Architecture
- Model Format
- Performance
- API Reference
- CLI
- Demo Applications
- Running Examples
- Development
- Quality Metrics
- Contributing
- Roadmap
- License
Features
Core Capabilities
- Full Whisper Implementation: Encoder-decoder transformer with multi-head attention
- Moonshine ASR: Lightweight alternative with GQA decoder and ConvStem encoder
- Multi-Language Support: 99 languages with automatic language detection
- Streaming Transcription: Real-time audio processing with chunked inference
- Translation Mode: Speech-to-English translation for all supported languages
- Multi-Format Audio: MP3, FLAC, OGG, AAC, M4A, WAV via symphonia
Optimization Features
- WASM SIMD: Hardware-accelerated vector operations in browser
- Int4/Int8 Quantization: 4x-8x model size reduction with minimal accuracy loss
- Mixed-Precision Inference: Int4 weights with FP32 activations
- KV-Cache Optimization: Efficient autoregressive decoding
- Tiled MatVec: 3.5x single-token decoding speedup
- Memory Pooling: Zero-allocation inference after warmup
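The int4/int8 quantization bullet above can be sketched in a few lines: symmetric int8 quantization stores each weight as an `i8` plus one `f32` scale, which is where the ~4x size reduction over f32 comes from. This is a minimal illustration, not whisper.apr's actual quantizer (which also handles int4 packing and per-block scales):

```rust
// Minimal symmetric int8 quantization sketch: one f32 scale per tensor.
// Illustrative only; the real quantizer uses per-block scales and int4 packing.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights.iter().map(|w| (w / scale).round() as i8).collect();
    (q, scale)
}

fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.5f32, -1.0, 0.25, 0.0];
    let (q, scale) = quantize_int8(&w);
    let back = dequantize_int8(&q, scale);
    for (orig, rec) in w.iter().zip(&back) {
        // Round-trip error is bounded by one quantization step.
        assert!((orig - rec).abs() <= scale);
    }
    println!("scale = {scale}, quantized = {q:?}");
}
```

The "minimal accuracy loss" claim rests on this bound: the reconstruction error per weight never exceeds one quantization step (`max_abs / 127`).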
Model Support
| Model | Parameters | Type | .apr Size (Int8) | Notes |
|---|---|---|---|---|
| tiny | 39M | Whisper | 39 MB | Fastest, English-focused |
| base | 74M | Whisper | 74 MB | Good balance |
| small | 244M | Whisper | 244 MB | High accuracy |
| large-v3-turbo | 809M | Whisper | ~800 MB | 32 enc + 4 dec layers |
| large | 1.5B | Whisper | 1.5 GB | Highest accuracy |
| moonshine-tiny | 27M | Moonshine | 27 MB | Ultra-lightweight |
| moonshine-base | 61M | Moonshine | 61 MB | Lightweight alternative |
Model Formats
| Format | Support | Notes |
|---|---|---|
| .apr | Native | Optimized for WASM streaming |
| .gguf | Direct load | Pre-quantized from HuggingFace |
| SafeTensors | Convert to .apr | Via built-in converter |
Usage
CLI Transcription
The commands below are illustrative; subcommand names match the CLI feature list, but flags and model identifiers are assumed and may differ from the released tool.

```bash
# NOTE: flags and model names below are assumed, not verified against the CLI.
# Install
cargo install whisper-apr
# Transcribe audio (auto-downloads model)
whisper-apr transcribe audio.wav
# Use specific model
whisper-apr transcribe --model base audio.wav
# Boost domain vocabulary (safe with all model sizes)
whisper-apr transcribe --boost "domain,terms" audio.wav
# Use Moonshine model
whisper-apr transcribe --model moonshine-tiny audio.wav
# Load GGUF model directly
whisper-apr transcribe --model model.gguf audio.wav
# Transcribe MP3/FLAC/OGG/M4A (auto-detected)
whisper-apr transcribe audio.mp3
```
Browser (WASM)
Rust Library
The original snippet did not survive extraction; the sketch below assumes type and method names that mirror the TypeScript bindings in the API Reference, and may differ from the crate's actual API:

```rust
use whisper_apr::{WhisperModel, TranscribeOptions}; // paths assumed

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = WhisperModel::load("whisper-tiny.apr")?;           // API assumed
    let audio: Vec<f32> = vec![0.0; 16_000];                       // 1 s of silence @ 16 kHz
    let result = model.transcribe(&audio, TranscribeOptions::default())?;
    println!("{}", result.text);
    Ok(())
}
```
Streaming Transcription
This example was garbled in extraction; the sketch below is reconstructed from the surviving identifiers (`StreamingConfig`, `finalize`) and assumes a loaded `model` and an `audio_source` yielding `Vec<f32>` chunks. Names other than those are assumptions:

```rust
use whisper_apr::streaming::{StreamingConfig, StreamingProcessor}; // paths assumed

let config = StreamingConfig::default();
let mut processor = StreamingProcessor::new(model, config)?;       // constructor assumed

// Feed audio chunks as they arrive
while let Some(chunk) = audio_source.next_chunk() {
    if let Some(partial) = processor.process_chunk(&chunk)? {       // method name assumed
        println!("partial: {}", partial.text);
    }
}

let final_result = processor.finalize()?;
println!("{}", final_result.text);
```
Installation
Prerequisites
- Rust 1.75+ with the `wasm32-unknown-unknown` target
- `wasm-pack` (for WASM builds)
From crates.io
```bash
# Library dependency
cargo add whisper-apr
# CLI tool
cargo install whisper-apr
```
Building from Source
Typical workflow (the repository URL was elided; exact build flags may differ):

```bash
# Clone the repository
git clone <repository-url> && cd whisper.apr
# Build native (for testing)
cargo build --release
# Build WASM
wasm-pack build --target web
# Run tests
cargo test
```
Model Conversion
Convert existing Whisper models to .apr format:
```bash
# From HuggingFace SafeTensors (auto-downloads); subcommand and flags assumed
whisper-apr convert openai/whisper-tiny -o tiny.apr
# Or load GGUF models directly (no conversion needed); flags assumed
whisper-apr transcribe --model model.gguf audio.wav
```
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                         whisper.apr                             │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │    Audio     │   │   Encoder    │   │   Decoder    │         │
│  │  Processing  │──│  Transformer │──│  Transformer │──► Text   │
│  │              │   │              │   │              │         │
│  │ • Resampling │   │ • Self-Attn  │   │ • Self-Attn  │         │
│  │ • Mel Spec   │   │ • FFN        │   │ • Cross-Attn │         │
│  │ • Symphonia  │   │ • LayerNorm  │   │ • FFN / GQA  │         │
│  └──────────────┘   └──────────────┘   └──────────────┘         │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │  Tokenizer   │   │ Quantization │   │    SIMD      │         │
│  │              │   │              │   │  (trueno)    │         │
│  │ • BPE        │   │ • Int4/Int8  │   │              │         │
│  │ • 51,865 tok │   │ • Mixed Prec │   │ • MatMul     │         │
│  │ • Multi-lang │   │ • GGUF Q4-Q6 │   │ • Softmax    │         │
│  └──────────────┘   └──────────────┘   └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘
```
Supported Model Architectures
| Architecture | Models | Decoder | Key Difference |
|---|---|---|---|
| Whisper | tiny, base, small, medium, large, large-v3-turbo | MHA + Cross-Attention | Standard encoder-decoder |
| Moonshine | moonshine-tiny, moonshine-base | GQA + Cross-Attention | ConvStem encoder, lighter |
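The GQA column above is the key structural difference: grouped-query attention shares each key/value head across a group of query heads, shrinking the KV cache. The head-to-group mapping can be sketched as follows (head counts here are illustrative, not Moonshine's actual configuration):

```rust
// Grouped-query attention (GQA): each query head reads from a shared KV head.
// Head counts below are illustrative, not Moonshine's actual configuration.
fn kv_head_for(query_head: usize, n_query_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_query_heads % n_kv_heads == 0);
    query_head / (n_query_heads / n_kv_heads)
}

fn main() {
    // 8 query heads sharing 2 KV heads: heads 0-3 -> KV 0, heads 4-7 -> KV 1.
    let mapping: Vec<usize> = (0..8).map(|q| kv_head_for(q, 8, 2)).collect();
    assert_eq!(mapping, vec![0, 0, 0, 0, 1, 1, 1, 1]);
    println!("{mapping:?}");
}
```

With 8 query heads and 2 KV heads, the KV cache is 4x smaller than standard multi-head attention for the same model width.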
Module Overview
| Module | Description |
|---|---|
| `audio/` | Mel spectrogram, resampling, symphonia decoding, streaming |
| `model/` | Whisper encoder/decoder, attention, quantization |
| `model/lfm2/` | Moonshine GQA decoder with RoPE |
| `wasm/` | JavaScript bindings, Web Worker support |
| `format/` | .apr format, GGUF loader, compression, streaming load |
| `inference/` | Greedy/beam search decoding, KV cache |
| `tokenizer/` | BPE tokenizer, vocabulary, 99-language support |
| `detection/` | Automatic language detection |
| `cli/` | Command-line interface and model management |
Model Format
.apr Format
The .apr (Aprender) format is optimized for streaming and browser deployment:
```
┌────────────────────────────────────────┐
│            APR File Structure          │
├────────────────────────────────────────┤
│ Magic: "APR\0" (4 bytes)               │
│ Version: u32 (4 bytes)                 │
│ Header Size: u32 (4 bytes)             │
├────────────────────────────────────────┤
│ Model Config (JSON, compressed)        │
│ • n_vocab, n_audio_ctx, n_audio_state  │
│ • n_audio_head, n_audio_layer          │
│ • n_text_ctx, n_text_state, ...        │
├────────────────────────────────────────┤
│ Vocabulary (BPE tokens, compressed)    │
├────────────────────────────────────────┤
│ Tensor Blocks (streaming-ready)        │
│ • Block header (name, shape, dtype)    │
│ • Compressed tensor data (zstd)        │
│ • Quantization scales (if int4/int8)   │
└────────────────────────────────────────┘
```
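Reading the fixed preamble in the diagram above is straightforward; a sketch, assuming little-endian integers (the diagram does not state endianness):

```rust
// Parse the fixed .apr preamble: magic "APR\0", version u32, header size u32.
// Little-endian byte order is an assumption; the spec above does not say.
fn parse_apr_preamble(bytes: &[u8]) -> Result<(u32, u32), String> {
    if bytes.len() < 12 {
        return Err("file too short".into());
    }
    if &bytes[0..4] != b"APR\0" {
        return Err("bad magic".into());
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    let header_size = u32::from_le_bytes(bytes[8..12].try_into().unwrap());
    Ok((version, header_size))
}

fn main() {
    // Build a minimal valid preamble in memory and round-trip it.
    let mut buf = b"APR\0".to_vec();
    buf.extend_from_slice(&1u32.to_le_bytes());
    buf.extend_from_slice(&256u32.to_le_bytes());
    let (version, header_size) = parse_apr_preamble(&buf).unwrap();
    assert_eq!((version, header_size), (1, 256));
    println!("version {version}, header {header_size} bytes");
}
```

Because the header size is in the first 12 bytes, a streaming loader can begin fetching tensor blocks before the whole file arrives, which is the point of the format.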
GGUF Format (Direct Loading)
whisper.apr can load pre-quantized GGUF models from HuggingFace directly:
- Automatic tensor name remapping (whisper.cpp names to internal names)
- Model config inference from tensor shapes
- Supports Q4_0 through Q6_K, F16, and F32 quantization levels
- No conversion step needed
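The name-remapping bullet can be illustrated with a small string rewrite; both naming schemes below are hypothetical stand-ins, since the actual whisper.cpp-to-internal name pairs used by whisper.apr are not documented here:

```rust
// Hypothetical tensor-name remapping from whisper.cpp-style GGUF names to
// internal names. The concrete name pairs in whisper.apr may differ.
fn remap_tensor_name(gguf_name: &str) -> String {
    gguf_name
        .replace("blk.", "blocks.")
        .replace("attn_q", "attn.query")
        .replace("attn_k", "attn.key")
        .replace("attn_v", "attn.value")
}

fn main() {
    let internal = remap_tensor_name("blk.0.attn_q.weight");
    assert_eq!(internal, "blocks.0.attn.query.weight");
    println!("{internal}");
}
```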
Format Comparison
| Feature | .apr | GGUF | SafeTensors |
|---|---|---|---|
| Streaming load | Yes | No | No |
| Browser-optimized | Yes | No | No |
| Pre-quantized | Yes | Yes | No |
| Direct loading | Yes | Yes | Convert needed |
| Compression | Zstd | None | None |
Performance
Runtime Benchmarks (whisper-tiny on 30s audio)
| Platform | Time | Memory | RTF |
|---|---|---|---|
| Native (M1 Mac) | 9.2s | 180 MB | 0.31x |
| Native (x86 AVX2) | 12.1s | 180 MB | 0.40x |
| WASM (Chrome) | 18.5s | 220 MB | 0.62x |
| WASM (Firefox) | 21.3s | 225 MB | 0.71x |
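The RTF column above is processing time divided by audio duration, so values below 1.0 mean faster than real time (9.2 s / 30 s ≈ 0.31):

```rust
// Real-time factor: processing time over audio duration.
// Below 1.0 means the model transcribes faster than the audio plays.
fn rtf(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

fn main() {
    let r = rtf(9.2, 30.0); // the M1 Mac row in the table above
    assert!((r - 0.3067).abs() < 1e-3);
    println!("RTF = {r:.2}");
}
```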
Key Optimizations
- Tiled MatVec: 3.5x speedup for single-token decoding via fast path in matmul_raw
- SIMD Vectorization: 4x speedup on supported operations via trueno
- KV-Cache Reuse: 60% reduction in decoder compute
- Quantized MatMul: Int4 compute with FP32 accumulation
- Memory Pooling: Eliminates allocation overhead after warmup
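The tiled-matvec idea from the list above can be sketched in plain f32: rows are processed in tiles so each tile's weights stay cache-resident during single-token decoding. This is a structural illustration only; the tile size and the int4 fast path in `matmul_raw` are not reproduced here:

```rust
// Minimal tiled matrix-vector product. Rows are walked tile by tile for
// cache locality; the tile size here is illustrative, not whisper.apr's.
fn tiled_matvec(mat: &[f32], x: &[f32], rows: usize, cols: usize, tile: usize) -> Vec<f32> {
    assert_eq!(mat.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let mut out = vec![0.0f32; rows];
    for tile_start in (0..rows).step_by(tile) {
        let tile_end = (tile_start + tile).min(rows);
        for r in tile_start..tile_end {
            let row = &mat[r * cols..(r + 1) * cols];
            out[r] = row.iter().zip(x).map(|(a, b)| a * b).sum();
        }
    }
    out
}

fn main() {
    // 2x3 matrix (row-major) times a length-3 vector.
    let mat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let x = [1.0, 0.0, -1.0];
    let y = tiled_matvec(&mat, &x, 2, 3, 1);
    assert_eq!(y, vec![-2.0, -2.0]);
    println!("{y:?}");
}
```

Single-token decoding is matvec-bound (the query is one row), which is why this path rather than the general matmul dominates decoder latency.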
Performance Targets
| Model | Target RTF | Memory Peak |
|---|---|---|
| tiny | 2.0x | 150 MB |
| base | 2.5x | 350 MB |
| small | 4.0x | 800 MB |
API Reference
Core Types
The struct bodies were not preserved here; the sketch below assumes names and fields that mirror the TypeScript bindings in the next subsection, and may differ from the crate's actual types:

```rust
/// Main model interface
pub struct WhisperModel { /* encoder, decoder, tokenizer, KV cache */ }

/// Transcription options
pub struct TranscribeOptions {
    pub language: Option<String>,
    pub task: Task, // Transcribe or Translate
    pub beam_size: usize,
    pub temperature: f32,
}

/// Transcription result
pub struct TranscribeResult {
    pub text: String,
    pub segments: Vec<Segment>,
    pub language: String,
    pub language_probability: f32,
}
```
WASM Bindings
```typescript
// TypeScript definitions
export class WhisperModel {
  static load(url: string): Promise<WhisperModel>;
  transcribe(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  translate(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  detectLanguage(audio: Float32Array): Promise<DetectedLanguage>;
  free(): void;
}

export interface TranscribeOptions {
  language?: string;
  task?: 'transcribe' | 'translate';
  beamSize?: number;
  temperature?: number;
}

export interface TranscribeResult {
  text: string;
  segments: Segment[];
  language: string;
  languageProbability: number;
}
```
CLI
The whisper-apr CLI provides transcription and debugging commands:
The subcommand names below come from the feature list (transcribe, probe, parity, config-check, selftest); arguments and flags are assumed:

```bash
# Install CLI
cargo install whisper-apr
# Transcribe audio (arguments assumed)
whisper-apr transcribe audio.wav
# Boost domain-specific vocabulary during decoding (flag assumed)
whisper-apr transcribe --boost "domain,terms" audio.wav
# Probe model internals (forward-pass debugging; arguments assumed)
whisper-apr probe model.apr
# Check model configuration (arguments assumed)
whisper-apr config-check model.apr
# Run parity checks against reference implementations (arguments assumed)
whisper-apr parity model.apr
# Verify installation
whisper-apr selftest
```
Supported Audio Formats
WAV, MP3, FLAC, OGG/Vorbis, AAC, M4A, MKV/WebM (via symphonia).
Demo Applications
Zero-JavaScript demos showcasing whisper.apr capabilities. All demos are pure Rust/WASM with Probar serving (handles required COOP/COEP headers for SharedArrayBuffer):
```bash
# Serve the demos with Probar (exact command elided), then:
# Open http://localhost:8080
```
Available Demos
| Demo | Description |
|---|---|
| Real-Time Transcription | Live microphone transcription with streaming results |
| File Upload Transcription | Upload audio/video files with timeline visualization |
| Real-Time Translation | Live speech-to-English translation (99 languages) |
| File Upload Translation | Batch translation of uploaded media files |
Running Tests
Running Examples
The examples/ directory contains 100+ examples demonstrating various features:
```bash
# List all available examples
ls examples/
# Run one, e.g. basic transcription, pipeline benchmarks, TUI benchmark
# visualization, or APR vs SafeTensors format comparison
cargo run --release --example <name>
```
Development
Project Structure
```
whisper.apr/
├── src/
│   ├── lib.rs        # Library entry point
│   ├── audio/        # Audio processing (mel, resampling, symphonia)
│   ├── model/        # Whisper encoder/decoder/attention
│   │   └── lfm2/     # Moonshine GQA decoder
│   ├── tokenizer/    # BPE tokenizer (51,865 tokens)
│   ├── inference/    # Greedy/beam search, KV cache
│   ├── format/       # .apr format + GGUF loader
│   ├── detection/    # Language detection (99 languages)
│   ├── cli/          # CLI commands
│   └── wasm/         # WASM bindings
├── demos/            # Browser demo applications
├── benches/          # Criterion benchmarks
├── tests/            # Integration tests
├── book/             # mdBook documentation
└── tools/            # Standalone converter
```
Make Commands
Testing
Typical invocations (exact Make targets may differ):

```bash
# Unit tests (fast, no large models)
cargo test
# Integration tests (requires large models, feature-gated)
cargo test --features integration-tests -- --include-ignored
# WASM tests (requires wasm-pack)
wasm-pack test --headless --chrome
```

Note: Integration tests that load large models are behind the `integration-tests` feature flag. Heavy lib tests that allocate large decoders are marked `#[ignore]` and skipped by default.
Quality Metrics
whisper.apr follows Extreme TDD methodology with comprehensive quality gates.
Current Scores (v0.2.4)
| Metric | Score | Status |
|---|---|---|
| TDG (Technical Debt Grade) | 99.5/100 | A+ |
| Test Coverage | 96%+ | Above 95% target |
| Unit Tests | 2,885 | 0 failures |
| pmat Compliance | COMPLIANT | All gates passing |
| Quality Gate | PASSED | 0 violations |
| GitHub Issues | 0 open | All 15 closed |
Dependencies
| Crate | Version | Purpose |
|---|---|---|
| trueno | 0.16 | SIMD-accelerated tensor operations |
| aprender | 0.27 | .apr model format and GGUF parsing |
| realizar | 0.8 | Inference primitives (attention, quantization) |
Quality Gate Configuration
The section and key names below were lost in extraction and are reconstructed guesses; only the values and inline comments are original:

```toml
# From .pmat-metrics.toml (section and key names reconstructed; values as published)
[quality]
coverage_min = 95.0
mutation_score_min = 85.0
max_complexity = 40
tdg_grade = "A+"
max_violations = 0        # Zero tolerance

[performance]
rtf_target_tiny = 2.0     # Real-time factor target
rtf_target_base = 2.5
peak_memory_tiny_mb = 150 # Peak memory target
peak_memory_base_mb = 350
```
Contributing
Contributions are welcome! Please follow these guidelines:
Development Workflow
- Fork the repository
- Make your changes on `main`
- Run quality gates: `make lint && make test && make coverage`
- Ensure coverage remains above 95%
- Submit a pull request
Code Standards
- All code must pass `cargo clippy -- -D warnings`
- Format with `cargo fmt`
- No `unwrap()` calls - use `Result` types
- Zero TODO/FIXME/HACK comments - create tickets instead
- Document all public APIs
Testing Requirements
```bash
# Run all quality gates
make lint && make test && make coverage
```
Roadmap
v0.2.4 (Current Release)
- Full Whisper architecture (encoder-decoder transformer)
- Moonshine ASR model support (GQA decoder, ConvStem encoder)
- GGUF model loading (pre-quantized from HuggingFace)
- Large v3 Turbo model support (809M params)
- Int4/Int8 quantization with .apr format
- WASM SIMD acceleration via trueno
- Streaming transcription
- 99 language support with auto-detection
- Multi-format audio (MP3, FLAC, OGG, AAC, M4A)
- Greedy and beam search decoding
- 3.5x single-token decoding speedup (tiled_matvec)
- CLI with transcribe, probe, parity, config-check, selftest
- 2,885 tests, 96%+ coverage, TDG 99.5/100 A+
v0.3.0 (Planned)
- WebGPU acceleration
- Word-level timestamps
- Distil-Whisper model support
- Whisper v4 model support (when released)
License
Licensed under the MIT License. See LICENSE for details.