Overview
whisper.apr is a pure Rust implementation of OpenAI's Whisper speech recognition model, engineered from the ground up for WebAssembly (WASM) deployment. It features a custom .apr model format optimized for browser streaming, SIMD acceleration, and int4/int8 quantization for efficient edge inference.
Key Differentiators
| Feature | whisper.apr | whisper.cpp | whisper-web |
|---|---|---|---|
| Pure Rust | Yes | C++ | JavaScript |
| WASM-First | Yes | Ported | Native |
| Int4 Quantization | Yes | Int8 only | No |
| Streaming Inference | Yes | Batch only | Limited |
| Zero-Copy Loading | Yes | No | No |
| Custom Format (.apr) | Yes | GGML | ONNX |
| Browser-Native | Yes | Emscripten | Yes |
Table of Contents
- Features
- Usage
- Installation
- Architecture
- Model Format
- Performance
- API Reference
- Demo Applications
- Running Examples
- Development
- Quality Metrics
- Contributing
- Roadmap
- License
Features
Core Capabilities
- Full Whisper Implementation: Encoder-decoder transformer with multi-head attention
- Multi-Language Support: 99 languages with automatic language detection
- Streaming Transcription: Real-time audio processing with chunked inference
- Translation Mode: Speech-to-English translation for all supported languages
Optimization Features
- WASM SIMD: Hardware-accelerated vector operations in browser
- Int4/Int8 Quantization: 4x-8x model size reduction with minimal accuracy loss
- Mixed-Precision Inference: Int4 weights with FP32 activations
- KV-Cache Optimization: Efficient autoregressive decoding
- Memory Pooling: Zero-allocation inference after warmup
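To make the 8x figure concrete, here is a minimal, self-contained sketch of symmetric int4 quantization with two weights packed per byte. This is illustrative only, not whisper.apr's actual kernel:

```rust
// Illustrative int4 quantization sketch (not whisper.apr's actual kernel).
// Symmetric scheme: q = round(x / scale) clamped to [-8, 7], with two
// 4-bit values packed per byte — an 8x reduction versus f32 weights.

fn quantize_int4(weights: &[f32]) -> (Vec<u8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let quant = |x: f32| ((x / scale).round().clamp(-8.0, 7.0) as i8 as u8) & 0x0F;

    let mut packed = Vec::with_capacity(weights.len().div_ceil(2));
    for pair in weights.chunks(2) {
        let lo = quant(pair[0]);
        let hi = pair.get(1).map_or(0, |&x| quant(x));
        packed.push(lo | (hi << 4));
    }
    (packed, scale)
}

fn dequantize_int4(packed: &[u8], scale: f32, len: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(len);
    for &byte in packed {
        for nibble in [byte & 0x0F, byte >> 4] {
            if out.len() == len {
                break;
            }
            // Sign-extend the 4-bit value before scaling back to f32.
            let q = ((nibble << 4) as i8) >> 4;
            out.push(q as f32 * scale);
        }
    }
    out
}

fn main() {
    let weights = [1.0f32, -1.0, 0.5, 0.0];
    let (packed, scale) = quantize_int4(&weights);
    let restored = dequantize_int4(&packed, scale, weights.len());
    println!("packed {} f32 weights into {} bytes", weights.len(), packed.len());
    println!("restored: {restored:?}");
}
```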
Model Support
| Model | Parameters | .apr Size (Int4) | .apr Size (Int8) | RTF* |
|---|---|---|---|---|
| tiny | 39M | 20 MB | 39 MB | 0.3x |
| base | 74M | 37 MB | 74 MB | 0.5x |
| small | 244M | 122 MB | 244 MB | 0.8x |
| medium | 769M | 385 MB | 769 MB | 1.2x |
| large | 1.5B | 750 MB | 1.5 GB | 2.0x |
*RTF = Real-Time Factor on M1 MacBook (lower is faster)
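RTF is simply processing time divided by audio duration, so any value below 1.0 means faster than real time. A trivial helper, for reference:

```rust
// Real-time factor: processing time / audio duration (lower is faster).
// e.g. transcribing 30 s of audio in 9.2 s gives RTF ≈ 0.31.
fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

fn main() {
    println!("RTF = {:.2}", real_time_factor(9.2, 30.0));
}
```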
Usage
Browser (WASM)
A minimal sketch of one-shot transcription (type and method names are illustrative; see the crate docs for the exact API):

```rust
// Illustrative sketch — exact API names may differ; see the crate docs.
use whisper_apr::WhisperModel;

let model = WhisperModel::load("models/whisper-tiny-int4.apr")?;
let result = model.transcribe(&audio_samples, Default::default())?;
println!("{}", result.text);
```
Streaming Transcription
A sketch of the streaming API (processor type and method names are illustrative):

```rust
// Illustrative sketch — exact type and method names may differ.
use whisper_apr::StreamingConfig;

let config = StreamingConfig::default();
let mut processor = StreamingProcessor::new(config);

// Feed audio chunks as they arrive
while let Some(chunk) = audio_source.next_chunk() {
    processor.process_chunk(&chunk)?;
}

let final_result = processor.finalize()?;
println!("{}", final_result.text);
```
Installation
Prerequisites
- Rust 1.75+ with the `wasm32-unknown-unknown` target
- wasm-pack (for WASM builds)
Building from Source
```bash
# Clone the repository
git clone <repository-url> && cd whisper.apr

# Build native (for testing)
cargo build --release

# Build WASM
wasm-pack build --target web --release

# Run tests
cargo test
```
Model Conversion
Convert existing Whisper models to .apr format:
```bash
# From safetensors (Hugging Face)
# With int4 quantization for smaller size
```
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                         whisper.apr                             │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │    Audio     │   │   Encoder    │   │   Decoder    │         │
│  │  Processing  │──│  (6 layers)  │──│  (6 layers)  │──► Text  │
│  │              │   │              │   │              │         │
│  │ • Resampling │   │ • Self-Attn  │   │ • Self-Attn  │         │
│  │ • Mel Spec   │   │ • FFN        │   │ • Cross-Attn │         │
│  │ • STFT       │   │ • LayerNorm  │   │ • FFN        │         │
│  └──────────────┘   └──────────────┘   └──────────────┘         │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │  Tokenizer   │   │ Quantization │   │    SIMD      │         │
│  │              │   │              │   │  Primitives  │         │
│  │ • BPE        │   │ • Int4/Int8  │   │              │         │
│  │ • 51,865 tok │   │ • Mixed Prec │   │ • MatMul     │         │
│  │ • Multi-lang │   │ • Zero-Copy  │   │ • Softmax    │         │
│  └──────────────┘   └──────────────┘   └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘
```
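The audio stage maps linear frequency onto the mel scale before applying the filterbank. One common mapping is the HTK convention, sketched below for illustration; whisper.apr's exact filterbank may use a variant:

```rust
// Hz → mel mapping, HTK convention: mel = 2595 * log10(1 + f / 700).
// Illustrative only — the crate's actual filterbank may differ.
fn hz_to_mel(hz: f32) -> f32 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

// Inverse mapping, useful when placing filterbank edges.
fn mel_to_hz(mel: f32) -> f32 {
    700.0 * (10f32.powf(mel / 2595.0) - 1.0)
}

fn main() {
    // 1000 Hz lands near 1000 mel by construction of the scale.
    println!("1000 Hz = {:.0} mel", hz_to_mel(1000.0));
    println!("round trip: {:.1} Hz", mel_to_hz(hz_to_mel(440.0)));
}
```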
Module Overview
| Module | Description | LOC |
|---|---|---|
| `audio/` | Mel spectrogram, resampling, streaming, filterbank | ~48,600 |
| `model/` | Encoder, decoder, attention, quantization | ~21,800 |
| `wasm/` | JavaScript bindings, Web Worker support | ~10,000 |
| `format/` | .apr format, compression, streaming load | ~7,600 |
| `inference/` | Greedy/beam search decoding, KV cache | ~2,600 |
| `tokenizer/` | BPE tokenizer, vocabulary, special tokens | ~1,500 |
| `cuda/`, `cli/`, `backend/`, etc. | GPU, CLI, TUI, VAD, diarization | ~53,000 |
| **Total (src/)** | | ~145,000 |
Model Format
The .apr (Aprender) format is optimized for streaming and browser deployment:
```
┌────────────────────────────────────────┐
│            APR File Structure          │
├────────────────────────────────────────┤
│ Magic: "APR\0" (4 bytes)               │
│ Version: u32 (4 bytes)                 │
│ Header Size: u32 (4 bytes)             │
├────────────────────────────────────────┤
│ Model Config (JSON, compressed)        │
│ • n_vocab, n_audio_ctx, n_audio_state  │
│ • n_audio_head, n_audio_layer          │
│ • n_text_ctx, n_text_state, ...        │
├────────────────────────────────────────┤
│ Vocabulary (BPE tokens, compressed)    │
├────────────────────────────────────────┤
│ Tensor Blocks (streaming-ready)        │
│ • Block header (name, shape, dtype)    │
│ • Compressed tensor data (zstd)        │
│ • Quantization scales (if int4/int8)   │
└────────────────────────────────────────┘
```
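Reading the fixed 12-byte preamble is straightforward. A minimal sketch, following the field layout above (little-endian field encoding is an assumption here, not confirmed by the spec):

```rust
// Parse the 12-byte .apr preamble: magic "APR\0", then version and
// header size as u32. Little-endian encoding is assumed for illustration.

#[derive(Debug, PartialEq)]
struct AprPreamble {
    version: u32,
    header_size: u32,
}

fn parse_preamble(bytes: &[u8]) -> Result<AprPreamble, String> {
    if bytes.len() < 12 {
        return Err("file too short for APR preamble".into());
    }
    if &bytes[0..4] != b"APR\0" {
        return Err("bad magic".into());
    }
    // The slices are exactly 4 bytes, so try_into cannot fail here.
    let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    let header_size = u32::from_le_bytes(bytes[8..12].try_into().unwrap());
    Ok(AprPreamble { version, header_size })
}

fn main() {
    let mut file = b"APR\0".to_vec();
    file.extend_from_slice(&1u32.to_le_bytes()); // version
    file.extend_from_slice(&256u32.to_le_bytes()); // header size
    println!("{:?}", parse_preamble(&file));
}
```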
Format Benefits
- Streaming Load: Progressive tensor loading, start inference before full download
- Zero-Copy: Memory-mapped tensor access on native platforms
- Compression: Zstd compression for 30-50% smaller files
- Quantization Metadata: Embedded scales and zero-points for dequantization
Performance
Model Format Comparison
The .apr format is optimized for WASM delivery. Benchmark results for Whisper Tiny:
| Format | Size | Size vs. baseline | WASM Ready |
|---|---|---|---|
| SafeTensors | 145 MB | 100% (baseline) | ❌ Too large |
| GGML | 75 MB | 52% | ⚠️ Moderate |
| APR-f32 | 145 MB | 100% | ❌ Too large |
| APR-int8 | 37 MB | 25% | ✅ Excellent |
Loading Performance
| Metric | APR-f32 | APR-int8 | Improvement |
|---|---|---|---|
| File Read | 87ms | 21ms | 4x faster |
| Parse | 73ms | 19ms | 4x faster |
| Model Load | 490ms | 416ms | 15% faster |
| First Token | ~280ms | ~280ms | Same |
Run the benchmarks yourself:

```bash
cargo bench
```
Runtime Benchmarks (whisper-tiny on 30s audio)
| Platform | Time | Memory | RTF |
|---|---|---|---|
| Native (M1 Mac) | 9.2s | 180 MB | 0.31x |
| Native (x86 AVX2) | 12.1s | 180 MB | 0.40x |
| WASM (Chrome) | 18.5s | 220 MB | 0.62x |
| WASM (Firefox) | 21.3s | 225 MB | 0.71x |
| WASM (Safari) | 24.1s | 230 MB | 0.80x |
Optimization Techniques
- SIMD Vectorization: 4x speedup on supported operations
- KV-Cache Reuse: 60% reduction in decoder compute
- Quantized MatMul: Int4 compute with FP32 accumulation
- Memory Pooling: Eliminates allocation overhead after warmup
- Batch Processing: Process multiple audio segments in parallel
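The quantized-matmul idea above — low-precision weights with full-precision accumulation — reduces to a dot product like the following sketch (int8 shown instead of packed int4 for clarity; not the crate's actual SIMD kernel):

```rust
// Mixed-precision dot product: i8 weights dequantized on the fly via a
// per-tensor scale, accumulated in f32. Int8 is used instead of packed
// int4 for clarity; the real kernel would also vectorize this loop.
fn dot_q8_f32(weights_q: &[i8], scale: f32, activations: &[f32]) -> f32 {
    weights_q
        .iter()
        .zip(activations)
        .map(|(&w, &a)| (w as f32 * scale) * a)
        .sum()
}

fn main() {
    // Weights [0.5, -1.0, 2.0] stored as i8 {1, -2, 4} with scale 0.5.
    let y = dot_q8_f32(&[1, -2, 4], 0.5, &[1.0, 2.0, 3.0]);
    println!("{y}"); // 0.5*1 - 1.0*2 + 2.0*3 = 4.5
}
```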
API Reference
Core Types
The core Rust types mirror the TypeScript bindings below (fields abbreviated):

```rust
/// Main model interface
pub struct WhisperModel { /* ... */ }

/// Transcription options
pub struct TranscribeOptions { /* ... */ }

/// Transcription result
pub struct TranscribeResult { /* ... */ }
```
WASM Bindings
```typescript
// TypeScript definitions
export class WhisperModel {
  static load(url: string): Promise<WhisperModel>;
  transcribe(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  translate(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  detectLanguage(audio: Float32Array): Promise<DetectedLanguage>;
  free(): void;
}

export interface TranscribeOptions {
  language?: string;
  task?: 'transcribe' | 'translate';
  beamSize?: number;
  temperature?: number;
}

export interface TranscribeResult {
  text: string;
  segments: Segment[];
  language: string;
  languageProbability: number;
}
```
Demo Applications
Zero-JavaScript demos showcasing whisper.apr capabilities. All demos are pure Rust/WASM, served with Probar (which sets the COOP/COEP headers required for SharedArrayBuffer):

```bash
# Serve the demos with Probar, then open http://localhost:8080
```
Available Demos
| Demo | Description |
|---|---|
| Real-Time Transcription | Live microphone transcription with streaming results |
| File Upload Transcription | Upload audio/video files with timeline visualization |
| Real-Time Translation | Live speech-to-English translation (99 languages) |
| File Upload Translation | Batch translation of uploaded media files |
Running Tests
```bash
cargo test
```
Running Examples
The examples/ directory contains 100+ examples demonstrating various features:
```bash
# Basic transcription
cargo run --example basic_transcription

# Benchmark pipeline performance
cargo run --example benchmark_pipeline

# TUI-based benchmark visualization
cargo run --example benchmark_tui

# Format comparison (APR vs SafeTensors)
cargo run --example format_comparison

# Debug decoder output
cargo run --example debug_decoder

# Profile encoder performance
cargo run --example debug_encoder_output

# List all available examples
ls examples/
```
Example Categories
| Category | Examples | Description |
|---|---|---|
| Basic | `basic_transcription`, `cli_usage` | Getting started |
| Benchmark | `benchmark_pipeline`, `benchmark_tui` | Performance measurement |
| Debug | `debug_decoder`, `debug_encoder_output` | Model debugging |
| Comparison | `compare_hf_outputs`, `format_comparison` | Validation against reference |
| Pipeline | `pipeline_tui`, `pipeline_falsification` | Full pipeline analysis |
Development
Project Structure
```
whisper.apr/
├── src/
│   ├── lib.rs           # Library entry point
│   ├── audio/           # Audio processing
│   │   ├── mel.rs       # Mel spectrogram
│   │   ├── resampler.rs # Audio resampling
│   │   ├── batch.rs     # Batch preprocessing
│   │   └── streaming.rs # Streaming processor
│   ├── model/           # Neural network
│   │   ├── encoder.rs   # Transformer encoder
│   │   ├── decoder.rs   # Transformer decoder
│   │   ├── attention.rs # Multi-head attention
│   │   └── quantized.rs # Quantization support
│   ├── tokenizer/       # BPE tokenizer
│   ├── inference/       # Decoding strategies
│   ├── format/          # .apr format
│   └── wasm/            # WASM bindings
├── demos/               # Demo applications
├── benches/             # Criterion benchmarks
├── tests/               # Integration tests
└── docs/                # Documentation
```
Make Commands

```bash
make lint      # cargo clippy with -D warnings
make test      # run the test suite
make coverage  # generate the coverage report
```
Testing
```bash
# Unit tests
cargo test --lib

# Integration tests
cargo test --test '*'

# Property tests (included in the default test run)
cargo test

# WASM tests (requires wasm-pack)
wasm-pack test --headless --chrome
```
Quality Metrics
whisper.apr follows EXTREME TDD methodology with comprehensive quality gates.
PMAT-Verified Scores (via pmat tooling)
| Metric | Score | Grade |
|---|---|---|
| Rust Project Score | 156/159 | A+ (98.1%) |
| TDG (Technical Debt Grade) | 90.9/100 | A |
| Repository Health | 81.5/100 | B+ |
| Maintainability Index | 70.0 | — |
| Median Cyclomatic Complexity | 2.00 | — |
Codebase Statistics
| Metric | Value |
|---|---|
| Test Count | 2,273 |
| Total Functions | 933 |
| Source LOC | ~145,000 |
| Examples | 103 |
Quality Gate Configuration
The gates are defined in `.pmat-metrics.toml`. They set coverage targets of 95% and 85%, cap cyclomatic complexity at 10 per function, target an A+ grade with zero tolerance for SATD comments, and enforce performance budgets of at most 2.0x-2.5x real-time and 150-350 MB peak memory.
Toyota Way Principles
- Jidoka: Automatic quality gates prevent defects
- Kaizen: Continuous improvement through iteration
- Genchi Genbutsu: Tests verify actual behavior, not assumptions
Contributing
Contributions are welcome! Please follow these guidelines:
Development Workflow
- Fork the repository
- Create a feature branch from `master`
- Make your changes
- Run quality gates: `make lint && make test && make coverage`
- Ensure coverage remains above 95%
- Submit a pull request
Code Standards
- All code must pass `cargo clippy -- -D warnings`
- Format with `cargo fmt`
- No `unwrap()` calls - use `Result` types
- Zero TODO/FIXME/HACK comments - create tickets instead
- Document all public APIs
Testing Requirements
```bash
# Run all quality gates
make lint && make test && make coverage
```
Commit Messages
Follow conventional commits format:
- `feat(module): add new feature`
- `fix(module): fix bug`
- `refactor(module): improve code`
- `docs(module): update documentation`
Roadmap
v0.2.0 (Current Release)
- Full Whisper architecture (encoder-decoder transformer)
- Int4/Int8 quantization with .apr format
- WASM SIMD acceleration
- Streaming transcription
- 99 language support with auto-detection
- Greedy and beam search decoding
- GPU-resident tensor architecture via trueno-gpu
- CUDA acceleration with 5.8x speedup
- 2,273+ tests, TDG A grade (90.9/100)
v0.3.0 (Planned)
- WebGPU acceleration
- Turbo model support
- Word-level timestamps
- Voice activity detection
License
Licensed under the MIT License. See LICENSE for details.