whisper-apr 0.2.3

WASM-first automatic speech recognition engine implementing OpenAI Whisper

Overview

whisper.apr is a pure Rust implementation of OpenAI's Whisper speech recognition model, engineered from the ground up for WebAssembly (WASM) deployment. It features a custom .apr model format optimized for browser streaming, SIMD acceleration, and int4/int8 quantization for efficient edge inference.

Key Differentiators

Feature               whisper.apr   whisper.cpp   whisper-web
Pure Rust             Yes           C++           JavaScript
WASM-First            Yes           Ported        Native
Int4 Quantization     Yes           Int8 only     No
Streaming Inference   Yes           Batch only    Limited
Zero-Copy Loading     Yes           No            No
Custom Format (.apr)  Yes           GGML          ONNX
Browser-Native        Yes           Emscripten    Yes

Features

Core Capabilities

  • Full Whisper Implementation: Encoder-decoder transformer with multi-head attention
  • Multi-Language Support: 99 languages with automatic language detection
  • Streaming Transcription: Real-time audio processing with chunked inference
  • Translation Mode: Speech-to-English translation for all supported languages

Optimization Features

  • WASM SIMD: Hardware-accelerated vector operations in browser
  • Int4/Int8 Quantization: 4x-8x model size reduction with minimal accuracy loss
  • Mixed-Precision Inference: Int4 weights with FP32 activations
  • KV-Cache Optimization: Efficient autoregressive decoding
  • Memory Pooling: Zero-allocation inference after warmup
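To make the mixed-precision idea concrete, here is a minimal sketch of int4 dequantization with per-group scales. The packing layout and function names are hypothetical illustrations, not whisper.apr's actual kernels:

```rust
// Sketch: two signed 4-bit weights packed per byte, one f32 scale per group.

/// Unpack one byte into two signed 4-bit values in [-8, 7].
fn unpack_int4(byte: u8) -> (i8, i8) {
    let lo = (byte & 0x0F) as i8;
    let hi = (byte >> 4) as i8;
    // Sign-extend from 4 bits.
    let sext = |v: i8| if v >= 8 { v - 16 } else { v };
    (sext(lo), sext(hi))
}

/// Dequantize packed int4 weights to f32, sharing one scale per group.
fn dequantize_int4(packed: &[u8], scales: &[f32], group_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for (i, &byte) in packed.iter().enumerate() {
        let (a, b) = unpack_int4(byte);
        let scale = scales[(i * 2) / group_size];
        out.push(a as f32 * scale);
        out.push(b as f32 * scale);
    }
    out
}

fn main() {
    // 0x1F packs lo = 15 -> -1 and hi = 1 -> 1.
    let weights = dequantize_int4(&[0x1F], &[0.5], 2);
    println!("{:?}", weights); // [-0.5, 0.5]
}
```

In a mixed-precision matmul, these dequantized values would feed FP32 accumulation while the weights stay 4-bit in memory.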

Model Support

Model    Parameters   .apr Size (Int4)   .apr Size (Int8)   RTF*
tiny     39M          20 MB              39 MB              0.3x
base     74M          37 MB              74 MB              0.5x
small    244M         122 MB             244 MB             0.8x
medium   769M         385 MB             769 MB             1.2x
large    1.5B         750 MB             1.5 GB             2.0x

*RTF = Real-Time Factor on M1 MacBook (lower is faster)
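The RTF figures above are just processing time divided by audio duration, so values below 1.0 mean faster than real time:

```rust
// Real-time factor: processing time / audio duration (lower is faster).
fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

fn main() {
    // e.g. transcribing 30 s of audio in 9 s gives an RTF of 0.3x.
    let rtf = real_time_factor(9.0, 30.0);
    println!("RTF: {:.1}x", rtf); // 0.3x
}
```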


Usage

Browser (WASM)

<script type="module">
  import init, { WhisperModel } from './whisper_apr.js';

  async function transcribe() {
    await init();

    const model = await WhisperModel.load('/models/whisper-tiny.apr');
    const audioData = await fetchAudioAsFloat32Array('/audio/sample.wav');

    const result = await model.transcribe(audioData);
    console.log(result.text);
  }
</script>

Rust

use whisper_apr::{WhisperModel, TranscribeOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = WhisperModel::load("whisper-tiny.apr")?;

    let audio = whisper_apr::load_audio("sample.wav")?;
    let result = model.transcribe(&audio, TranscribeOptions::default())?;

    println!("{}", result.text);
    Ok(())
}

Streaming Transcription

use whisper_apr::{StreamingProcessor, StreamingConfig};

let config = StreamingConfig {
    chunk_duration_ms: 5000,
    overlap_ms: 500,
    language: Some("en".to_string()),
};

let mut processor = StreamingProcessor::new(model, config);

// Feed audio chunks as they arrive
while let Some(chunk) = audio_source.next_chunk() {
    if let Some(partial) = processor.process_chunk(&chunk)? {
        println!("Partial: {}", partial.text);
    }
}

let final_result = processor.finalize()?;
println!("Final: {}", final_result.text);
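The chunk/overlap arithmetic behind a config like the one above can be sketched as follows. This is sample-index bookkeeping only, not the library's internal scheduler:

```rust
// Sketch: start offsets of overlapping chunks tiling an audio stream.
// Each chunk begins (chunk - overlap) samples after the previous one.

fn chunk_starts(total: usize, chunk: usize, overlap: usize) -> Vec<usize> {
    let step = chunk - overlap;
    let mut starts = Vec::new();
    let mut pos = 0;
    while pos < total {
        starts.push(pos);
        if pos + chunk >= total {
            break; // last chunk reaches the end of the stream
        }
        pos += step;
    }
    starts
}

fn main() {
    // 16 kHz audio: 5000 ms chunks with 500 ms overlap -> 4500 ms step.
    let sr = 16_000;
    let starts = chunk_starts(12 * sr, 5 * sr, sr / 2);
    println!("{:?}", starts); // [0, 72000, 144000]
}
```

The 500 ms overlap exists so words straddling a chunk boundary appear in both chunks and can be reconciled when partial results are merged.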

Installation

Prerequisites

  • Rust 1.75+ with wasm32-unknown-unknown target
  • wasm-pack (for WASM builds)

Building from Source

# Clone the repository
git clone https://github.com/paiml/whisper.apr.git
cd whisper.apr

# Build native (for testing)
cargo build --release

# Build WASM
make wasm

# Run tests
cargo test

Model Conversion

Convert existing Whisper models to .apr format:

# From safetensors (Hugging Face)
cargo run --bin convert -- \
  --input openai/whisper-tiny \
  --output whisper-tiny.apr \
  --quantize int8

# With int4 quantization for smaller size
cargo run --bin convert -- \
  --input openai/whisper-small \
  --output whisper-small-int4.apr \
  --quantize int4
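Symmetric per-tensor int8 quantization, the simplest variant of what such a converter performs, can be sketched as below. The actual converter may use per-channel scales and zero-points:

```rust
// Sketch: map f32 values into [-127, 127] with a single per-tensor scale.

fn quantize_int8(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn main() {
    let (q, scale) = quantize_int8(&[-1.0, 0.0, 0.5, 1.0]);
    // Dequantization later multiplies each i8 by `scale`.
    println!("{:?} scale={:.5}", q, scale);
}
```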

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        whisper.apr                               │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │    Audio     │  │   Encoder    │  │   Decoder    │          │
│  │  Processing  │──│  (6 layers)  │──│  (6 layers)  │──► Text  │
│  │              │  │              │  │              │          │
│  │ • Resampling │  │ • Self-Attn  │  │ • Self-Attn  │          │
│  │ • Mel Spec   │  │ • FFN        │  │ • Cross-Attn │          │
│  │ • STFT       │  │ • LayerNorm  │  │ • FFN        │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │  Tokenizer   │  │ Quantization │  │    SIMD      │          │
│  │              │  │              │  │  Primitives  │          │
│  │ • BPE        │  │ • Int4/Int8  │  │              │          │
│  │ • 51,865 tok │  │ • Mixed Prec │  │ • MatMul     │          │
│  │ • Multi-lang │  │ • Zero-Copy  │  │ • Softmax    │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
└─────────────────────────────────────────────────────────────────┘
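The audio stage above converts an STFT power spectrum to a mel-scale filterbank. As a sketch of one common Hz↔mel mapping (the HTK formula; Whisper implementations vary, e.g. librosa defaults to the Slaney scale):

```rust
// Sketch: HTK-style Hz <-> mel conversion, illustrative only.

fn hz_to_mel(hz: f32) -> f32 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

fn mel_to_hz(mel: f32) -> f32 {
    700.0 * (10f32.powf(mel / 2595.0) - 1.0)
}

fn main() {
    // Center frequencies for 80 mel bins over 0..8000 Hz (16 kHz Nyquist):
    // evenly spaced in mel, warped back to Hz.
    let (lo, hi) = (hz_to_mel(0.0), hz_to_mel(8000.0));
    let centers: Vec<f32> = (1..=80)
        .map(|i| mel_to_hz(lo + (hi - lo) * i as f32 / 81.0))
        .collect();
    println!("first mel-bin center ≈ {:.1} Hz", centers[0]);
}
```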

Module Overview

Module                       Description                                         LOC
audio/                       Mel spectrogram, resampling, streaming, filterbank  ~48,600
model/                       Encoder, decoder, attention, quantization           ~21,800
wasm/                        JavaScript bindings, Web Worker support             ~10,000
format/                      .apr format, compression, streaming load            ~7,600
inference/                   Greedy/beam search decoding, KV cache               ~2,600
tokenizer/                   BPE tokenizer, vocabulary, special tokens           ~1,500
cuda/, cli/, backend/, etc.  GPU, CLI, TUI, VAD, diarization                     ~53,000
Total (src/)                                                                     ~145,000

Model Format

The .apr (Aprender) format is optimized for streaming and browser deployment:

┌────────────────────────────────────────┐
│           APR File Structure            │
├────────────────────────────────────────┤
│ Magic: "APR\0" (4 bytes)               │
│ Version: u32 (4 bytes)                 │
│ Header Size: u32 (4 bytes)             │
├────────────────────────────────────────┤
│ Model Config (JSON, compressed)        │
│ • n_vocab, n_audio_ctx, n_audio_state  │
│ • n_audio_head, n_audio_layer          │
│ • n_text_ctx, n_text_state, ...        │
├────────────────────────────────────────┤
│ Vocabulary (BPE tokens, compressed)    │
├────────────────────────────────────────┤
│ Tensor Blocks (streaming-ready)        │
│ • Block header (name, shape, dtype)    │
│ • Compressed tensor data (zstd)        │
│ • Quantization scales (if int4/int8)   │
└────────────────────────────────────────┘
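Reading the fixed 12-byte preamble described above can be sketched like this. Endianness and exact layout are assumptions for illustration; consult the format module for the real parser:

```rust
// Sketch: parse magic, version, and header size as little-endian u32s.

fn parse_preamble(bytes: &[u8]) -> Option<(u32, u32)> {
    if bytes.len() < 12 || &bytes[0..4] != b"APR\0" {
        return None; // wrong magic or truncated file
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().ok()?);
    let header_size = u32::from_le_bytes(bytes[8..12].try_into().ok()?);
    Some((version, header_size))
}

fn main() {
    let mut file = b"APR\0".to_vec();
    file.extend_from_slice(&2u32.to_le_bytes()); // version
    file.extend_from_slice(&4096u32.to_le_bytes()); // header size
    println!("{:?}", parse_preamble(&file)); // Some((2, 4096))
}
```

Because the preamble and header sit at the front, a streaming loader can begin decoding tensor blocks as soon as those first bytes arrive.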

Format Benefits

  • Streaming Load: Progressive tensor loading, start inference before full download
  • Zero-Copy: Memory-mapped tensor access on native platforms
  • Compression: Zstd compression for 30-50% smaller files
  • Quantization Metadata: Embedded scales and zero-points for dequantization

Performance

Model Format Comparison

The .apr format is optimized for WASM delivery. Benchmark results for Whisper Tiny:

Format       Size     Size vs. f32      WASM Ready
SafeTensors  145 MB   100% (baseline)   ❌ Too large
GGML         75 MB    52%               ⚠️ Moderate
APR-f32      145 MB   100%              ❌ Too large
APR-int8     37 MB    25%               ✅ Excellent

Loading Performance

Metric       APR-f32   APR-int8   Improvement
File Read    87 ms     21 ms      4x faster
Parse        73 ms     19 ms      4x faster
Model Load   490 ms    416 ms     15% faster
First Token  ~280 ms   ~280 ms    Same

Run the benchmark yourself:

cargo run --example format_comparison --release

Runtime Benchmarks (whisper-tiny on 30s audio)

Platform            Time     Memory   RTF
Native (M1 Mac)     9.2 s    180 MB   0.31x
Native (x86 AVX2)   12.1 s   180 MB   0.40x
WASM (Chrome)       18.5 s   220 MB   0.62x
WASM (Firefox)      21.3 s   225 MB   0.71x
WASM (Safari)       24.1 s   230 MB   0.80x

Optimization Techniques

  1. SIMD Vectorization: 4x speedup on supported operations
  2. KV-Cache Reuse: 60% reduction in decoder compute
  3. Quantized MatMul: Int4 compute with FP32 accumulation
  4. Memory Pooling: Eliminates allocation overhead after warmup
  5. Batch Processing: Process multiple audio segments in parallel
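The KV-cache technique (point 2) can be sketched as follows: each autoregressive step appends one key/value row instead of re-projecting the entire prefix. Shapes are simplified to a single head; this is illustrative, not whisper.apr's decoder:

```rust
// Sketch: per-step key/value cache for autoregressive decoding.

struct KvCache {
    keys: Vec<Vec<f32>>,   // one row per generated token
    values: Vec<Vec<f32>>, // same length as `keys`
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    /// Append this step's key/value; earlier rows are reused, not recomputed.
    fn push(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..3 {
        // In a real decoder these come from projecting the new token only.
        cache.push(vec![step as f32; 4], vec![step as f32; 4]);
    }
    println!("cached steps: {}", cache.len()); // 3
}
```

Per-step cost for self-attention thus grows with the prefix length only through the attention dot products, not through recomputed projections, which is where the quoted 60% decoder-compute reduction comes from.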

API Reference

Core Types

/// Main model interface
pub struct WhisperModel { /* ... */ }

impl WhisperModel {
    /// Load model from .apr file
    pub fn load(path: impl AsRef<Path>) -> WhisperResult<Self>;

    /// Load with custom options
    pub fn load_with_options(path: impl AsRef<Path>, opts: LoadOptions) -> WhisperResult<Self>;

    /// Transcribe audio samples (f32, 16kHz mono)
    pub fn transcribe(&self, audio: &[f32], opts: TranscribeOptions) -> WhisperResult<TranscribeResult>;

    /// Translate to English
    pub fn translate(&self, audio: &[f32], opts: TranscribeOptions) -> WhisperResult<TranscribeResult>;

    /// Detect language
    pub fn detect_language(&self, audio: &[f32]) -> WhisperResult<DetectedLanguage>;
}

/// Transcription options
pub struct TranscribeOptions {
    pub language: Option<String>,      // Force language (None = auto-detect)
    pub task: Task,                    // Transcribe or Translate
    pub beam_size: usize,              // Beam search width (1 = greedy)
    pub best_of: usize,                // Sample multiple and pick best
    pub temperature: f32,              // Sampling temperature
    pub compression_ratio_threshold: f32,
    pub logprob_threshold: f32,
    pub no_speech_threshold: f32,
}

/// Transcription result
pub struct TranscribeResult {
    pub text: String,
    pub segments: Vec<Segment>,
    pub language: String,
    pub language_probability: f32,
}

WASM Bindings

// TypeScript definitions
export class WhisperModel {
  static load(url: string): Promise<WhisperModel>;
  transcribe(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  translate(audio: Float32Array, options?: TranscribeOptions): Promise<TranscribeResult>;
  detectLanguage(audio: Float32Array): Promise<DetectedLanguage>;
  free(): void;
}

export interface TranscribeOptions {
  language?: string;
  task?: 'transcribe' | 'translate';
  beamSize?: number;
  temperature?: number;
}

export interface TranscribeResult {
  text: string;
  segments: Segment[];
  language: string;
  languageProbability: number;
}

Demo Applications

Zero-JavaScript demos showcasing whisper.apr capabilities. All demos are pure Rust/WASM, served with Probar (which sets the COOP/COEP headers required for SharedArrayBuffer):

cd demos && probar serve
# Open http://localhost:8080

Available Demos

Demo                        Description
Real-Time Transcription     Live microphone transcription with streaming results
File Upload Transcription   Upload audio/video files with timeline visualization
Real-Time Translation       Live speech-to-English translation (99 languages)
File Upload Translation     Batch translation of uploaded media files

Running Tests

cd demos && probar test -v    # Run all demo tests
probar coverage               # Pixel regression tests

Running Examples

The examples/ directory contains 100+ examples demonstrating various features:

# Basic transcription
cargo run --example basic_transcription --release

# Benchmark pipeline performance
cargo run --example benchmark_pipeline --release

# TUI-based benchmark visualization
cargo run --example benchmark_tui --release --features tui

# Format comparison (APR vs SafeTensors)
cargo run --example format_comparison --release

# Debug decoder output
cargo run --example debug_decoder --release

# Profile encoder performance
cargo run --example profile_encoder --release

# List all available examples
ls examples/*.rs | xargs -I {} basename {} .rs

Example Categories

Category     Examples                                Description
Basic        basic_transcription, cli_usage          Getting started
Benchmark    benchmark_pipeline, benchmark_tui       Performance measurement
Debug        debug_decoder, debug_encoder_output     Model debugging
Comparison   compare_hf_outputs, format_comparison   Validation against reference
Pipeline     pipeline_tui, pipeline_falsification    Full pipeline analysis

Development

Project Structure

whisper.apr/
├── src/
│   ├── lib.rs              # Library entry point
│   ├── audio/              # Audio processing
│   │   ├── mel.rs          # Mel spectrogram
│   │   ├── resampler.rs    # Audio resampling
│   │   ├── batch.rs        # Batch preprocessing
│   │   └── streaming.rs    # Streaming processor
│   ├── model/              # Neural network
│   │   ├── encoder.rs      # Transformer encoder
│   │   ├── decoder.rs      # Transformer decoder
│   │   ├── attention.rs    # Multi-head attention
│   │   └── quantized.rs    # Quantization support
│   ├── tokenizer/          # BPE tokenizer
│   ├── inference/          # Decoding strategies
│   ├── format/             # .apr format
│   └── wasm/               # WASM bindings
├── demos/                  # Demo applications
├── benches/                # Criterion benchmarks
├── tests/                  # Integration tests
└── docs/                   # Documentation

Make Commands

make build      # Build release
make wasm       # Build WASM package
make test       # Run all tests
make bench      # Run benchmarks
make lint       # Clippy + fmt check
make coverage   # Generate coverage report
make docs       # Build documentation

Testing

# Unit tests
cargo test --lib

# Integration tests
cargo test --test integration

# Property tests
cargo test --test property_tests

# WASM tests (requires wasm-pack)
wasm-pack test --headless --chrome

Quality Metrics

whisper.apr follows EXTREME TDD methodology with comprehensive quality gates.

PMAT-Verified Scores (via pmat tooling)

Metric                         Score      Grade
Rust Project Score             156/159    A+ (98.1%)
TDG (Technical Debt Grade)     90.9/100   A
Repository Health              81.5/100   B+
Maintainability Index          70.0       —
Median Cyclomatic Complexity   2.00       —

Codebase Statistics

Metric            Value
Test Count        2,273
Total Functions   933
Source LOC        ~145,000
Examples          103

Quality Gate Configuration

# From .pmat-metrics.toml
[quality_gates]
min_coverage_pct = 95.0           # Target
min_mutation_score_pct = 85.0     # Target
max_cyclomatic_complexity = 10    # Per function
min_tdg_grade = "A+"              # Target
max_unwrap_calls = 0              # Zero tolerance

[performance]
max_rtf_tiny = 2.0                # ≤2.0x real-time
max_rtf_base = 2.5                # ≤2.5x real-time
max_memory_tiny_mb = 150          # Peak memory
max_memory_base_mb = 350          # Peak memory

Toyota Way Principles

  • Jidoka: Automatic quality gates prevent defects
  • Kaizen: Continuous improvement through iteration
  • Genchi Genbutsu: Tests verify actual behavior, not assumptions

Contributing

Contributions are welcome! Please follow these guidelines:

Development Workflow

  1. Fork the repository
  2. Create a feature branch from master
  3. Make your changes
  4. Run quality gates: make lint && make test && make coverage
  5. Ensure coverage remains above 95%
  6. Submit a pull request

Code Standards

  • All code must pass cargo clippy -- -D warnings
  • Format with cargo fmt
  • No unwrap() calls - use Result types
  • Zero TODO/FIXME/HACK comments - create tickets instead
  • Document all public APIs

Testing Requirements

# Run all quality gates
make test          # All tests
make coverage      # Must be >= 95%
pmat quality-gate  # Must pass

Commit Messages

Follow conventional commits format:

  • feat(module): add new feature
  • fix(module): fix bug
  • refactor(module): improve code
  • docs(module): update documentation

Roadmap

v0.2.0 (Current Release)

  • Full Whisper architecture (encoder-decoder transformer)
  • Int4/Int8 quantization with .apr format
  • WASM SIMD acceleration
  • Streaming transcription
  • 99 language support with auto-detection
  • Greedy and beam search decoding
  • GPU-resident tensor architecture via trueno-gpu
  • CUDA acceleration with 5.8x speedup
  • 2,273+ tests, TDG A grade (90.9/100)

v0.3.0 (Planned)

  • WebGPU acceleration
  • Turbo model support
  • Word-level timestamps
  • Voice activity detection

License

Licensed under the MIT License. See LICENSE for details.