# Mel Spec
A Rust implementation of mel spectrograms with support for:
- Whisper-compatible mel spectrograms (aligned with whisper.cpp, PyTorch, librosa)
- Kaldi-compatible filterbank features (matching kaldi_native_fbank output)
- NeMo-compatible mel filters
## Examples

See wavey-ai/hush for a live demo. The examples cover how to:

- stream microphone or WAV audio to the mel WASM worker
- stream from ffmpeg to whisper.cpp
- convert audio to mel spectrograms and save them as images
- transcribe the images with whisper.cpp
## Usage

```rust
use mel_spec::prelude::*;
```
## Kaldi-compatible Filterbank Features
The fbank module provides Kaldi-style filterbank features with parity to kaldi_native_fbank. This is useful for speaker embedding models like WeSpeaker and pyannote.
```rust
use mel_spec::fbank::*;

// Default config matches kaldi_native_fbank defaults.
// (Type names here are indicative; see the fbank module docs for the exact items.)
let config = FbankConfig::default();
let fbank = Fbank::new(config);

// Compute features from audio samples (mono, f32, 16kHz)
let samples: Vec<f32> = vec![0.0; 16_000];
let features = fbank.compute(&samples);
// Returns Array2<f32> with shape (num_frames, 80)
```
Kaldi defaults:
- Sample rate: 16000 Hz
- Mel bins: 80
- Frame length: 25ms (400 samples)
- Frame shift: 10ms (160 samples)
- Window: Povey (like Hamming, but tapers to zero at the edges; see the sketch below)
- Preemphasis: 0.97
- CMN: enabled (subtract mean across time)
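For reference, a short sketch (assuming these match Kaldi's definitions) of what the defaults imply: the frame sizes in samples, and the Povey window, which is a Hann window raised to the power 0.85:

```rust
use std::f64::consts::PI;

fn main() {
    let sample_rate = 16_000.0_f64;
    let frame_length = (sample_rate * 0.025) as usize; // 25ms -> 400 samples
    let frame_shift = (sample_rate * 0.010) as usize; // 10ms -> 160 samples
    assert_eq!((frame_length, frame_shift), (400, 160));

    // Povey window: a Hann window raised to the power 0.85, so it still
    // tapers to exactly zero at both edges (a Hamming window does not).
    let povey: Vec<f64> = (0..frame_length)
        .map(|n| {
            let hann = 0.5 - 0.5 * (2.0 * PI * n as f64 / (frame_length - 1) as f64).cos();
            hann.powf(0.85)
        })
        .collect();
    assert!(povey[0] == 0.0 && povey[frame_length - 1] < 1e-12);
}
```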
## Mel Filterbank (Whisper/librosa compatible)

Mel filterbanks within 1.0e-7 of librosa, and identical to the filters embedded in Whisper GGML models.
```rust
use mel_spec::mel::mel;

let sampling_rate = 16000.0;
let fft_size = 400;
let n_mels = 80;

// Slaney-normalised, non-HTK filters (the librosa defaults); check the
// mel() docs for the exact argument list.
let filters = mel(sampling_rate, fft_size, n_mels, None, None, false, true);
// Returns Array2<f64> with shape (80, 201), i.e. (n_mels, fft_size / 2 + 1)
```
## Spectrogram using Short-Time Fourier Transform
STFT with overlap-and-save that has parity with PyTorch and whisper.cpp. Suitable for streaming audio.
```rust
use mel_spec::stft::Spectrogram;

let fft_size = 400;
let hop_size = 160;
let mut spectrogram = Spectrogram::new(fft_size, hop_size);

// Add PCM audio samples; a frame is emitted once enough samples have
// accumulated.
let samples: Vec<f32> = vec![0.0; 1024];
if let Some(fft_frame) = spectrogram.add(&samples) {
    // use the FFT frame, e.g. pass it to MelSpectrogram::add
}
```
## STFT to Mel Spectrogram

Applies a pre-computed filterbank to FFT results. Output is identical to whisper.cpp and whisper.py.
```rust
use mel_spec::mel::MelSpectrogram;
use ndarray::Array1;
use num::complex::Complex;

let fft_size = 400;
let sampling_rate = 16000.0;
let n_mels = 80;
let mut mel_spec = MelSpectrogram::new(fft_size, sampling_rate, n_mels);

// Example FFT input - in practice this comes from Spectrogram::add
let fft_input = Array1::from(vec![Complex::new(1.0, 0.0); fft_size]);
let mel_frame = mel_spec.add(&fft_input);
```
## RingBuffer for Streaming
For creating spectrograms from streaming audio, see RingBuffer in rb.rs.
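As a rough illustration of the streaming pattern only (std's VecDeque standing in for the crate's RingBuffer, and a hypothetical next_pcm_chunk() standing in for your audio source):

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for a real audio source (microphone callback,
// network stream, ffmpeg pipe, ...).
fn next_pcm_chunk() -> Option<Vec<f32>> {
    None
}

fn main() {
    let hop_size = 160;
    let mut buffer: VecDeque<f32> = VecDeque::new();

    while let Some(chunk) = next_pcm_chunk() {
        buffer.extend(chunk);
        // Drain hop-sized chunks as they become available; each would be
        // fed to Spectrogram::add, and any emitted FFT frame passed on to
        // MelSpectrogram::add.
        while buffer.len() >= hop_size {
            let hop: Vec<f32> = buffer.drain(..hop_size).collect();
            let _ = hop;
        }
    }
}
```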
## Saving Mel Spectrograms to TGA
Mel spectrograms can be saved as 8-bit TGA images (uncompressed, supported by macOS and Windows). These images encode quantized mel spectrogram data that whisper.cpp can process directly without audio input.
```rust
use mel_spec::quant::{load_tga_8bit, save_tga_8bit};

// Save spectrogram (argument order indicative; see the quant module docs)
save_tga_8bit(&mel, n_mels, "spectrogram.tga").unwrap();

// Load and use with whisper.cpp
let mel = load_tga_8bit("spectrogram.tga").unwrap();
```
TGA files are lossless for speech-to-text: they encode all the information available in the model's view of the raw audio.

```sh
ffmpeg -i audio.mp3 -f f32le -ar 16000 -ac 1 pipe:1 | ./target/release/tga_whisper -t spectrogram.tga
```

Output: "the quest for peace."
## Voice Activity Detection
Uses Sobel edge detection to find speech boundaries in mel spectrograms. This enables real-time processing by finding natural cut points between words/phrases.
```rust
use mel_spec::vad::*;

// Settings control the edge-detection thresholds; the defaults here are
// indicative - see DetectionSettings for the tunable fields.
let settings = DetectionSettings::default();
let vad = VoiceActivityDetector::new(&settings);
```
Speech in mel spectrograms is characterized by clear gradients. The VAD finds vertical gaps suitable for cutting, and drops frames that look like gaps in speech (which cause Whisper hallucinations).
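A minimal sketch of the idea (not the crate's implementation): a horizontal Sobel kernel responds to frame-to-frame energy changes, so columns whose response is near zero across all mel bins are candidate cut points.

```rust
use ndarray::Array2;

// Horizontal Sobel response over a mel spectrogram (rows = mel bins,
// columns = frames). Large magnitudes mark frames where energy changes
// sharply over time - the edges the VAD looks for.
fn sobel_x(mel: &Array2<f32>) -> Array2<f32> {
    let k: [[f32; 3]; 3] = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]];
    let (rows, cols) = mel.dim();
    let mut out = Array2::<f32>::zeros((rows, cols));
    for y in 1..rows.saturating_sub(1) {
        for x in 1..cols.saturating_sub(1) {
            let mut acc = 0.0;
            for dy in 0..3 {
                for dx in 0..3 {
                    acc += k[dy][dx] * mel[(y + dy - 1, x + dx - 1)];
                }
            }
            out[(y, x)] = acc;
        }
    }
    out
}
```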

Examples from JFK speech:
- Energy but no speech - VAD correctly rejects
- Fleeting word - VAD correctly detects
Full JFK transcript with VAD: jfk_transcript_golden.txt
## Performance

### CPU Performance
Benchmarks on Apple M1 Pro (single-threaded, release build):
| Audio Length | Frames | Time | Throughput |
|---|---|---|---|
| 10s | 997 | 21ms | 476x realtime |
| 60s | 5997 | 124ms | 484x realtime |
| 300s (5 min) | 29997 | 622ms | 482x realtime |
Full mel spectrogram pipeline (STFT + mel filterbank + log) at 16kHz, FFT size 512, hop 160, 80 mel bins.
CPU performance is excellent - processing 5 minutes of audio in 622ms means the library is ~480x faster than realtime on a single core.
### GPU Acceleration
This library focuses on pure Rust CPU implementation. For GPU acceleration, consider:
| Option | Speedup | Notes |
|---|---|---|
| NVIDIA NeMo | ~10x over CPU | Python/PyTorch, uses cuBLAS/cuDNN, best for batch processing |
| torchaudio | ~5-10x | Python/PyTorch, CUDA backend |
| mel-spec gpu branch | ~1.6x | Experimental, requires CUDA toolkit + nvcc |
For a pure-Rust GPU path, wgpu/rust-gpu are also an option, though that ecosystem is still maturing.
The gpu branch keeps the full pipeline on GPU (STFT → mel filterbank → log). Requires C++ CUDA kernels and NVIDIA hardware.
## Discussion
- Mel spectrograms encode at 6.4KB/sec (80 mel bins × 2 bytes × 40 frames/sec)
- Float PCM for Whisper is 64KB/sec at 16kHz (4 bytes × 16,000 samples/sec)
whisper.cpp produces mel spectrograms with 1.0e-6 precision, but the results are invariant to 8-bit quantization: we can save them as 8-bit images without losing useful information.
We only need 1.0e-1 precision for accurate results, and rounding may actually improve some difficult transcriptions:

```
Original: [0.158, 0.266, 0.076, 0.196, 0.167, ...]
Rounded:  [0.2,   0.3,   0.1,   0.2,   0.2,   ...]
```
Once quantized, the spectrograms are identical (figure: top original, bottom rounded to 1.0e-1).
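A minimal sketch of the 8-bit round trip, assuming values already normalised to a fixed range ([-1, 1] here is illustrative; the crate's quant module is the reference implementation):

```rust
// With 256 levels over [-1.0, 1.0], the step is 2/255 (~0.008), so the
// worst-case round-trip error (~0.004) is far below the 1.0e-1 precision
// that transcription actually needs.
fn quantize(x: f32) -> u8 {
    (((x.clamp(-1.0, 1.0) + 1.0) / 2.0) * 255.0).round() as u8
}

fn dequantize(b: u8) -> f32 {
    (b as f32 / 255.0) * 2.0 - 1.0
}

fn main() {
    let original = [0.158_f32, 0.266, 0.076, 0.196, 0.167];
    for x in original {
        let err = (x - dequantize(quantize(x))).abs();
        assert!(err <= 1.0 / 255.0);
    }
}
```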
Speech is captured almost entirely in the frequency domain, and the mel scale divides those frequencies into just 80 bins. 8 bits of grayscale is probably overkill - the data could be compressed further.