rlx-fft
Learned butterfly FFT + spectral pipelines (mel, Welch PSD, top-K Welch peaks), compiled via RLX.
Welch peaks (fast top-K spikes)
Extract top-K frequency spikes (bin, power) without materializing a full Welch PSD. The fast path uses 2 Welch segments (vs 8 in full Welch); an ultra-fast path uses 1 segment for minimum latency.
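The idea of picking top-K bins without storing a full PSD can be sketched as follows. This is an illustration only, not the crate's implementation: `Peak`, `top_k_streaming`, `bin_power`, and `welch_peaks_sketch` are hypothetical names, the per-bin power uses a naive unwindowed DFT where the real fast path uses rustfft, and "peak" here simply means one of the K strongest bins.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// One spectral peak: FFT bin index and its segment-averaged power.
#[derive(Debug, Clone, Copy)]
pub struct Peak {
    pub bin: usize,
    pub power: f32,
}

// Reverse the power ordering so that BinaryHeap (a max-heap) pops the
// weakest peak first, which makes it behave as a bounded min-heap.
impl Ord for Peak {
    fn cmp(&self, other: &Self) -> Ordering {
        other
            .power
            .total_cmp(&self.power)
            .then_with(|| self.bin.cmp(&other.bin))
    }
}
impl PartialOrd for Peak {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Peak {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Peak {}

/// Keep only the K strongest bins while consuming per-bin powers one at a time.
pub fn top_k_streaming(powers: impl Iterator<Item = f32>, k: usize) -> Vec<Peak> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (bin, power) in powers.enumerate() {
        heap.push(Peak { bin, power });
        if heap.len() > k {
            heap.pop(); // drops the current weakest peak
        }
    }
    let mut peaks = heap.into_vec();
    peaks.sort_by(|a, b| b.power.total_cmp(&a.power)); // strongest first
    peaks
}

/// Averaged periodogram power of one bin over all segments
/// (naive DFT, no window; assumes `segments` is non-empty).
fn bin_power(segments: &[&[f32]], bin: usize) -> f32 {
    let mut acc = 0.0f32;
    for seg in segments {
        let n = seg.len() as f32;
        let (mut re, mut im) = (0.0f32, 0.0f32);
        for (t, &x) in seg.iter().enumerate() {
            let phase = -2.0 * std::f32::consts::PI * bin as f32 * t as f32 / n;
            re += x * phase.cos();
            im += x * phase.sin();
        }
        acc += (re * re + im * im) / n;
    }
    acc / segments.len() as f32
}

/// Top-K (bin, power) over the one-sided spectrum; no PSD vector is allocated.
pub fn welch_peaks_sketch(segments: &[&[f32]], n_fft: usize, k: usize) -> Vec<Peak> {
    let bins = n_fft / 2 + 1;
    top_k_streaming((0..bins).map(|b| bin_power(segments, b)), k)
}
```

Passing one segment corresponds to the ultra-fast setting and two segments to the fast setting; memory use stays at O(K) regardless of `n_fft`.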
CLI bench
The bench covers four typical runs:
- Auto strategy (default): picks the fastest path for the given batch and device.
- Batch sweep with Metal GPU crossover.
- Forced strategy (see the strategy table below).
- K sweep: latency vs top-K, with JSON rows tagged with batch and k.
Sweep output ends with a K crossover table (rustfft / stream / rlx / picker latency in ms for each K). Combine with --batch for a full grid, e.g. --batch 32,8192 --k 4,16,64.
| Flag | Default | Description |
|---|---|---|
| --n-fft | 256 | FFT size |
| --batch | 32 | Batch size, CSV (32,1024), or power-of-two range (32-8192) |
| --k | 16 | Peaks per row; CSV (4,8,16,32) or power-of-two range (4-64) for K sweep |
| --device | auto | cpu, metal, cuda, … |
| --strategy | auto | auto, ultra, fast, rlx, learned |
| --train-steps | 200 | Train a lightweight learned model (0 to skip); uses --k for peak loss |
| --iters | 50 | Timing iterations |
| --no-compiled | - | Skip explicit RLX/learned compiled baseline rows |
| --no-ultra-fast | - | Skip ultra-fast baseline row |
| --json PATH | - | Write JSON report |
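Both --batch and --k accept a single value, a CSV list, or a power-of-two range. A minimal sketch of how such a spec could be expanded is shown below; `parse_sweep_spec` is a hypothetical helper, and the choice to double from the range start until the end is exceeded is an assumption, not the bench's confirmed behavior.

```rust
/// Expand a sweep spec into concrete values.
/// Accepts "32", "32,1024", or an inclusive doubling range "32-8192".
/// Hypothetical helper: the real CLI parser may validate differently.
pub fn parse_sweep_spec(spec: &str) -> Result<Vec<usize>, String> {
    let spec = spec.trim();
    if let Some((lo, hi)) = spec.split_once('-') {
        let lo_val = lo
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("bad range start {lo:?}: {e}"))?;
        let hi_val = hi
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("bad range end {hi:?}: {e}"))?;
        if lo_val == 0 || lo_val > hi_val {
            return Err(format!("invalid range {lo_val}-{hi_val}"));
        }
        // Assumption: step by doubling from the start value.
        let mut out = Vec::new();
        let mut v = lo_val;
        while v <= hi_val {
            out.push(v);
            match v.checked_mul(2) {
                Some(next) => v = next,
                None => break,
            }
        }
        return Ok(out);
    }
    spec.split(',')
        .map(|s| {
            s.trim()
                .parse::<usize>()
                .map_err(|e| format!("bad value {s:?}: {e}"))
        })
        .collect()
}
```

Under these assumptions, the grid example `--batch 32,8192 --k 4,16,64` expands to 2 × 3 = 6 (batch, K) combinations.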
Bench output includes a welch_peaks_picker_<strategy> row using auto or forced selection, e.g.:
[welch-peaks] picker (auto): batch=8192 device=Metal -> rlx_compiled
Strategy picker
Use AutoWelchPeaks in Rust or --strategy on the CLI.
| Strategy | Label | When to use |
|---|---|---|
| auto | (resolved at runtime) | Default — picks from batch + device |
| ultra | ultra_fast_rustfft | Smallest batch, lowest latency (1 segment) |
| fast | fast_streaming_rustfft | CPU / mid batch; best accuracy vs speed on rustfft |
| rlx | rlx_compiled | Large batch on GPU (Metal/CUDA/…) |
| learned | learned_compiled | Large batch + sparse learned gates + trained model |
Auto selection rules
| Condition | Picked strategy |
|---|---|
| batch ≤ 256 on CPU, ≤ 128 on GPU | ultra (1 segment) |
| Mid batch | fast (2 segments, streaming top-K) |
| batch ≥ 8192 on GPU | rlx (compiled Op::Fft) |
| GPU + batch ≥ 8192 + learned model with <25% active gates | learned |
Threshold helpers: rlx_crossover_batch(device) → 8192 on GPU; ultra_fast_max_batch(device) → 256 CPU / 128 GPU.
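The selection rules above can be sketched as a small decision function. Everything here is a local stand-in: `DeviceKind`, `Strategy`, and `pick_strategy` are hypothetical, the two threshold functions only mirror the documented values (their real signatures may differ), returning `None` for the CPU crossover is an assumption, and so is giving the learned path precedence over rlx when both conditions hold.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceKind {
    Cpu,
    Gpu,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    UltraFast,
    FastStreaming,
    RlxCompiled,
    LearnedCompiled,
}

/// Stand-in for the documented threshold: 256 on CPU, 128 on GPU.
pub fn ultra_fast_max_batch(device: DeviceKind) -> usize {
    match device {
        DeviceKind::Cpu => 256,
        DeviceKind::Gpu => 128,
    }
}

/// Stand-in for the documented crossover: 8192 on GPU.
/// Assumption: no crossover on CPU, modeled as `None`.
pub fn rlx_crossover_batch(device: DeviceKind) -> Option<usize> {
    match device {
        DeviceKind::Gpu => Some(8192),
        DeviceKind::Cpu => None,
    }
}

/// Pick a strategy from batch size, device, and (optionally) the fraction
/// of active gates in a learned model, in the range 0.0..=1.0.
pub fn pick_strategy(
    batch: usize,
    device: DeviceKind,
    learned_active_gates: Option<f32>,
) -> Strategy {
    if batch <= ultra_fast_max_batch(device) {
        return Strategy::UltraFast;
    }
    if let Some(crossover) = rlx_crossover_batch(device) {
        if batch >= crossover {
            // Assumption: a sufficiently sparse learned model wins over rlx.
            return match learned_active_gates {
                Some(frac) if frac < 0.25 => Strategy::LearnedCompiled,
                _ => Strategy::RlxCompiled,
            };
        }
    }
    Strategy::FastStreaming
}
```

In this sketch, "mid batch" is simply whatever falls between the two thresholds, which is why a CPU run with batch 8192 still resolves to the fast streaming path.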
Reference peaks for training/bench error always use full 8-segment Welch; student paths use 1–2 segments.
Rust API
use ;
// Auto (recommended)
let mut picker = new?;
println!;
// Force a strategy
let mut picker = with_strategy?;
// Parse CLI-style string
let mode = parse_welch_peaks_strategy?; // Force(FastStreaming)
let mut picker = with_options?;
// With learned model (for learned strategy or auto sparse-gate path)
let mut picker = with_learned?;
// signal: [batch × full_welch_frame] — 8-segment layout buffer
let peaks = picker.welch_peaks_batch?; // [batch, k, 2] packed (bin, power)
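Reading the packed `[batch, k, 2]` result can be sketched as below. This assumes the output is a flat, row-major `f32` buffer with the bin index stored as a float; the actual element type and container are not stated here, so treat `row_peaks` and `bin_to_hz` as hypothetical helpers.

```rust
/// Iterate the (bin, power) pairs of one batch row in a flat,
/// row-major [batch, k, 2] buffer.
/// Assumption: bins are stored as f32 and cast back to usize.
pub fn row_peaks(
    packed: &[f32],
    k: usize,
    row: usize,
) -> impl Iterator<Item = (usize, f32)> + '_ {
    let start = row * k * 2;
    packed[start..start + k * 2]
        .chunks_exact(2)
        .map(|pair| (pair[0] as usize, pair[1]))
}

/// Convert an FFT bin index to a frequency in Hz.
pub fn bin_to_hz(bin: usize, n_fft: usize, sample_rate_hz: f32) -> f32 {
    bin as f32 * sample_rate_hz / n_fft as f32
}
```

For example, with `n_fft = 256` and a 16 kHz sample rate, bin 64 corresponds to 4000 Hz.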
Strategy string aliases (for parse_welch_peaks_strategy / --strategy):
| Input | Maps to |
|---|---|
| auto | Auto pick |
| ultra, ultra-fast, 1seg | UltraFast |
| fast, streaming, rustfft, 2seg | FastStreaming |
| rlx, compiled, gpu | RlxCompiled |
| learned, learned_compiled | LearnedCompiled |
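The alias table maps naturally onto a single `match`. The sketch below is not the crate's `parse_welch_peaks_strategy`: `StrategyKind`, `StrategyMode`, and `parse_strategy_sketch` are hypothetical names, and trimming plus case-insensitive matching are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategyKind {
    UltraFast,
    FastStreaming,
    RlxCompiled,
    LearnedCompiled,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategyMode {
    Auto,
    Force(StrategyKind),
}

/// Map a CLI-style strategy string to a mode, following the alias table.
/// Assumption: input is trimmed and matched case-insensitively.
pub fn parse_strategy_sketch(input: &str) -> Result<StrategyMode, String> {
    use StrategyKind::*;
    let key = input.trim().to_ascii_lowercase();
    let mode = match key.as_str() {
        "auto" => StrategyMode::Auto,
        "ultra" | "ultra-fast" | "1seg" => StrategyMode::Force(UltraFast),
        "fast" | "streaming" | "rustfft" | "2seg" => StrategyMode::Force(FastStreaming),
        "rlx" | "compiled" | "gpu" => StrategyMode::Force(RlxCompiled),
        "learned" | "learned_compiled" => StrategyMode::Force(LearnedCompiled),
        other => return Err(format!("unknown strategy: {other}")),
    };
    Ok(mode)
}
```

Returning an error for unknown strings, rather than silently falling back to auto, keeps typos in `--strategy` visible.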
Performance notes (n=256, Apple Silicon reference)
| Batch | Best auto pick (typical) | vs full Welch |
|---|---|---|
| 32 | ultra (~0.04 ms CPU) | ~4–5× faster |
| 1024 | fast streaming | ~3× faster |
| 8192 | rlx Metal (~40 ms) | ~2× faster than rustfft fast at this batch |
RLX compiled paths need large batch to amortize GPU launch; rustfft wins at small batch.
Training peaks into the learned model
End-to-end training includes a peak-matching loss on the fast 2-segment path. --k / --peak-k sets how many spikes are matched during training and at inference (learned, compiled-learned, and picker learned strategy).
# bench-e2e: same K for WelchPeaks pipelines + teacher training
At inference, FastLearnedFftModel::welch_peaks_batch accepts any WelchPeakParams::fast_for_n_fft(n_fft, k) — K is not baked into weights, but training with the target K improves peak accuracy.
Tests
Modules
| Module | Role |
|---|---|
| peak | WelchPeakParams, streaming top-K, WelchPeaksScratch |
| welch_peaks_picker | AutoWelchPeaks, auto/forced strategy |
| welch_peaks_compile | CompiledRlxWelchPeaks, CompiledLearnedWelchPeaks |
| bench_welch_peaks | CLI bench + batch sweep |