audiobook-creation-exchange
Pure-Rust DSP crate that post-processes AI-generated speech audio to meet ACX (Audiobook Creation Exchange) submission requirements.
Standards enforced (defaults)
| Metric | Target | Notes |
|---|---|---|
| RMS | −20.5 dBFS (−23 … −18 window) | ACX spec |
| True-peak | ≤ −3 dBFS | ACX spec |
| Noise floor | ≤ −60 dBFS | ACX spec |
| Integrated LUFS | measured | ITU-R BS.1770-4 |
| Loudness Range | measured | EBU R 128 |
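As a rough illustration of how the RMS and peak rows can be measured, here is a minimal sketch. It is not the crate's implementation: the function names are hypothetical, and the peak here is a plain sample peak, without the oversampling a true-peak reading needs.

```rust
/// RMS level of 16-bit PCM in dBFS, with 0 dBFS at full scale (32768).
fn rms_dbfs(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let mean_square = samples
        .iter()
        .map(|&s| {
            let x = f64::from(s) / 32768.0;
            x * x
        })
        .sum::<f64>()
        / samples.len() as f64;
    10.0 * mean_square.log10()
}

/// Sample peak in dBFS. This is NOT a true-peak measurement: inter-sample
/// peaks only show up after oversampling, which this sketch omits.
fn sample_peak_dbfs(samples: &[i16]) -> f64 {
    let peak = samples
        .iter()
        .map(|&s| i32::from(s).abs())
        .max()
        .unwrap_or(0);
    20.0 * (f64::from(peak) / 32768.0).log10()
}

/// Hypothetical helper: RMS inside the -23 to -18 dBFS window and
/// sample peak at or below -3 dBFS.
fn within_default_targets(samples: &[i16]) -> bool {
    let rms = rms_dbfs(samples);
    (-23.0..=-18.0).contains(&rms) && sample_peak_dbfs(samples) <= -3.0
}
```

A constant signal at half scale measures about −6.02 dBFS on both functions, so it fails the RMS window even though its peak is legal.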
Processing pipeline
Applied in order by process():
- Click suppression — removes sub-10 ms transient spikes via cubic Hermite interpolation before any spectral processing.
- DC offset removal — subtracts the mean sample value; prevents clicks at chapter edit points.
- Noise reduction — Wiener spectral subtraction profiled from the leading silence; disabled by default (enable via AcxConfig::denoise_enabled).
- EQ warmth — low-shelf (+2 dB @ 180 Hz) and high-shelf (+1.5 dB @ 5 kHz) biquad IIR shelving filters.
- De-essing — OLA STFT; reduces 5–8 kHz sibilance band by up to −6 dB when its energy ratio exceeds threshold.
- Plosive suppression — same OLA STFT approach; attenuates sub-150 Hz bins by −6 dB in windows with excessive plosive energy.
- Multiband compression — 3-band Linkwitz-Riley 4th-order crossover compressor; controls dynamics per band before normalisation.
- Normalise — linear gain targeting −20.5 dBFS, pre-compensated for the energy dilution from the 1 s head + 3 s tail room-tone bookends.
- Brickwall limiter — 5 ms lookahead; no sample (including 4× interpolated inter-sample peaks) exceeds −3 dBFS.
- Breath removal — replaces breath-band windows with room tone; disabled by default.
- Pause normalisation — caps over-long inter-sentence pauses to natural targets (sentence 120 ms, paragraph 400 ms, scene break 700 ms).
- Noise gate → room tone — sub-threshold 50 ms windows replaced with synthesised 1/f pink noise at −62 dBFS.
- Bookend padding — first 1 s and last 3 s forced to room tone with a 10 ms crossfade at each boundary.
- Compliance verify — second analysis pass; returns Err(AcxError::StillNonCompliant) only when audio cannot be brought into compliance.
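Two of the simpler stages, DC offset removal and RMS normalisation, can be sketched as follows. This is an illustrative version on `f32` samples in the range −1 to 1, not the crate's code. The pre-compensation function assumes the dilution is modelled as a duration-weighted power average of speech and room tone; that model and all function names are assumptions.

```rust
/// Subtract the mean so the waveform is centred on zero.
fn remove_dc(samples: &mut [f32]) {
    if samples.is_empty() {
        return;
    }
    let mean = samples.iter().map(|&x| f64::from(x)).sum::<f64>() / samples.len() as f64;
    for x in samples.iter_mut() {
        *x -= mean as f32;
    }
}

/// Linear RMS (1.0 = full scale).
fn rms(samples: &[f32]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let mean_square =
        samples.iter().map(|&x| f64::from(x) * f64::from(x)).sum::<f64>() / samples.len() as f64;
    mean_square.sqrt()
}

/// Apply one linear gain so the RMS lands on `target_dbfs`.
fn normalise_rms(samples: &mut [f32], target_dbfs: f64) {
    let current = rms(samples);
    if current == 0.0 {
        return; // silence cannot be scaled to a target
    }
    let gain = (10f64.powf(target_dbfs / 20.0) / current) as f32;
    for x in samples.iter_mut() {
        *x *= gain;
    }
}

/// Assumed model: the file is `total_s` seconds long, of which `bookend_s`
/// seconds are room tone at `room_tone_dbfs`. Returns the RMS the speech
/// body needs so that the whole file averages to `overall_dbfs`.
fn precompensated_target_dbfs(
    overall_dbfs: f64,
    room_tone_dbfs: f64,
    total_s: f64,
    bookend_s: f64,
) -> f64 {
    let p_overall = 10f64.powf(overall_dbfs / 10.0);
    let p_room = 10f64.powf(room_tone_dbfs / 10.0);
    let body_s = total_s - bookend_s;
    let p_body = (p_overall * total_s - p_room * bookend_s) / body_s;
    10.0 * p_body.log10()
}
```

Under that model, a 40 s file with 4 s of bookends at −62 dBFS needs its speech body at roughly −20.04 dBFS for the whole file to measure −20.5 dBFS.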
Quick start
use audiobook_creation_exchange as acx;
let raw_pcm: Vec<u8> = tts_engine_output; // L16-LE mono bytes from your TTS engine
// Optional read-only diagnostic pass
let diag = acx::validate(/* … */)?;
if diag.has_dc_offset { /* … */ }
// Full pipeline
let processed = acx::process(/* … */)?;
LoudnessPreset config factory
Map a delivery character to a tuned AcxConfig within or adjacent to the ACX window:
use LoudnessPreset;
// Intimate, close-mic narration (−22 dBFS target)
let cfg = LoudnessPreset::Whispered.config();
let pcm = acx::process_with_config(/* … */)?;
// Forward-placed, authoritative narration (−20 dBFS target)
let cfg = LoudnessPreset::Projected.config();
let pcm = acx::process_with_config(/* … */)?;
| Variant | RMS target | Window |
|---|---|---|
| Whispered | −22 dBFS | −23 … −21 |
| Soft | −21 dBFS | −23 … −19 |
| Standard | −20.5 dBFS | −23 … −18 (default) |
| Projected | −20 dBFS | −22 … −18 |
| Loud | −19 dBFS | −21 … −17 |
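To make the table concrete, the mapping from preset to target and window could be modelled like this. The `Preset` enum and its methods are hypothetical stand-ins written for illustration; they are not the crate's `LoudnessPreset` type, and only the numbers come from the table.

```rust
/// Hypothetical mirror of the preset table; not the crate's own type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Preset {
    Whispered,
    Soft,
    Standard,
    Projected,
    Loud,
}

impl Preset {
    /// (target, window low, window high), all in dBFS, copied from the table.
    fn levels(self) -> (f64, f64, f64) {
        match self {
            Preset::Whispered => (-22.0, -23.0, -21.0),
            Preset::Soft => (-21.0, -23.0, -19.0),
            Preset::Standard => (-20.5, -23.0, -18.0),
            Preset::Projected => (-20.0, -22.0, -18.0),
            Preset::Loud => (-19.0, -21.0, -17.0),
        }
    }

    /// True when a measured RMS falls inside this preset's window.
    fn accepts(self, rms_dbfs: f64) -> bool {
        let (_, low, high) = self.levels();
        (low..=high).contains(&rms_dbfs)
    }
}
```

Note that the Loud window extends to −17 dBFS, which is outside the −23 … −18 ACX window; that is the "adjacent" case mentioned above.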
Batch loudness consistency
Check that all segments in a series have matching loudness:
use ;
use DEFAULT_TOLERANCE_DB;
let segments: Vec<_> = chapters
.iter()
.map(/* … */)
.collect();
let refs: Vec<_> = segments.iter().map(/* … */).collect();
let report = consistency_check(/* … */);
if !report.compliant { /* … */ }
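One plausible way to define "matching loudness" is to compare the RMS level of every segment and require the spread to stay within a tolerance. The sketch below does that; it is an assumption about the idea, not the crate's `consistency_check`, and the tolerance is passed in because the value of `DEFAULT_TOLERANCE_DB` is not stated here.

```rust
/// RMS level of 16-bit PCM in dBFS (0 dBFS = 32768).
fn rms_dbfs(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let mean_square = samples
        .iter()
        .map(|&s| (f64::from(s) / 32768.0).powi(2))
        .sum::<f64>()
        / samples.len() as f64;
    10.0 * mean_square.log10()
}

/// Difference in dB between the loudest and the quietest segment.
/// Returns None when there are no segments to compare.
fn rms_spread_db(segments: &[&[i16]]) -> Option<f64> {
    if segments.is_empty() {
        return None;
    }
    let levels: Vec<f64> = segments.iter().map(|s| rms_dbfs(s)).collect();
    let max = levels.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = levels.iter().cloned().fold(f64::INFINITY, f64::min);
    Some(max - min)
}

/// Hypothetical check: every segment sits within `tolerance_db` of the others.
fn is_consistent(segments: &[&[i16]], tolerance_db: f64) -> bool {
    rms_spread_db(segments).map_or(true, |spread| spread <= tolerance_db)
}
```

Two segments at half and quarter scale differ by about 6 dB, so they would fail a 1 dB tolerance.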
Chapter crossfade
Seamless equal-power transition between two consecutive chapters:
use crossfade;
let joined = crossfade(/* … */);
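For readers unfamiliar with the term, an equal-power crossfade uses cosine and sine gain curves so that the squared gains always sum to one. A minimal sketch on `f32` samples follows; the function name, the signature, and the overlap expressed in samples are assumptions, not the crate's `crossfade` API.

```rust
use std::f32::consts::FRAC_PI_2;

/// Join `a` and `b`, overlapping the last `overlap` samples of `a` with the
/// first `overlap` samples of `b`. The overlap is clamped to the shorter input.
fn equal_power_crossfade(a: &[f32], b: &[f32], overlap: usize) -> Vec<f32> {
    let n = overlap.min(a.len()).min(b.len());
    let mut out = Vec::with_capacity(a.len() + b.len() - n);

    // Untouched head of the first chapter.
    out.extend_from_slice(&a[..a.len() - n]);

    // Overlap region: cos^2 + sin^2 = 1 keeps the summed power constant
    // for uncorrelated material.
    for i in 0..n {
        let t = (i as f32 + 0.5) / n as f32;
        let theta = t * FRAC_PI_2;
        let fade_out = theta.cos();
        let fade_in = theta.sin();
        out.push(a[a.len() - n + i] * fade_out + b[i] * fade_in);
    }

    // Untouched tail of the second chapter.
    out.extend_from_slice(&b[n..]);
    out
}
```

The output is shorter than the two inputs combined by exactly the overlap length.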
Custom config
Every pipeline stage can be individually toggled or tuned via AcxConfig:
use AcxConfig;
use std::time::Duration;
let cfg = AcxConfig { /* … */ };
let processed = acx::process_with_config(/* … */)?;
Enabling noise reduction
denoise_enabled is false by default because the Wiener subtraction profiles noise from the first denoise_profile_ms milliseconds of audio. If the signal starts with speech rather than room tone, those speech frames are treated as noise and attenuated. Enable it only when your pipeline guarantees a silent lead-in:
let cfg = AcxConfig { denoise_enabled: true, /* … */ };
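If the lead-in cannot be guaranteed upstream, one option is to measure it before turning the flag on. The following is a sketch of such a guard, assuming `i16` samples; the function names and the −50 dBFS threshold used in the usage note are illustrative choices, not values taken from the crate.

```rust
/// RMS level of 16-bit PCM in dBFS (0 dBFS = 32768).
fn rms_dbfs(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let mean_square = samples
        .iter()
        .map(|&s| (f64::from(s) / 32768.0).powi(2))
        .sum::<f64>()
        / samples.len() as f64;
    10.0 * mean_square.log10()
}

/// True when the first `profile_ms` milliseconds are quieter than
/// `threshold_dbfs`, i.e. plausibly room tone rather than speech.
/// Returns false if the clip is shorter than the profile window.
fn lead_in_is_quiet(
    samples: &[i16],
    sample_rate: u32,
    profile_ms: u32,
    threshold_dbfs: f64,
) -> bool {
    let n = (u64::from(sample_rate) * u64::from(profile_ms) / 1000) as usize;
    if n == 0 || samples.len() < n {
        return false;
    }
    rms_dbfs(&samples[..n]) <= threshold_dbfs
}
```

A caller could then set `denoise_enabled` only when, for example, `lead_in_is_quiet(&samples, 24_000, profile_ms, -50.0)` holds.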
Diagnostic API
validate() is read-only — it never modifies audio. The returned DiagnosticReport covers:
- ACX core — RMS, true-peak (4× oversampled), noise floor, overall compliance flag
- DC offset — mean deviation as fraction of full scale
- Spectral — FFT sibilance (4–10 kHz) and plosive (20–150 Hz) violations with timestamps
- Temporal — dead air violations (> 10 s contiguous silence), head/tail bookend status, digital zero run count
- LUFS/LRA — integrated loudness per ITU-R BS.1770-4 and loudness range per EBU R 128
let diag = acx::validate(/* … */)?;
// Spectral violations with timestamps (std::time::Duration)
for v in &diag.spectral_violations { /* … */ }
// Dead air blocks
if let Some(/* … */) = diag.dead_air_violations.iter().map(/* … */).max() { /* … */ }
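The temporal checks come down to finding long runs of quiet samples. Below is a simplified sketch of that idea, assuming silence is judged per sample against an absolute threshold; the crate may well use windowed RMS instead, and the function name and return shape are hypothetical.

```rust
use std::time::Duration;

/// Find runs of samples with |s| <= `max_abs` that last at least `min_len`.
/// Returns (start time, run length) pairs.
fn silent_runs(
    samples: &[i16],
    sample_rate: u32,
    max_abs: i16,
    min_len: Duration,
) -> Vec<(Duration, Duration)> {
    let min_samples = (min_len.as_secs_f64() * f64::from(sample_rate)).ceil() as usize;
    let to_time = |n: usize| Duration::from_secs_f64(n as f64 / f64::from(sample_rate));

    let mut runs = Vec::new();
    let mut start: Option<usize> = None;

    for (i, &s) in samples.iter().enumerate() {
        let quiet = i32::from(s).abs() <= i32::from(max_abs);
        match (quiet, start) {
            // A quiet run begins here.
            (true, None) => start = Some(i),
            // A quiet run ends here; keep it if it was long enough.
            (false, Some(begin)) => {
                if i - begin >= min_samples {
                    runs.push((to_time(begin), to_time(i - begin)));
                }
                start = None;
            }
            _ => {}
        }
    }

    // A run that reaches the end of the clip.
    if let Some(begin) = start {
        if samples.len() - begin >= min_samples {
            runs.push((to_time(begin), to_time(samples.len() - begin)));
        }
    }
    runs
}
```

With `max_abs` set to 0 the same routine counts digital-zero runs; with a small non-zero threshold and a 10 s minimum it approximates the dead-air check.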
MP3 validation (after encoding)
let mp3_bytes: Vec<u8> = encoder_output;
let cbr = check_cbr(/* … */);
assert!(/* … */);
let id3 = check_id3_tags(/* … */);
if !id3.complete { /* … */ }
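As background for the ID3 side of this check, an ID3v2 tag starts with a 10-byte header: the ASCII bytes `ID3`, two version bytes, one flags byte, and a 4-byte "synchsafe" size that uses only the low 7 bits of each byte. The sketch below parses just that header and locates the first MP3 frame sync. It does not implement CBR verification, it ignores the optional ID3v2.4 footer, and all function names are hypothetical rather than the crate's API.

```rust
/// Size in bytes of the ID3v2 tag body (excluding the 10-byte header),
/// or None when the buffer does not start with a well-formed ID3v2 header.
fn id3v2_tag_size(bytes: &[u8]) -> Option<u32> {
    if bytes.len() < 10 || !bytes.starts_with(b"ID3") {
        return None;
    }
    let size = &bytes[6..10];
    // Synchsafe integers never set the top bit of any byte.
    if size.iter().any(|&b| (b & 0x80) != 0) {
        return None;
    }
    Some(size.iter().fold(0u32, |acc, &b| (acc << 7) | u32::from(b)))
}

/// Offset where audio frames are expected to begin (0 when there is no tag).
fn first_frame_offset(bytes: &[u8]) -> usize {
    id3v2_tag_size(bytes).map_or(0, |n| 10 + n as usize)
}

/// An MPEG audio frame header begins with 11 set bits.
fn has_frame_sync(bytes: &[u8]) -> bool {
    bytes.len() >= 2 && bytes[0] == 0xFF && (bytes[1] & 0xE0) == 0xE0
}
```

A full CBR check would go on to decode the bitrate field of every frame header and confirm they all agree.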
Module overview
| Module | Role |
|---|---|
| analyse | RMS, true-peak (4× oversampled), noise floor, full AcxReport |
| normalise | Linear gain to target RMS |
| limiter | 5 ms lookahead brickwall limiter |
| gate | Sub-threshold window replacement + bookend padding |
| room_tone | Voss-McCartney 1/f pink noise generator |
| dc_offset | Mean-offset measurement and removal |
| click | Sub-10 ms transient spike suppression via cubic Hermite interpolation |
| denoise | Wiener spectral subtraction noise reduction |
| eq | Low-shelf and high-shelf biquad IIR warmth EQ |
| deess | OLA STFT de-esser (5–8 kHz sibilance reduction) |
| plosive | OLA STFT plosive suppressor (sub-150 Hz shelving) |
| multiband | 3-band Linkwitz-Riley compressor (250 Hz / 3 kHz crossovers) |
| pause_norm | Inter-sentence pause normaliser (sentence / paragraph / scene-break) |
| breath | Breath window detection and room-tone replacement |
| crossfade | Equal-power chapter crossfade |
| consistency | Inter-segment RMS variance check |
| loudness_preset | LoudnessPreset → AcxConfig factory |
| spectral | FFT-based sibilance and plosive detection |
| temporal | Dead air, head/tail bookends, digital zero runs |
| bitstream | MP3 CBR frame header and ID3v2 tag validation |
| lufs | Integrated LUFS (ITU-R BS.1770-4) and loudness range (EBU R 128) |
Audio format
All DSP functions operate on L16 mono PCM — raw i16 little-endian samples with no file header — at the sample rate passed by the caller. 24 000 Hz is the typical output rate for speech synthesis engines.
bytes_to_samples() / samples_to_bytes() convert between the Vec<u8> that TTS APIs return and the Vec<i16> the DSP functions use.
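The conversion itself is small. Here is a sketch of what such helpers typically look like, using only the standard library; the names are deliberately different from the crate's `bytes_to_samples()` / `samples_to_bytes()` because the exact signatures and edge-case behaviour of those functions are not shown here.

```rust
/// Decode raw little-endian 16-bit PCM. A trailing odd byte, if any,
/// is ignored in this sketch.
fn le_bytes_to_i16(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Encode samples back to raw little-endian bytes.
fn i16_to_le_bytes(samples: &[i16]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}
```

At 24 000 Hz mono, one second of audio is 24 000 samples, or 48 000 bytes.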