audiobook-creation-exchange 0.1.0

ACX-compliant audio post-processing: normalisation, limiting, gating, LUFS measurement, and spectral analysis for AI-generated speech audio.

Pure-Rust DSP crate that post-processes AI-generated speech audio to meet ACX (Audiobook Creation Exchange) submission requirements.


Standards enforced (defaults)

Metric            Target                                Notes
RMS               −20.5 dBFS (window −23 … −18 dBFS)    ACX spec
True-peak         ≤ −3 dBFS                             ACX spec
Noise floor       ≤ −60 dBFS                            ACX spec
Integrated LUFS   measured                              ITU-R BS.1770-4
Loudness Range    measured                              EBU R 128

Processing pipeline

Applied in order by process():

  1. Click suppression — removes sub-10 ms transient spikes via cubic Hermite interpolation before any spectral processing.
  2. DC offset removal — subtracts the mean sample value; prevents clicks at chapter edit points.
  3. Noise reduction — Wiener spectral subtraction profiled from the leading silence; disabled by default (enable via AcxConfig::denoise_enabled).
  4. EQ warmth — low-shelf (+2 dB @ 180 Hz) and high-shelf (+1.5 dB @ 5 kHz) biquad IIR shelving filters.
  5. De-essing — OLA STFT; reduces 5–8 kHz sibilance band by up to −6 dB when its energy ratio exceeds threshold.
  6. Plosive suppression — same OLA STFT approach; attenuates sub-150 Hz bins by −6 dB in windows with excessive plosive energy.
  7. Multiband compression — 3-band Linkwitz-Riley 4th-order crossover compressor; controls dynamics per band before normalisation.
  8. Normalise — linear gain targeting −20.5 dBFS, pre-compensated for the energy dilution from the 1 s head + 3 s tail room-tone bookends.
  9. Brickwall limiter — 5 ms lookahead; no sample (including 4× interpolated inter-sample peaks) exceeds −3 dBFS.
  10. Breath removal — replaces breath-band windows with room tone; disabled by default.
  11. Pause normalisation — caps over-long inter-sentence pauses to natural targets (sentence 120 ms, paragraph 400 ms, scene break 700 ms).
  12. Noise gate → room tone — sub-threshold 50 ms windows replaced with synthesised 1/f pink noise at −62 dBFS.
  13. Bookend padding — first 1 s and last 3 s forced to room tone with a 10 ms crossfade at each boundary.
  14. Compliance verify — second analysis pass; returns Err(AcxError::StillNonCompliant) only when audio cannot be brought into compliance.
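
Steps 2 and 8 can be illustrated with a minimal sketch. It assumes samples already converted to `f64` in the range −1.0 … 1.0, and it assumes that "pre-compensated for the energy dilution" means solving for the speech level that makes speech plus room tone average out to the target. The function names are hypothetical and are not the crate's API; `body` stands for the part of the file that is not room tone.

```rust
/// Subtract the mean so the waveform is centred on zero.
fn remove_dc(samples: &mut [f64]) {
    if samples.is_empty() {
        return;
    }
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    samples.iter_mut().for_each(|s| *s -= mean);
}

/// Mean of squared samples (linear power); 0.0 for an empty slice.
fn mean_power(samples: &[f64]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.iter().map(|s| s * s).sum::<f64>() / samples.len() as f64
}

/// Linear gain for the speech body so that the body plus `pad_len` samples of
/// room tone at `tone_db` averages out to `target_db` (both in dBFS).
/// Returns 1.0 when no sensible gain exists (silent body or unreachable target).
fn compensated_gain(body: &[f64], pad_len: usize, target_db: f64, tone_db: f64) -> f64 {
    let body_power = mean_power(body);
    if body_power == 0.0 {
        return 1.0;
    }
    let total_len = (body.len() + pad_len) as f64;
    let target_power = 10f64.powf(target_db / 10.0);
    let tone_power = 10f64.powf(tone_db / 10.0);
    // Power the body must carry once the quiet padding is averaged in.
    let needed = (total_len * target_power - pad_len as f64 * tone_power) / body.len() as f64;
    if needed <= 0.0 {
        1.0
    } else {
        (needed / body_power).sqrt()
    }
}
```

Because the room tone is far quieter than speech, the padding pulls the whole-file RMS down, so the gain comes out slightly higher than a plain "body to target" normalisation would give.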

Quick start

use audiobook_creation_exchange as acx;

let raw_pcm: Vec<u8> = tts_engine_output(); // L16-LE mono bytes from your TTS engine

// Optional read-only diagnostic pass
let diag = acx::validate(&raw_pcm, 24_000)?;
if diag.has_dc_offset {
    println!("DC offset: {:.4}", diag.dc_offset);
}

// Full pipeline
let processed = acx::process(&raw_pcm, 24_000)?;

LoudnessPreset config factory

Each LoudnessPreset variant maps a delivery style to a tuned AcxConfig whose RMS window lies within, or just outside, the ACX window:

use audiobook_creation_exchange::LoudnessPreset;

// Intimate, close-mic narration (−22 dBFS target)
let cfg = LoudnessPreset::Whispered.config();
let pcm = acx::process_with_config(&raw_pcm, 24_000, &cfg)?;

// Forward-placed, authoritative narration (−20 dBFS target)
let cfg = LoudnessPreset::Projected.config();
let pcm = acx::process_with_config(&raw_pcm, 24_000, &cfg)?;

Variant     RMS target    Window (dBFS)
Whispered   −22 dBFS      −23 … −21
Soft        −21 dBFS      −23 … −19
Standard    −20.5 dBFS    −23 … −18 (default)
Projected   −20 dBFS      −22 … −18
Loud        −19 dBFS      −21 … −17
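
The table can be read as a simple lookup. The sketch below uses a hypothetical `PresetSketch` enum (deliberately not named `LoudnessPreset`, whose internals are not shown here) to encode the same numbers and to check which windows sit entirely inside the ACX −23 … −18 dBFS range.

```rust
/// Hypothetical stand-in for the preset table above; not the crate's type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PresetSketch {
    Whispered,
    Soft,
    Standard,
    Projected,
    Loud,
}

impl PresetSketch {
    /// All variants, in the order used by the table.
    const ALL: [PresetSketch; 5] = [
        PresetSketch::Whispered,
        PresetSketch::Soft,
        PresetSketch::Standard,
        PresetSketch::Projected,
        PresetSketch::Loud,
    ];

    /// (target, window minimum, window maximum), all in dBFS.
    fn rms_window(self) -> (f64, f64, f64) {
        match self {
            PresetSketch::Whispered => (-22.0, -23.0, -21.0),
            PresetSketch::Soft => (-21.0, -23.0, -19.0),
            PresetSketch::Standard => (-20.5, -23.0, -18.0),
            PresetSketch::Projected => (-20.0, -22.0, -18.0),
            PresetSketch::Loud => (-19.0, -21.0, -17.0),
        }
    }

    /// True when the whole window lies inside the ACX RMS range.
    fn inside_acx_window(self) -> bool {
        let (_, lo, hi) = self.rms_window();
        lo >= -23.0 && hi <= -18.0
    }
}
```

With these numbers only `Loud` reaches past the ACX upper limit (its window tops out at −17 dBFS), which is the "adjacent" case.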

Batch loudness consistency

Check that all segments in a series have matching loudness:

use audiobook_creation_exchange::{bytes_to_samples, consistency_check};
use audiobook_creation_exchange::consistency::DEFAULT_TOLERANCE_DB;

// `chapters` holds one raw L16-LE mono byte buffer per chapter.
let segments: Vec<Vec<i16>> = chapters
    .iter()
    .map(|b| bytes_to_samples(b).unwrap())
    .collect();

let refs: Vec<&[i16]> = segments.iter().map(AsRef::as_ref).collect();
let report = consistency_check(&refs, DEFAULT_TOLERANCE_DB);

if !report.compliant {
    eprintln!(
        "Loudness spread {:.1} dB exceeds {:.1} dB tolerance",
        report.max_deviation_db, DEFAULT_TOLERANCE_DB
    );
    for (i, rms) in report.episode_rms_db.iter().enumerate() {
        eprintln!("  chapter {}: {:.1} dBFS", i + 1, rms);
    }
}
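
To make the idea of a loudness spread concrete, here is a minimal standalone sketch. It assumes the deviation is the gap between the loudest and quietest segment RMS; the crate may define `max_deviation_db` differently (for example, relative to the mean). `segment_rms_db` and `loudness_spread` are hypothetical names.

```rust
/// RMS level of one segment in dBFS (0 dBFS = 32768); negative infinity for silence.
fn segment_rms_db(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let power = samples
        .iter()
        .map(|&s| (f64::from(s) / 32_768.0).powi(2))
        .sum::<f64>()
        / samples.len() as f64;
    10.0 * power.log10()
}

/// Per-segment levels plus the gap (in dB) between the loudest and quietest one.
/// With fewer than two segments there is nothing to compare, so the gap is 0.0.
fn loudness_spread(segments: &[&[i16]]) -> (Vec<f64>, f64) {
    let levels: Vec<f64> = segments.iter().map(|s| segment_rms_db(s)).collect();
    let loudest = levels.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let quietest = levels.iter().copied().fold(f64::INFINITY, f64::min);
    let spread = if levels.len() < 2 { 0.0 } else { loudest - quietest };
    (levels, spread)
}
```

Halving the amplitude of one chapter lowers its RMS by about 6 dB, so two otherwise identical chapters at full and half amplitude produce a spread of roughly 6 dB.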

Chapter crossfade

Seamless equal-power transition between two consecutive chapters:

use audiobook_creation_exchange::crossfade;

let joined = crossfade(&chapter_a, &chapter_b, 80 /* ms */, 24_000);
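
For readers unfamiliar with the term, an equal-power crossfade weights the outgoing and incoming signals with cosine and sine curves so that the squared gains always sum to 1. The sketch below is a generic illustration under the assumption that the tail of the first chapter overlaps the head of the second; whether the crate overlaps or butts the chapters together is not stated here, and `equal_power_crossfade` is a hypothetical name.

```rust
use std::f64::consts::FRAC_PI_2;

/// Join `a` and `b`, overlapping the last `fade_ms` of `a` with the first `fade_ms` of `b`.
/// The overlap is shortened if either input is shorter than the requested fade.
fn equal_power_crossfade(a: &[i16], b: &[i16], fade_ms: u32, sample_rate: u32) -> Vec<i16> {
    let wanted = (u64::from(fade_ms) * u64::from(sample_rate) / 1000) as usize;
    let n = wanted.min(a.len()).min(b.len());
    let mut out = Vec::with_capacity(a.len() + b.len() - n);
    out.extend_from_slice(&a[..a.len() - n]);
    for i in 0..n {
        // t moves from just above 0 to just below 1 across the overlap.
        let t = (i as f64 + 0.5) / n as f64;
        let fade_out = (t * FRAC_PI_2).cos();
        let fade_in = (t * FRAC_PI_2).sin();
        let mixed = f64::from(a[a.len() - n + i]) * fade_out + f64::from(b[i]) * fade_in;
        out.push(mixed.round().clamp(-32_768.0, 32_767.0) as i16);
    }
    out.extend_from_slice(&b[n..]);
    out
}
```

Equal-power curves keep perceived loudness steady for uncorrelated material such as two different room-tone tails. For identical (fully correlated) inputs the midpoint rises by about 3 dB, which is why the result is clamped to the `i16` range.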

Custom config

Every pipeline stage can be individually toggled or tuned via AcxConfig:

use audiobook_creation_exchange::AcxConfig;
use time::Duration;

let cfg = AcxConfig {
    rms_target_db: -21.5,
    rms_min_db: -22.0,
    rms_max_db: -21.0,
    peak_ceiling_db: -3.0,
    noise_floor_max_db: -60.0,
    silence_threshold_db: -65.0,
    room_tone_db: -62.0,
    dead_air_limit: Duration::seconds(10),
    sibilance_ratio_threshold: 0.55,
    plosive_ratio_threshold: 0.35,
    // click suppression
    click_suppression_enabled: true,
    // noise reduction (requires leading silence for profiling)
    denoise_enabled: false,
    denoise_profile_ms: 200,
    denoise_oversubtraction: 1.5,
    denoise_spectral_floor: 0.1,
    // warmth EQ
    eq_enabled: true,
    eq_low_shelf_db: 2.0,
    eq_high_shelf_db: 1.5,
    // de-essing
    deess_enabled: true,
    deess_threshold_ratio: 0.45,
    deess_max_reduction_db: 6.0,
    // plosive suppression
    plosive_suppression_enabled: true,
    plosive_attenuation_db: 6.0,
    // multiband compression
    multiband_enabled: true,
    // breath removal
    breath_removal_enabled: false,
    // pause normalisation
    pause_norm_enabled: true,
    pause_sentence_target_ms: 120,
    pause_paragraph_target_ms: 400,
    pause_scene_target_ms: 700,
};
let processed = acx::process_with_config(&raw_pcm, 24_000, &cfg)?;

Enabling noise reduction

denoise_enabled is false by default because the Wiener subtraction profiles noise from the first denoise_profile_ms milliseconds of audio. If the signal starts with speech rather than room tone, those speech frames are treated as noise and attenuated. Enable it only when your pipeline guarantees a silent lead-in:

let cfg = AcxConfig {
    denoise_enabled: true,
    denoise_profile_ms: 200, // profile the first 200 ms as noise
    ..AcxConfig::default()
};
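
One way to guard against a speech-first signal is to inspect the profiling window before enabling the stage. The sketch below is an assumption-laden illustration, not part of the crate: `lead_in_peak_dbfs` and `lead_in_is_quiet` are hypothetical helpers, and the peak threshold passed in by the caller is a value you would tune for your own TTS engine.

```rust
/// Number of samples covered by `ms` milliseconds at `sample_rate`.
fn window_len(sample_rate: u32, ms: u32) -> usize {
    (u64::from(sample_rate) * u64::from(ms) / 1000) as usize
}

/// Sample-peak level, in dBFS, of the first `profile_ms` milliseconds.
/// Returns negative infinity when the window is empty or digitally silent.
fn lead_in_peak_dbfs(samples: &[i16], sample_rate: u32, profile_ms: u32) -> f64 {
    let n = window_len(sample_rate, profile_ms).min(samples.len());
    let peak = samples[..n]
        .iter()
        .map(|&s| i32::from(s).abs())
        .max()
        .unwrap_or(0);
    20.0 * (f64::from(peak) / 32_768.0).log10()
}

/// True when the audio is long enough to profile and the profiling window
/// never rises above `max_peak_db` dBFS.
fn lead_in_is_quiet(samples: &[i16], sample_rate: u32, profile_ms: u32, max_peak_db: f64) -> bool {
    samples.len() >= window_len(sample_rate, profile_ms)
        && lead_in_peak_dbfs(samples, sample_rate, profile_ms) <= max_peak_db
}
```

The boolean result could then be used as the value of `denoise_enabled` when building the config, so that clips starting with speech skip noise reduction instead of having their opening syllables treated as noise.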

Diagnostic API

validate() is read-only — it never modifies audio. The returned DiagnosticReport covers:

  • ACX core — RMS, true-peak (4× oversampled), noise floor, overall compliance flag
  • DC offset — mean deviation as fraction of full scale
  • Spectral — FFT sibilance (4–10 kHz) and plosive (20–150 Hz) violations with timestamps
  • Temporal — dead air violations (> 10 s contiguous silence), head/tail bookend status, digital zero run count
  • LUFS/LRA — integrated loudness per ITU-R BS.1770-4 and loudness range per EBU R 128

let diag = acx::validate(&raw_pcm, 24_000)?;

// Spectral violations with timestamps (time::Duration)
for v in &diag.spectral_violations {
    println!("{:?} at {}ms — band ratio {:.2}", v.kind, v.time.whole_milliseconds(), v.band_ratio);
}

// Dead air blocks
if let Some(worst) = diag.dead_air_violations.iter().map(|v| v.duration).max() {
    println!("Longest dead-air: {:.1}s", worst.whole_milliseconds() as f64 / 1000.0);
}
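
The dead-air check can be pictured as a search for the longest contiguous quiet run. The following sketch is a simplified, per-sample version written for illustration; the crate's own detector may work on windows rather than individual samples, and `longest_silence_secs` is a hypothetical name.

```rust
/// Longest contiguous run, in seconds, in which every sample magnitude stays
/// at or below `threshold_db` dBFS (0 dBFS = 32768).
fn longest_silence_secs(samples: &[i16], sample_rate: u32, threshold_db: f64) -> f64 {
    // Convert the dB threshold into a linear sample magnitude.
    let limit = 32_768.0 * 10f64.powf(threshold_db / 20.0);
    let mut longest = 0usize;
    let mut current = 0usize;
    for &s in samples {
        if f64::from(s).abs() <= limit {
            current += 1;
            longest = longest.max(current);
        } else {
            current = 0;
        }
    }
    longest as f64 / f64::from(sample_rate)
}
```

Comparing the returned value against a limit such as 10.0 seconds gives the same kind of pass/fail answer as the dead-air entries in the report, under the simplifications noted above.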

MP3 validation (after encoding)

let mp3_bytes: Vec<u8> = encoder_output();

let cbr = acx::check_cbr(&mp3_bytes);
assert!(cbr.is_cbr, "expected CBR at {} kbps", cbr.detected_bitrate_kbps.unwrap_or(0));

let id3 = acx::check_id3_tags(&mp3_bytes);
if !id3.complete {
    println!("missing ID3 fields: {:?}", id3.missing);
}
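
As background on what a CBR check involves, the sketch below walks MPEG-1 Layer III frame headers and reports a bitrate only when every frame agrees. It is a simplified illustration, not the crate's implementation: it skips a leading ID3v2 tag, ignores the ID3v2 footer flag and Xing/VBRI info frames, stops at the first non-frame data, and does not handle MPEG-2 or MPEG-2.5 streams. `id3v2_len` and `constant_bitrate_kbps` are hypothetical names.

```rust
/// Bitrates in kbps for MPEG-1 Layer III, indexed by the 4-bit header field.
/// Index 0 means "free format" and index 15 is invalid; both are rejected below.
const BITRATES_KBPS: [u32; 16] = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0];

/// Sample rates in Hz for MPEG-1, indexed by the 2-bit header field (3 is reserved).
const SAMPLE_RATES_HZ: [u32; 4] = [44_100, 48_000, 32_000, 0];

/// Size in bytes of a leading ID3v2 tag (10-byte header plus syncsafe body), or 0 if absent.
fn id3v2_len(data: &[u8]) -> usize {
    if data.len() >= 10 && data.starts_with(b"ID3") {
        let body = data[6..10]
            .iter()
            .fold(0usize, |acc, &b| (acc << 7) | usize::from(b & 0x7F));
        10 + body
    } else {
        0
    }
}

/// Some(kbps) when at least one frame is found and all frames share one bitrate.
fn constant_bitrate_kbps(data: &[u8]) -> Option<u32> {
    let mut pos = id3v2_len(data);
    let mut seen: Option<u32> = None;
    while pos + 4 <= data.len() {
        let h = &data[pos..pos + 4];
        // 11 sync bits, version MPEG-1 (0b11), layer III (0b01); protection bit ignored.
        if h[0] != 0xFF || (h[1] & 0xFE) != 0xFA {
            break; // not a frame header, for example a trailing ID3v1 tag
        }
        let kbps = BITRATES_KBPS[usize::from(h[2] >> 4)];
        let rate = SAMPLE_RATES_HZ[usize::from((h[2] >> 2) & 0x03)];
        if kbps == 0 || rate == 0 {
            break; // free-format, invalid, or reserved field
        }
        match seen {
            None => seen = Some(kbps),
            Some(first) if first != kbps => return None,
            Some(_) => {}
        }
        // Layer III frame length in bytes: 144 * bitrate / sample_rate + padding.
        let padding = usize::from((h[2] >> 1) & 0x01);
        pos += (144_000 * kbps / rate) as usize + padding;
    }
    seen
}
```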

Module overview

Module            Role
analyse           RMS, true-peak (4× oversampled), noise floor, full AcxReport
normalise         Linear gain to target RMS
limiter           5 ms lookahead brickwall limiter
gate              Sub-threshold window replacement + bookend padding
room_tone         Voss-McCartney 1/f pink noise generator
dc_offset         Mean-offset measurement and removal
click             Sub-10 ms transient spike suppression via cubic Hermite interpolation
denoise           Wiener spectral subtraction noise reduction
eq                Low-shelf and high-shelf biquad IIR warmth EQ
deess             OLA STFT de-esser (5–8 kHz sibilance reduction)
plosive           OLA STFT plosive suppressor (sub-150 Hz shelving)
multiband         3-band Linkwitz-Riley compressor (250 Hz / 3 kHz crossovers)
pause_norm        Inter-sentence pause normaliser (sentence / paragraph / scene-break)
breath            Breath window detection and room-tone replacement
crossfade         Equal-power chapter crossfade
consistency       Inter-segment RMS variance check
loudness_preset   LoudnessPreset → AcxConfig factory
spectral          FFT-based sibilance and plosive detection
temporal          Dead air, head/tail bookends, digital zero runs
bitstream         MP3 CBR frame header and ID3v2 tag validation
lufs              Integrated LUFS (ITU-R BS.1770-4) and loudness range (EBU R 128)
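
For readers curious about the room_tone entry, the Voss-McCartney method sums several random rows that refresh at halving rates, which approximates a 1/f spectrum. The sketch below is a generic illustration of that technique, not the crate's code: it uses a small xorshift64* generator in place of `rand` so it has no dependencies, and the row count, seed handling, and RMS scaling step are assumptions.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crate.
struct XorShift(u64);

impl XorShift {
    /// Uniform value in [-1.0, 1.0).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 11; // keep 53 bits
        bits as f64 / (1u64 << 53) as f64 * 2.0 - 1.0
    }
}

/// Number of random rows; more rows extend the 1/f slope to lower frequencies.
const ROWS: usize = 8;

/// Generate `len` samples of approximate pink noise scaled to `level_db` dBFS RMS.
fn pink_noise(len: usize, level_db: f64, seed: u64) -> Vec<i16> {
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut rows = [0.0_f64; ROWS];
    for r in rows.iter_mut() {
        *r = rng.next_f64();
    }
    let mut sum: f64 = rows.iter().sum();
    let mut raw = Vec::with_capacity(len);
    for n in 1..=len {
        // Row k is refreshed when the counter has exactly k trailing zero bits,
        // so each row changes half as often as the one before it.
        let k = (n.trailing_zeros() as usize).min(ROWS - 1);
        let fresh = rng.next_f64();
        sum += fresh - rows[k];
        rows[k] = fresh;
        raw.push(sum);
    }
    // Scale the whole buffer so its RMS matches the requested level.
    let rms = (raw.iter().map(|x| x * x).sum::<f64>() / len.max(1) as f64).sqrt();
    let gain = if rms > 0.0 {
        32_768.0 * 10f64.powf(level_db / 20.0) / rms
    } else {
        0.0
    };
    raw.iter()
        .map(|x| (x * gain).round().clamp(-32_768.0, 32_767.0) as i16)
        .collect()
}
```

A fixed seed makes the output reproducible, which can be handy in tests; a production generator would normally draw its seed from a proper random source.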

Audio format

All DSP functions operate on L16 mono PCM (raw i16 little-endian samples with no file header) at the sample rate passed by the caller. 24 000 Hz is a typical output rate for speech synthesis engines.

bytes_to_samples() / samples_to_bytes() convert between the Vec<u8> that TTS APIs return and the Vec<i16> the DSP functions use.
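
The conversion itself is straightforward. As an illustration of the byte layout only, the sketch below shows how little-endian 16-bit PCM could be decoded and re-encoded with the standard library; `le_bytes_to_i16` and `i16_to_le_bytes` are hypothetical names, and the crate's own functions may differ in signature and error type.

```rust
/// Decode little-endian 16-bit PCM bytes. An odd byte count cannot be a whole
/// number of samples, so it is reported as an error rather than truncated.
fn le_bytes_to_i16(bytes: &[u8]) -> Result<Vec<i16>, String> {
    if bytes.len() % 2 != 0 {
        return Err(format!("odd byte count: {}", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect())
}

/// Encode samples back into little-endian bytes (low byte first).
fn i16_to_le_bytes(samples: &[i16]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}
```

For example, the byte pair `0x00 0x80` decodes to −32768 and `0xFF 0x7F` decodes to 32767, the two extremes of the 16-bit range.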


Dependencies

  • rustfft — SIMD-accelerated FFT for spectral analysis, de-essing, plosive suppression, and noise reduction
  • rand — Voss-McCartney pink noise generation
  • thiserror — typed error enum
  • time — typed Duration values for temporal measurements