audiobook-creation-exchange

Pure-Rust DSP crate that post-processes AI-generated speech audio to meet ACX (Audiobook Creation Exchange) submission requirements.

Standards enforced (defaults)

Metric	Target	Notes
RMS	−20.5 dBFS (−23 … −18 window)	ACX spec
True-peak	≤ −3 dBFS	ACX spec
Noise floor	≤ −60 dBFS	ACX spec
Integrated LUFS	measured	ITU-R BS.1770-4
Loudness Range	measured	EBU R 128
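
To make the units in the table concrete, here is a minimal sketch of how RMS and sample peak can be expressed in dBFS for 16-bit PCM, and how the three limits combine. It is an illustration only, not the crate's `analyse` module: the function names are hypothetical, full scale is assumed to be 32768, and the peak here is a plain sample peak rather than the 4× oversampled true peak the crate reports.

```rust
/// Full-scale reference for 16-bit PCM (assumption: dBFS is relative to 32768).
const FULL_SCALE: f64 = 32_768.0;

/// RMS level of a block of i16 samples in dBFS.
/// Returns negative infinity for empty or all-zero input.
fn rms_dbfs(samples: &[i16]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let mean_square = samples
        .iter()
        .map(|&s| {
            let x = f64::from(s) / FULL_SCALE;
            x * x
        })
        .sum::<f64>()
        / samples.len() as f64;
    10.0 * mean_square.log10()
}

/// Sample peak in dBFS (no oversampling, so inter-sample peaks are not seen).
fn sample_peak_dbfs(samples: &[i16]) -> f64 {
    let peak = samples
        .iter()
        .map(|&s| f64::from(s).abs())
        .fold(0.0, f64::max);
    20.0 * (peak / FULL_SCALE).log10()
}

/// Checks pre-measured levels against the default limits in the table above.
fn within_acx_limits(rms_db: f64, peak_db: f64, noise_floor_db: f64) -> bool {
    (-23.0..=-18.0).contains(&rms_db) && peak_db <= -3.0 && noise_floor_db <= -60.0
}
```

A constant signal at one tenth of full scale measures close to −20 dBFS with both functions, which is a quick sanity check for the scaling.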

Processing pipeline

Applied in order by process():

Click suppression — removes sub-10 ms transient spikes via cubic Hermite interpolation before any spectral processing.
DC offset removal — subtracts the mean sample value; prevents clicks at chapter edit points.
Noise reduction — Wiener spectral subtraction profiled from the leading silence; disabled by default (enable via AcxConfig::denoise_enabled).
EQ warmth — low-shelf (+2 dB @ 180 Hz) and high-shelf (+1.5 dB @ 5 kHz) biquad IIR shelving filters.
De-essing — OLA STFT; reduces the 5–8 kHz sibilance band by up to 6 dB when its energy ratio exceeds the threshold.
Plosive suppression — same OLA STFT approach; attenuates sub-150 Hz bins by 6 dB in windows with excessive plosive energy.
Multiband compression — 3-band Linkwitz-Riley 4th-order crossover compressor; controls dynamics per band before normalisation.
Normalise — linear gain targeting −20.5 dBFS, pre-compensated for the energy dilution from the 1 s head + 3 s tail room-tone bookends.
Brickwall limiter — 5 ms lookahead; no sample (including 4× interpolated inter-sample peaks) exceeds −3 dBFS.
Breath removal — replaces breath-band windows with room tone; disabled by default.
Pause normalisation — caps over-long inter-sentence pauses to natural targets (sentence 120 ms, paragraph 400 ms, scene break 700 ms).
Noise gate → room tone — sub-threshold 50 ms windows replaced with synthesised 1/f pink noise at −62 dBFS.
Bookend padding — first 1 s and last 3 s forced to room tone with a 10 ms crossfade at each boundary.
Compliance verify — second analysis pass; returns Err(AcxError::StillNonCompliant) only when audio cannot be brought into compliance.
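
The pre-compensation in the Normalise step can be reasoned about with a little arithmetic. If the speech occupies N_s samples and the bookends contribute N_p samples of room tone, the mean square of the finished file is the length-weighted average of the two levels. The sketch below solves that relation for the level the speech must reach, and includes the DC offset step for context. It is an illustration under assumptions, not the crate's code: samples are `f64` in [−1, 1], `speech` is the part of the signal that remains speech after padding, and all function names are hypothetical.

```rust
/// Total bookend length in seconds (1 s head + 3 s tail, as listed above).
const BOOKEND_SECS: f64 = 4.0;

/// Converts a level in dB to a mean-square (power) value.
fn db_to_power(db: f64) -> f64 {
    10f64.powf(db / 10.0)
}

/// Mean of the squared samples; 0.0 for empty input.
fn mean_square(samples: &[f64]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.iter().map(|x| x * x).sum::<f64>() / samples.len() as f64
}

/// Subtracts the mean sample value from every sample.
fn remove_dc_offset(samples: &mut [f64]) {
    if samples.is_empty() {
        return;
    }
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    for s in samples.iter_mut() {
        *s -= mean;
    }
}

/// Linear gain for `speech` such that speech plus room-tone bookends
/// averages out to `target_db` RMS over the whole file.
fn precompensated_gain(speech: &[f64], sample_rate: u32, target_db: f64, room_tone_db: f64) -> f64 {
    let current = mean_square(speech);
    if current == 0.0 {
        return 1.0; // nothing to scale
    }
    let n_speech = speech.len() as f64;
    let n_pad = BOOKEND_SECS * f64::from(sample_rate);
    // target * (N_s + N_p) = needed * N_s + room_tone * N_p, solved for `needed`.
    let needed = (db_to_power(target_db) * (n_speech + n_pad)
        - db_to_power(room_tone_db) * n_pad)
        / n_speech;
    (needed.max(0.0) / current).sqrt()
}
```

Because the bookends are far quieter than the speech, the speech has to be normalised slightly hotter than the target, and the correction grows as the clip gets shorter relative to the 4 s of padding.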

Quick start

use audiobook_creation_exchange as acx;

let raw_pcm: Vec<u8> = tts_engine_output(); // L16-LE mono bytes from your TTS engine

// Optional read-only diagnostic pass
let diag = acx::validate(&raw_pcm, 24_000)?;
if diag.has_dc_offset {
    println!("DC offset: {:.4}", diag.dc_offset);
}

// Full pipeline
let processed = acx::process(&raw_pcm, 24_000)?;

LoudnessPreset config factory

Map a delivery style to a tuned AcxConfig. Each preset's RMS window overlaps the ACX window (−23 … −18 dBFS); only Loud extends beyond it, up to −17 dBFS:

use audiobook_creation_exchange as acx;
use audiobook_creation_exchange::LoudnessPreset;

// Intimate, close-mic narration (−22 dBFS target)
let cfg = LoudnessPreset::Whispered.config();
let pcm = acx::process_with_config(&raw_pcm, 24_000, &cfg)?;

// Forward-placed, authoritative narration (−20 dBFS target)
let cfg = LoudnessPreset::Projected.config();
let pcm = acx::process_with_config(&raw_pcm, 24_000, &cfg)?;

Variant	RMS target	Window
`Whispered`	−22 dBFS	−23 … −21
`Soft`	−21 dBFS	−23 … −19
`Standard`	−20.5 dBFS	−23 … −18 (default)
`Projected`	−20 dBFS	−22 … −18
`Loud`	−19 dBFS	−21 … −17

Batch loudness consistency

Check that all segments in a series have matching loudness:

use audiobook_creation_exchange::{bytes_to_samples, consistency_check};
use audiobook_creation_exchange::consistency::DEFAULT_TOLERANCE_DB;

let segments: Vec<Vec<i16>> = chapters
    .iter()
    .map(|b| bytes_to_samples(b).unwrap())
    .collect();

let refs: Vec<&[i16]> = segments.iter().map(Vec::as_slice).collect();
let report = consistency_check(&refs, DEFAULT_TOLERANCE_DB);

if !report.compliant {
    eprintln!(
        "Loudness spread {:.1} dB exceeds {:.1} dB tolerance",
        report.max_deviation_db, DEFAULT_TOLERANCE_DB
    );
    for (i, rms) in report.episode_rms_db.iter().enumerate() {
        eprintln!("  chapter {}: {:.1} dBFS", i + 1, rms);
    }
}

Chapter crossfade

Seamless equal-power transition between two consecutive chapters:

use audiobook_creation_exchange::crossfade;

let joined = crossfade(&chapter_a, &chapter_b, 80 /* ms */, 24_000);
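
For readers unfamiliar with the term, an equal-power crossfade uses cosine and sine gain curves so that the squared gains always sum to one. The following is a minimal sketch of that idea, not the crate's `crossfade` implementation: the function name is hypothetical, and it assumes the fade overlaps the tail of the first chapter with the head of the second and is shortened when either input is shorter than the requested fade.

```rust
use std::f64::consts::FRAC_PI_2;

/// Joins `a` and `b`, overlapping the last `fade_ms` of `a` with the
/// first `fade_ms` of `b` using cos/sin (equal-power) gain curves.
fn equal_power_crossfade(a: &[i16], b: &[i16], fade_ms: u32, sample_rate: u32) -> Vec<i16> {
    let wanted = (u64::from(fade_ms) * u64::from(sample_rate) / 1000) as usize;
    // Never fade over more samples than either input provides.
    let n = wanted.min(a.len()).min(b.len());

    let mut out = Vec::with_capacity(a.len() + b.len() - n);
    out.extend_from_slice(&a[..a.len() - n]);

    for i in 0..n {
        // Position within the fade, sampled at the centre of each step.
        let t = (i as f64 + 0.5) / n as f64;
        let fade_out = (t * FRAC_PI_2).cos();
        let fade_in = (t * FRAC_PI_2).sin();
        let mixed = f64::from(a[a.len() - n + i]) * fade_out + f64::from(b[i]) * fade_in;
        // Clamp before converting back so the sum cannot wrap around.
        out.push(mixed.round().clamp(f64::from(i16::MIN), f64::from(i16::MAX)) as i16);
    }

    out.extend_from_slice(&b[n..]);
    out
}
```

The output is shorter than the two inputs combined by the length of the overlap. Equal-power curves keep perceived loudness steady for unrelated material; for two identical signals the sum peaks about 3 dB higher at the midpoint, which is why the result is clamped.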

Custom config

Every pipeline stage can be individually toggled or tuned via AcxConfig:

use audiobook_creation_exchange as acx;
use audiobook_creation_exchange::AcxConfig;
use time::Duration;

let cfg = AcxConfig {
    rms_target_db: -21.5,
    rms_min_db: -22.0,
    rms_max_db: -21.0,
    peak_ceiling_db: -3.0,
    noise_floor_max_db: -60.0,
    silence_threshold_db: -65.0,
    room_tone_db: -62.0,
    dead_air_limit: Duration::seconds(10),
    sibilance_ratio_threshold: 0.55,
    plosive_ratio_threshold: 0.35,
    // click suppression
    click_suppression_enabled: true,
    // noise reduction (requires leading silence for profiling)
    denoise_enabled: false,
    denoise_profile_ms: 200,
    denoise_oversubtraction: 1.5,
    denoise_spectral_floor: 0.1,
    // warmth EQ
    eq_enabled: true,
    eq_low_shelf_db: 2.0,
    eq_high_shelf_db: 1.5,
    // de-essing
    deess_enabled: true,
    deess_threshold_ratio: 0.45,
    deess_max_reduction_db: 6.0,
    // plosive suppression
    plosive_suppression_enabled: true,
    plosive_attenuation_db: 6.0,
    // multiband compression
    multiband_enabled: true,
    // breath removal
    breath_removal_enabled: false,
    // pause normalisation
    pause_norm_enabled: true,
    pause_sentence_target_ms: 120,
    pause_paragraph_target_ms: 400,
    pause_scene_target_ms: 700,
};
let processed = acx::process_with_config(&raw_pcm, 24_000, &cfg)?;

Enabling noise reduction

denoise_enabled is false by default because the Wiener subtraction profiles noise from the first denoise_profile_ms milliseconds of audio. If the signal starts with speech rather than room tone, those speech frames are treated as noise and attenuated. Enable it only when your pipeline guarantees a silent lead-in:

let cfg = AcxConfig {
    denoise_enabled: true,
    denoise_profile_ms: 200, // profile the first 200 ms as noise
    ..AcxConfig::default()
};
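
One way to guard against the failure mode described above is to measure the lead-in before turning denoising on. The helper below is a hypothetical pre-check, not part of the crate: it assumes 16-bit samples with full scale at 32768, and the threshold is a value you choose yourself.

```rust
/// Returns true when the first `profile_ms` of `samples` has an RMS level
/// at or below `threshold_db` (dBFS). Returns false when there is not
/// enough audio to cover the profiling window.
fn has_silent_lead_in(samples: &[i16], sample_rate: u32, profile_ms: u32, threshold_db: f64) -> bool {
    let n = (u64::from(sample_rate) * u64::from(profile_ms) / 1000) as usize;
    if n == 0 || samples.len() < n {
        return false;
    }
    let mean_square = samples[..n]
        .iter()
        .map(|&s| {
            let x = f64::from(s) / 32_768.0;
            x * x
        })
        .sum::<f64>()
        / n as f64;
    // All-zero input gives negative infinity, which counts as silent.
    10.0 * mean_square.log10() <= threshold_db
}
```

The result can then decide the value of denoise_enabled, with profile_ms matching denoise_profile_ms, so that clips starting directly on speech fall back to the default.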

Diagnostic API

validate() is read-only — it never modifies audio. The returned DiagnosticReport covers:

ACX core — RMS, true-peak (4× oversampled), noise floor, overall compliance flag
DC offset — mean deviation as fraction of full scale
Spectral — FFT sibilance (4–10 kHz) and plosive (20–150 Hz) violations with timestamps
Temporal — dead air violations (> 10 s contiguous silence), head/tail bookend status, digital zero run count
LUFS/LRA — integrated loudness per ITU-R BS.1770-4 and loudness range per EBU R 128

let diag = acx::validate(&raw_pcm, 24_000)?;

// Spectral violations with timestamps (time::Duration)
for v in &diag.spectral_violations {
    println!("{:?} at {}ms — band ratio {:.2}", v.kind, v.time.whole_milliseconds(), v.band_ratio);
}

// Dead air blocks
if let Some(worst) = diag.dead_air_violations.iter().map(|v| v.duration).max() {
    println!("Longest dead-air: {:.1}s", worst.whole_milliseconds() as f64 / 1000.0);
}
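
The temporal checks are simple to reason about in isolation. As a rough sketch of what "digital zero runs" and "dead air" mean, and not the crate's `temporal` module, the two hypothetical functions below count runs of exactly-zero samples and find the longest stretch of quiet windows. The window length and threshold are assumed parameters.

```rust
/// Counts runs of at least `min_len` consecutive exactly-zero samples.
/// A `min_len` of 0 yields 0.
fn count_digital_zero_runs(samples: &[i16], min_len: usize) -> usize {
    let mut runs = 0;
    let mut current = 0usize;
    for &s in samples {
        if s == 0 {
            current += 1;
            // Count each run once, at the moment it reaches the minimum length.
            if current == min_len {
                runs += 1;
            }
        } else {
            current = 0;
        }
    }
    runs
}

/// Length in seconds of the longest stretch of consecutive windows whose
/// RMS level is below `threshold_db` (dBFS, full scale assumed at 32768).
fn longest_quiet_stretch_secs(samples: &[i16], sample_rate: u32, window_ms: u32, threshold_db: f64) -> f64 {
    let win = (u64::from(sample_rate) * u64::from(window_ms) / 1000) as usize;
    if win == 0 {
        return 0.0;
    }
    let (mut longest, mut current) = (0usize, 0usize);
    for chunk in samples.chunks(win) {
        let mean_square = chunk
            .iter()
            .map(|&s| {
                let x = f64::from(s) / 32_768.0;
                x * x
            })
            .sum::<f64>()
            / chunk.len() as f64;
        if 10.0 * mean_square.log10() < threshold_db {
            current += chunk.len();
            longest = longest.max(current);
        } else {
            current = 0;
        }
    }
    longest as f64 / f64::from(sample_rate)
}
```

Comparing the second function's result against a 10 s limit mirrors the dead air rule listed above, within the resolution of the chosen window.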

MP3 validation (after encoding)

let mp3_bytes: Vec<u8> = encoder_output();

let cbr = acx::check_cbr(&mp3_bytes);
assert!(cbr.is_cbr, "expected CBR; detected bitrate: {:?} kbps", cbr.detected_bitrate_kbps);

let id3 = acx::check_id3_tags(&mp3_bytes);
if !id3.complete {
    println!("missing ID3 fields: {:?}", id3.missing);
}
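
For background on what such checks read from the byte stream, the sketch below decodes two well-defined structures: the ID3v2 tag header (whose size field is a 28-bit "synchsafe" integer) and the bitrate field of an MPEG-1 Layer III frame header. It is an illustration with hypothetical function names, not the crate's `bitstream` module, and it does not walk the frames, which a full CBR check would need to do.

```rust
/// MPEG-1 Layer III bitrates in kbps, indexed by the 4-bit bitrate field.
/// Index 0 means "free format"; index 15 is invalid and not in the table.
const MPEG1_L3_KBPS: [u16; 15] = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320];

/// Total length in bytes of a leading ID3v2 tag, or None if absent or malformed.
fn id3v2_tag_len(data: &[u8]) -> Option<usize> {
    if data.len() < 10 || !data.starts_with(b"ID3") {
        return None;
    }
    // Each size byte carries 7 bits; the top bit must be clear.
    if data[6..10].iter().any(|&b| (b & 0x80) != 0) {
        return None;
    }
    let size = data[6..10]
        .iter()
        .fold(0usize, |acc, &b| (acc << 7) | usize::from(b));
    // Flag bit 0x10 signals a 10-byte footer (ID3v2.4).
    let footer = if (data[5] & 0x10) != 0 { 10 } else { 0 };
    Some(10 + size + footer)
}

/// Bitrate in kbps encoded in an MPEG-1 Layer III frame header, if valid.
fn mpeg1_l3_bitrate_kbps(header: [u8; 4]) -> Option<u16> {
    let sync_ok = header[0] == 0xFF && (header[1] & 0xE0) == 0xE0;
    let is_mpeg1 = ((header[1] >> 3) & 0b11) == 0b11;
    let is_layer3 = ((header[1] >> 1) & 0b11) == 0b01;
    if !(sync_ok && is_mpeg1 && is_layer3) {
        return None;
    }
    match usize::from(header[2] >> 4) {
        0 | 15 => None, // free-format or invalid index
        idx => Some(MPEG1_L3_KBPS[idx]),
    }
}
```

A stream is constant bitrate when every frame header decodes to the same value; the first audio frame starts at the offset returned by `id3v2_tag_len`, or at 0 when no tag is present.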

Module overview

Module	Role
`analyse`	RMS, true-peak (4× oversampled), noise floor, full `AcxReport`
`normalise`	Linear gain to target RMS
`limiter`	5 ms lookahead brickwall limiter
`gate`	Sub-threshold window replacement + bookend padding
`room_tone`	Voss-McCartney 1/f pink noise generator
`dc_offset`	Mean-offset measurement and removal
`click`	Sub-10 ms transient spike suppression via cubic Hermite interpolation
`denoise`	Wiener spectral subtraction noise reduction
`eq`	Low-shelf and high-shelf biquad IIR warmth EQ
`deess`	OLA STFT de-esser (5–8 kHz sibilance reduction)
`plosive`	OLA STFT plosive suppressor (sub-150 Hz shelving)
`multiband`	3-band Linkwitz-Riley compressor (250 Hz / 3 kHz crossovers)
`pause_norm`	Inter-sentence pause normaliser (sentence / paragraph / scene-break)
`breath`	Breath window detection and room-tone replacement
`crossfade`	Equal-power chapter crossfade
`consistency`	Inter-segment RMS variance check
`loudness_preset`	`LoudnessPreset` → `AcxConfig` factory
`spectral`	FFT-based sibilance and plosive detection
`temporal`	Dead air, head/tail bookends, digital zero runs
`bitstream`	MP3 CBR frame header and ID3v2 tag validation
`lufs`	Integrated LUFS (ITU-R BS.1770-4) and loudness range (EBU R 128)

Audio format

All DSP functions operate on L16 mono PCM — raw i16 little-endian samples with no file header — at the sample rate passed by the caller. 24 000 Hz is the typical output rate for speech synthesis engines.

bytes_to_samples() / samples_to_bytes() convert between the Vec<u8> that TTS APIs return and the Vec<i16> the DSP functions use.
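
The conversion itself is just little-endian pairing of bytes. The sketch below shows the idea with standard-library calls only. The function names and the `Option` return type are hypothetical stand-ins chosen for the example; the crate's own bytes_to_samples() has its own signature and error type.

```rust
/// Interprets raw bytes as little-endian i16 samples.
/// Returns None when the byte count is odd, since a sample would be cut in half.
fn pcm_bytes_to_samples(bytes: &[u8]) -> Option<Vec<i16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}

/// Serialises i16 samples back to little-endian bytes.
fn pcm_samples_to_bytes(samples: &[i16]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}
```

Round-tripping a buffer through both functions returns the original bytes, which makes a convenient smoke test when wiring up a new TTS source.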

Dependencies

rustfft — SIMD-accelerated FFT for spectral analysis, de-essing, plosive suppression, and noise reduction
rand — Voss-McCartney pink noise generation
thiserror — typed error enum
time — typed Duration values for temporal measurements

audiobook-creation-exchange 0.1.0