Expand description
§audio-analysis-transcription
Rust-native and compatibility transcription orchestration for video-analysis.
Native Candle Whisper ASR is available behind the candle feature. CUDA device
selection is available behind cuda, and local model bundle validation is
available behind model-bundles. Native path decoding defaults to direct
samples or readable WAV files. With the explicit audio-io feature, non-WAV
paths use FFmpeg-backed audio-analysis-io decode and then normalize/resample
through the same native 16 kHz mono boundary. This opt-in media decode is not
WhisperX parity and is not part of default tests.
Native Whisper tries model timestamp tokens automatically when the tokenizer
defines Whisper timestamp metadata. If timestamp decoding does not produce
bounded text segments, it falls back to chunk/window segment timing. Native
Whisper segment times are emitted as global transcript times. Timestamp-token
segments also include approximate projected word timings based on
character-weighted distribution inside each segment. wav2vec2/CTC alignment
can run through the same native transcription pipeline: alignment.enabled=true
with no bundle uses deterministic transcript timing alignment, while
alignment.enabled=true with a supported local wav2vec2 bundle uses Candle
wav2vec2 CTC alignment.
The external WhisperX command provider remains compatibility and parity tooling. It keeps Python-based execution explicit for callers that still need WhisperX decoding, batched ASR, alignment, or pyannote-backed diarization outside the default Rust path. It is also the current path for video/container inputs.
Native deterministic diarization is available behind diarization as a
heuristic spectral baseline, not a pyannote replacement or production speaker
recognition model. Diarization assignment runs after alignment when both
options are enabled, so the native diarization seam can use aligned word
timings, or segment timings when word timings are absent, as speech-span hints.
When no transcript timing is available it falls back to the energy-VAD baseline.
min_speakers and max_speakers are validated and reported in diagnostics;
they are not pyannote-style clustering constraints in the native baseline.
An ONNX speaker embedding provider is available only when explicitly configured
with diarization.speakerEmbeddingModelBundle and the crate is built with
diarization,onnx. It feeds the existing WindowedSpeakerDiarizer; heuristic
diarization remains the default. ONNX diagnostics include
diarizationRuntime=onnx, speakerEmbeddingProvider=onnx,
speakerEmbeddingDimension=N, and diarizationBaseline=false.
CTC alignment validates local wav2vec2 bundle files, config, tokenizer
vocabulary, and preprocessor metadata. Tokenizer discovery accepts
tokenizer.json and vocab.json; CTC blank resolution can use pad_token_id
when tokenizer metadata does not name a pad token. Supported local
Wav2Vec2ForCTC model.safetensors bundles execute through a private Candle
implementation and feed native CTC trellis/backtracking. Unsupported
architectures, stable-layer-norm configs, inconsistent positional-convolution
weight-norm tensors, or other unsupported safetensors layouts return typed
errors instead of falling back to deterministic timing. The Debug-only
audio.transcription.alignmentBundlePlan operation inspects local wav2vec2
bundle layout metadata without model inference and reports architecture,
stable-layer-norm status, positional-convolution layout, feature-extractor
norm, encoder layer count, missing tensor keys, and unsupported reasons.
Candle Whisper batch options are deterministic rather than concurrent in this
phase. max_batch_size=0 is rejected, chunk order and global timing are
preserved, and diagnostics report chunkCount, batchChunks, maxBatchSize,
batchCount, and batchExecution=candle-whisper-sequential. This is semantic
batch grouping, not throughput parity or tensor-batched model execution.
Transcript contracts, normalization, caption formatting, and WhisperX JSON
import remain owned by text-transcripts.
§Package Operations
audio.transcription.transcribe: run real transcription through the selected provider. Native Candle Whisper uses local WAV input or direct samples; the external WhisperX provider remains available for compatibility.audio.transcription.importWhisperX: import existing WhisperX JSON without running external tools.audio.transcription.providers: inspect available provider families.audio.transcription.plan: describe runtime setup without execution.audio.transcription.modelPlan: inspect ASR model requirements.audio.transcription.vadPlan: inspect deterministic VAD defaults.audio.transcription.alignmentPlan: inspect CTC alignment requirements.audio.transcription.alignmentBundlePlan: inspect local wav2vec2 bundle readiness without model inference.audio.transcription.decodePlan: explain source decode routing without opening files or running FFmpeg.audio.transcription.diarizationPlan: explain heuristic diarization status, assignment policies, and future model-backed provider directions.describe: inspect package metadata.
§Setup
For native Candle Whisper execution, provide a local model bundle containing:
config.jsongeneration_config.jsontokenizer.jsonpreprocessor_config.jsonmodel.safetensors
For native wav2vec2 CTC alignment, provide a local Wav2Vec2ForCTC bundle
containing:
config.jsontokenizer.jsonorvocab.jsonpreprocessor_config.jsonmodel.safetensors
The real wav2vec2 smoke prints the alignment bundle layout report before
execution. Stable-layer-norm bundles remain unsupported_runtime until that
architecture path is implemented.
2026-06-10 validation update: scripts/sync_model_bundles.sh provisioned
facebook/wav2vec2-base-960h under the ignored smoke model root. The ignored
real-bundle alignment smoke passed with vocab.json tokenizer discovery and
reported architecture="Wav2Vec2ForCTC",
do_stable_layer_norm=false, positional_conv_layout="weight-norm",
feature_extractor_norm="group", encoder_layer_count=12, no missing keys,
and no unsupported layout reasons. Native positional convolution reconstruction
now supports the observed per-kernel weight-norm weight_g layout used by this
bundle.
Use candle,model-bundles for CPU local smoke tests and add cuda for
CUDA-backed Whisper smoke tests. No runtime downloads are performed by this
crate.
Use audio-io only when native non-WAV media/container decode is explicitly
needed and local FFmpeg support is available:
RUN_NATIVE_MEDIA_DECODE_TESTS=1 \
TRANSCRIPTION_MEDIA_PATH=/path/to/video-or-audio-container \
cargo test -p moritzbrantner-audio-analysis-transcription \
--features audio-io \
native_media_decode_when_requested -- --ignored --nocaptureUse diarization,onnx only when caller-owned local ONNX speaker embeddings are
explicitly configured. Add model-bundles when the path is a model-runtime
bundle manifest:
{
"diarization": {
"enabled": true,
"speakerEmbeddingModelBundle": "/path/to/onnx-speaker-model",
"speakerEmbeddingDimension": 192,
"speakerEmbeddingSampleRate": 16000
}
}The ONNX path accepts a direct .onnx file, a directory containing
model.onnx, or a model-runtime manifest when model-bundles is enabled.
It does not resample in this tranche; request audio must already match the
configured speaker embedding sample rate. Speaker model input shape is detected
from ONNX metadata: waveform inputs use [B,S] or [B,1,S], while feature
inputs such as [B,T,80] use the shared speaker fbank/log-mel CPU
preprocessor.
Ignored local transcription ONNX diarization smoke:
RUN_NATIVE_SPEAKER_MODEL_TESTS=1 \
ORT_DYLIB_PATH="$PWD/.audio-tools/whisperx-venv/lib/python3.11/site-packages/onnxruntime/capi/libonnxruntime.so.1.26.0" \
SPEAKER_EMBEDDING_MODEL_BUNDLE=/path/to/onnx-speaker-model \
SPEAKER_EMBEDDING_MODEL_FILE=model.onnx \
DIARIZATION_AUDIO_PATH=/path/to/meeting-16khz.wav \
SPEAKER_EMBEDDING_DIMENSION=192 \
cargo test -p moritzbrantner-audio-analysis-transcription \
--features diarization,onnx,model-bundles \
native_onnx_diarization_smoke_when_requested -- --ignored --nocaptureThe smoke uses direct samples and mock ASR timing, so it validates only the native transcription pipeline’s explicit ONNX diarization path. It does not use FFmpeg, Python, CUDA, pyannote, Hugging Face auth, network, or downloaded model files.
2026-06-10 validation: the checked-in WebM fixture decoded successfully with
audio-io after the ignored smoke harness resolved workspace-root-relative
fixture paths. After local ONNX bundle provisioning, the ONNX diarization smoke
used direct-sample setup and mock ASR timing. Static ONNX metadata inspection
succeeded for the selected current wespeaker-voxceleb-resnet34-LM artifact:
f32 input feats [B,T,80] and f32 output embs [B,256]. Static graph
diagnostics reported default-domain ONNX ops only, opset 14, 110 nodes, and
75 initializers. The speaker adapter supports this feature-input category.
With ORT_DYLIB_PATH unset, file and memory diagnostic load modes timed out
via external timeout 120s exit 124 after onnxSessionBuilder=begin and
before onnxSessionBuilder=ok. Python ONNX Runtime 1.26.0 loaded the same
artifact. With ORT_DYLIB_PATH pointed at the local .audio-tools/whisperx-venv
libonnxruntime.so.1.26.0, the transcription ONNX diarization smoke passed:
direct samples and mock ASR timing reached diarizationRuntime=onnx,
speakerEmbeddingProvider=onnx, speakerEmbeddingDimension=256,
diarizationBaseline=false, and non-empty speaker assignments. ONNX
diarization remains window-by-window embedding, not throughput parity.
Classification: implicit ORT dynamic-library selection/setup blocker on this
host, fixed for local smokes by an explicit compatible dylib.
On the current RTX 3060 Ti smoke host, /usr/local/cuda points at CUDA 13.3
while the passing smoke uses a local CUDA 12.3 library shim at
$SMOKE_ROOT/cuda12-libs. The local smoke
bundle and fixture used there are:
$SMOKE_ROOT/whisper-tiny$SMOKE_ROOT/audio/native-transcription-smoke.wav
Install and configure external compatibility tools outside the default test flow only when using the WhisperX provider:
whisperx --help
ffmpeg -version
python -c 'import whisperx'WhisperX diarization requires a Hugging Face token accepted by pyannote:
export HF_TOKEN=...No default build or test downloads models, requires CUDA, or requires network access. Default tests also do not require Python, WhisperX, Hugging Face tokens, external model files, or FFmpeg.
Optional external WhisperX parity can be run manually when local tools and media are configured:
RUN_WHISPERX_PARITY_TESTS=1 \
WHISPERX_COMMAND=whisperx \
WHISPERX_AUDIO_PATH=/path/to/audio.wav \
cargo test -p moritzbrantner-audio-analysis-transcription external_whisperx_parity_when_requested -- --ignored --nocaptureSet WHISPERX_EXPECTED_JSON=/path/to/expected.json to compare command output
against a known WhisperX JSON fixture. Optional parity-only overrides are
WHISPERX_MODEL, WHISPERX_LANGUAGE, WHISPERX_DEVICE, and
WHISPERX_COMPUTE_TYPE. Set WHISPERX_DIARIZE=1 only when HF_TOKEN is
available.
2026-06-10 validation update: the broken global console entry point was bypassed
with an ignored local venv at .audio-tools/whisperx-venv. Its
bin/whisperx --help command worked, and non-diarization parity passed with
WHISPERX_MODEL=tiny.en, WHISPERX_LANGUAGE=en, WHISPERX_DEVICE=cpu, and
WHISPERX_COMPUTE_TYPE=int8. Token-gated diarization parity was not run because
HF_TOKEN was absent, so pyannote-backed diarization parity remains incomplete.
Token-gated continuation command shape:
test -n "$HF_TOKEN"
RUN_WHISPERX_PARITY_TESTS=1 \
WHISPERX_COMMAND="$PWD/.audio-tools/whisperx-venv/bin/whisperx" \
WHISPERX_MODEL="tiny.en" \
WHISPERX_LANGUAGE="en" \
WHISPERX_DEVICE="cpu" \
WHISPERX_COMPUTE_TYPE="int8" \
WHISPERX_DIARIZE=1 \
WHISPERX_AUDIO_PATH="$SMOKE_ROOT/audio/native-transcription-smoke.wav" \
HF_TOKEN="$HF_TOKEN" \
cargo test --test audio_transcription_native_contracts \
external_whisperx_parity_when_requested -- --ignored --nocapture2026-06-10 token-gated result: setup/auth failure. WhisperX reached pyannote
diarization, but Hugging Face rejected access to
pyannote/speaker-diarization-community-1. Initial non-secret error prefix:
huggingface_hub.errors.GatedRepoError: 403 Client Error; after refreshing the
token, the rerun failed with
huggingface_hub.errors.HfHubHTTPError: 403 Forbidden until the fine-grained
token allowed public gated repositories and access to the pyannote model.
2026-06-11 token-gated result: pass. WhisperX diarization parity completed with
WHISPERX_DIARIZE=1, .audio-tools/whisperx-venv/bin/whisperx, tiny.en,
CPU, int8, and HF_TOKEN="$HF_TOKEN". Pyannote/Hugging Face access was valid
for this run. No Rust parser or contract bug was exposed, and no token value
was documented.
Modules§
- surface
- Library-owned runtime surface for
audio-analysis-transcription.
Structs§
- Aligned
Char - One aligned character timing.
- Aligned
Word - One aligned word timing.
- Alignment
Options - Forced-alignment options.
- Alignment
Request - Forced-alignment request.
- Alignment
Response - Forced-alignment response.
- Alignment
Summary - Alignment summary included in pipeline responses.
- AsrRequest
- ASR provider request.
- AsrResponse
- ASR provider response.
- Candle
Whisper Options - Options for the Candle Whisper native provider.
- Candle
Whisper Transcriber - Feature-gated Candle Whisper provider.
- CtcForced
Aligner - Feature-gated CTC forced aligner.
- Diarization
Options - Diarization options.
- Energy
VadTranscription Provider - Default pure-Rust energy VAD provider.
- Loaded
Audio - Loaded audio for native providers.
- Speaker
Diarization Response - Speaker
Segment Prediction - Speech
Activity Segment - A speech activity span used for ASR chunking.
- Transcription
Artifact - Artifact produced or discovered by a provider.
- Transcription
Output Options - Output preferences for transcription.
- Transcription
Pipeline Request - Request for an audio/video-to-text transcription pipeline.
- Transcription
Pipeline Response - Response from a transcription pipeline.
- Transcription
Provider Plan - Metadata-only provider plan.
- VadOptions
- VAD options used before ASR chunking.
- VadRequest
- VAD request.
- VadResponse
- VAD response.
- Whisper
CppProvider Options - Options for native whisper.cpp compatibility.
- Whisper
CppTranscriber - Native whisper.cpp compatibility provider.
- WhisperX
Command Options - Options for the external WhisperX command provider.
- WhisperX
Command Transcriber - External command provider for Python WhisperX.
Enums§
- Alignment
Interpolation Method - Timestamp interpolation behavior for missing alignment spans.
- Native
Device Preference - Native execution device preference.
- Speaker
Assignment Policy - Speaker assignment policy for transcript words and segments.
- Transcription
Provider Selection - Provider selection.
- Transcription
Source - Source accepted by transcription providers.
- WhisperX
Device - WhisperX device selection.
Traits§
- Audio
Transcription Provider - Trait for audio transcription providers.
- Forced
Alignment Provider - Trait for forced alignment providers.
- Transcript
Diarization Provider - Trait for transcript diarization and speaker assignment providers.
- Transcription
VadProvider - Trait for transcription VAD providers.
Functions§
- candle_
whisper_ provider_ plan - Returns the primary Candle Whisper provider plan.
- import_
whisperx_ json - Parses existing WhisperX JSON without running external tools.
- run_
transcription_ pipeline - Runs the provider-agnostic native transcription pipeline.
- transcribe
- Runs a transcription request with the selected primary provider.
- transcription_
provider_ plans - Returns provider plans.
- whisper_
cpp_ provider_ plan - Returns the whisper.cpp compatibility provider plan.
- whisperx_
provider_ plan - Returns the external WhisperX provider plan.
Type Aliases§
- WhisperX
Options - Backward-compatible name for the external WhisperX provider options.