Phoneme timing extraction from ONNX model duration output.
VITS models optionally output a durations tensor of shape [1, phoneme_length]
giving the number of frames each phoneme occupies, where one frame spans hop_length audio samples.
This module converts frame counts to millisecond timestamps.
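The frame-to-millisecond conversion can be sketched as follows. This is a minimal illustration, not this module's actual implementation: `PhonemeSpan` and `frames_to_spans` are hypothetical names, and the sketch assumes the caller supplies the hop length and sample rate and that phonemes are laid out back to back with no gaps. One frame lasts `hop_length * 1000 / sample_rate` milliseconds, and each phoneme starts where the previous one ended.

```rust
/// Hypothetical per-phoneme timing record (an assumption for this sketch,
/// not the struct exported by this module).
#[derive(Debug, Clone, PartialEq)]
struct PhonemeSpan {
    start_ms: f64,
    end_ms: f64,
}

/// Convert per-phoneme frame counts into cumulative millisecond spans.
///
/// `durations` is the flattened [1, phoneme_length] tensor; `hop_length`
/// and `sample_rate` are assumed to be known for the model in use.
fn frames_to_spans(durations: &[f32], hop_length: u32, sample_rate: u32) -> Vec<PhonemeSpan> {
    // Duration of a single frame in milliseconds.
    let ms_per_frame = f64::from(hop_length) * 1000.0 / f64::from(sample_rate);

    let mut spans = Vec::with_capacity(durations.len());
    let mut cursor_frames = 0.0_f64; // running total of frames consumed so far

    for &frames in durations {
        // Treat negative (or NaN) frame counts as zero so timestamps never go backwards.
        let frames = f64::from(frames).max(0.0);
        let start_ms = cursor_frames * ms_per_frame;
        cursor_frames += frames;
        spans.push(PhonemeSpan {
            start_ms,
            end_ms: cursor_frames * ms_per_frame,
        });
    }
    spans
}
```

For example, with a hop length of 256 and a sample rate of 16000 Hz (both chosen only for illustration), each frame lasts 16 ms, so frame counts of 2 and 3 give spans of 0 to 32 ms and 32 to 80 ms.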
Structs§
- PhonemeTimingInfo - Timing information for a single phoneme
- TimingResult - Complete timing result for a synthesized utterance
Constants§
- DEFAULT_HOP_LENGTH - Default hop length for VITS models
Functions§
- durations_to_timing - Convert duration tensor output to timing information.