//! The shared VAD inference-result struct, ported from
//! [`mlx_audio.vad.models.silero_vad.silero_vad.VADOutput`][vad-output].
//!
//! Every VAD architecture mlx-audio ships exposes its
//! `Model.generate(audio, …)` result as this 3-field bundle: the speech
//! timestamps (start/end pairs), the per-frame speech probabilities, and
//! the inference sample rate. The struct is reproduced verbatim here so a
//! per-architecture VAD model (silero_vad, sortformer / diarization,
//! smart_turn endpoint, …) can return one [`VadOutput`] that the
//! downstream caller can consume uniformly (for example, the
//! [`VoicePipeline`-style consumer][sts-pipeline] that mlx-audio builds in
//! `sts/voice_pipeline.py`).
//!
//! [vad-output]: https://github.com/Blaizzy/mlx-audio/blob/main/mlx_audio/vad/models/silero_vad/silero_vad.py#L21-L25
//! [sts-pipeline]: https://github.com/Blaizzy/mlx-audio/blob/main/mlx_audio/sts/voice_pipeline.py
use crate::Array;
/// One speech segment in a [`VadOutput`] — the start / end pair mlx-audio
/// emits as `{"start": int, "end": int}` dictionaries
/// ([silero_vad.py:163-176][vad-segment]).
///
/// `start` and `end` are sample indices into the input waveform (the
/// `return_seconds=False` path; mlx-audio's `return_seconds=True` path
/// multiplies by `1/sample_rate` — that conversion is left to the
/// caller). `start < end` by construction; an empty / silent input yields
/// an empty `timestamps` vector.
///
/// [vad-segment]: https://github.com/Blaizzy/mlx-audio/blob/main/mlx_audio/vad/models/silero_vad/silero_vad.py#L163-L176
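// Hypothetical sketch, not the crate's actual definition: the field types
// (`usize`) and the `to_seconds` helper are assumptions made for
// illustration. The serde derives mentioned in these docs are left out so
// the sketch builds with the standard library alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SpeechSegment {
    /// Sample index where the segment starts.
    pub start: usize,
    /// Sample index where the segment ends.
    pub end: usize,
}

impl SpeechSegment {
    /// Caller-side counterpart of mlx-audio's `return_seconds=True` path:
    /// scales both sample indices by `1 / sample_rate`.
    pub fn to_seconds(&self, sample_rate: u32) -> (f64, f64) {
        let rate = f64::from(sample_rate);
        (self.start as f64 / rate, self.end as f64 / rate)
    }
}
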
/// The result of one VAD inference pass — port of
/// [`mlx_audio.vad.models.silero_vad.silero_vad.VADOutput`][vad-output].
///
/// A faithful 1:1 port of mlx-audio's 3-field dataclass:
///
/// - `timestamps: List[dict]` → [`VadOutput::timestamps`] as
///   `Vec<SpeechSegment>` (the `{"start": …, "end": …}` dictionaries are
///   spelled as a typed [`SpeechSegment`] here rather than free-form
///   maps; see the [`SpeechSegment`] docs).
/// - `probabilities: mx.array` → [`VadOutput::probabilities`] as an
///   [`Array`] (the same `(n_frames,)` shape mlx-audio's
///   `_predict_proba_array` returns).
/// - `sample_rate: int` → [`VadOutput::sample_rate`] as `u32` (the input
///   waveform's sample rate; matches mlx-audio's `int`).
///
/// [`Array`] is `!Send`, so this struct is `!Send` — matching every
/// other audio-domain struct in mlxrs (`crate::audio::stt`'s
/// `EncoderState`, the `crate::lm::generate::GenStep` envelope, …).
///
/// Serde lifecycle: only the typed [`VadOutput::timestamps`] and
/// [`VadOutput::sample_rate`] fields can be serialized (the [`Array`]
/// probabilities are a backend handle and cannot be serialized directly).
/// The [`SpeechSegment`] type ships full serde derives, so a caller that
/// only needs the timestamps (the common `VoicePipeline` consumer) can
/// round-trip them without touching the array.
///
/// [vad-output]: https://github.com/Blaizzy/mlx-audio/blob/main/mlx_audio/vad/models/silero_vad/silero_vad.py#L21-L25
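// Hypothetical sketch, not the crate's actual definition. The generic
// parameters are assumptions that keep the example standalone: `S` stands in
// for `SpeechSegment` and `P` for the backend `Array`. `into_serializable`
// is an assumed helper name, not part of the port.
#[derive(Debug)]
pub struct VadOutputSketch<S, P> {
    /// Speech segments as start/end pairs.
    pub timestamps: Vec<S>,
    /// Per-frame speech probabilities, shape `(n_frames,)`.
    pub probabilities: P,
    /// Sample rate of the input waveform, in Hz.
    pub sample_rate: u32,
}

impl<S, P> VadOutputSketch<S, P> {
    /// Drops the array handle and keeps the two fields that can be
    /// serialized, so a timestamps-only consumer never touches the array.
    pub fn into_serializable(self) -> (Vec<S>, u32) {
        (self.timestamps, self.sample_rate)
    }
}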