//! The architecture-agnostic [`TtsModel`] seam for `mlxrs::audio::tts` — the
//! text-to-speech analogue of [`crate::audio::stt::model::Model`], mirroring
//! mlx-audio's TTS model surface (the per-model `Model.generate` shape every
//! `tts/models/*` architecture exposes — kokoro, csm, bark, qwen3-tts, …)
//! and mlx-audio-swift's [`SpeechGenerationModel`][swift-gen] protocol.
//!
//! Per the project's no-per-model-arch rule
//! (`project_no_per_model_arch_porting`), mlxrs ships **no
//! concrete TTS model implementations**: the per-model token decoder +
//! vocoder / codec (kokoro's istftnet decoder, csm's RVQ + mimi codec,
//! bark's coarse/fine transformers, …) live in user code on top of this
//! trait. [`TtsModel`] is the *shape* per-model code must conform to so the
//! [`crate::audio::tts::generate::tts_generate`] driver can synthesize from
//! any architecture uniformly — the same "trait + generic loop" seam the
//! [`crate::audio::stt`] STT loop and the [`crate::lm::generate`] LM loop
//! use.
//!
//! ## What the trait abstracts
//!
//! mlx-audio's TTS architectures differ wildly internally (autoregressive
//! token LMs + neural codecs vs. non-autoregressive duration-predictor +
//! iSTFT vocoders vs. diffusion decoders), but their **public synthesis
//! contract** is uniform: `text → list/stream of audio chunks`, each chunk a
//! span of `f32` PCM samples in `[-1, 1]` at the model's
//! [`TtsModel::sample_rate`]. mlx-audio emits one chunk per *text
//! segment* (kokoro's `split_pattern` split) and, under `stream=True`, one
//! chunk per *streaming interval* (the per-model `streaming_token_interval`
//! cadence). mlxrs mirrors that with a single
//! [`TtsModel::synthesize_segment`] hook that the
//! [`super::generate::tts_generate`] driver calls once per
//! [`super::generate::TtsSegment`].
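//!
//! As an illustration only, the "one hook call per segment, one chunk per
//! call" contract can be sketched with stand-in types. Every name below
//! (`SketchTts`, `Beep`, `drive`) is hypothetical and is not part of mlxrs;
//! the whitespace split and the `Vec<f32>` chunk type are assumptions
//! standing in for the real segmentation and array types.
//!
//! ```rust
//! /// Hypothetical stand-in for a per-segment synthesis seam (not the real trait).
//! trait SketchTts {
//!     /// Output sample rate in Hz.
//!     fn sample_rate(&self) -> u32;
//!     /// Synthesize one text segment into mono PCM samples in `[-1, 1]`.
//!     fn synthesize_segment(&self, segment: &str) -> Vec<f32>;
//! }
//!
//! /// Toy model: 10 ms of a 440 Hz sine per character of input.
//! struct Beep;
//!
//! impl SketchTts for Beep {
//!     fn sample_rate(&self) -> u32 {
//!         16_000
//!     }
//!
//!     fn synthesize_segment(&self, segment: &str) -> Vec<f32> {
//!         let rate = self.sample_rate();
//!         // 10 ms worth of samples for each character.
//!         let n = segment.chars().count() * (rate as usize / 100);
//!         let step = 2.0 * std::f32::consts::PI * 440.0 / rate as f32;
//!         // Amplitude 0.5 keeps every sample inside [-1, 1].
//!         (0..n).map(|i| 0.5 * (step * i as f32).sin()).collect()
//!     }
//! }
//!
//! /// Generic driver: split the text, call the hook once per segment,
//! /// and return one audio chunk per segment (assumed whitespace split).
//! fn drive<M: SketchTts>(model: &M, text: &str) -> Vec<Vec<f32>> {
//!     text.split_whitespace()
//!         .map(|segment| model.synthesize_segment(segment))
//!         .collect()
//! }
//! ```
//!
//! Under these assumptions, `drive(&Beep, "hello world")` returns two
//! chunks, one per segment, and the model is only ever borrowed as `&M`.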
//!
//! [swift-gen]: https://github.com/Blaizzy/mlx-audio-swift/blob/main/Sources/MLXAudioTTS/Generation.swift
use crate::;
use ;
/// A text-to-speech model: the architecture-agnostic seam every concrete TTS
/// architecture (kokoro, csm, bark, qwen3-tts, …) implements so
/// [`super::generate::tts_generate`] can synthesize from it uniformly.
///
/// Mirrors mlx-audio's per-model `Model.generate` shape and
/// mlx-audio-swift's [`SpeechGenerationModel`][swift-gen] protocol: the
/// per-model token decoder + vocoder is wired behind
/// [`TtsModel::synthesize_segment`]; the driver composes text segmentation,
/// audio-chunk assembly, and the streaming-chunk envelope around it.
///
/// - `&self` everywhere: weights are immutable after load, so TTS synthesis
///   never needs `&mut` on the model (matching mlx-audio's `nn.Module` for
///   inference, and the same `&self` choice
///   [`crate::audio::stt::model::Model`] makes). One model can back many
///   concurrent synthesis runs.
/// - [`TtsModel::synthesize_segment`] runs **once per text segment**: it is
///   the body of the `for segment_idx, … in enumerate(pipeline(text, …))`
///   loop in mlx-audio's per-model `generate` (kokoro `kokoro.py`, llama
///   `llama.py`). The driver handles splitting `text` into
///   [`TtsSegment`]s and assembling the per-segment outputs.
///
/// [swift-gen]: https://github.com/Blaizzy/mlx-audio-swift/blob/main/Sources/MLXAudioTTS/Generation.swift