# Dynamic Language Detection — Implementation Plan
> Supersedes `docs/dynamic_language.md` with concrete, step-by-step implementation details.
## Design Decision: Parallel Language Queue
The original doc proposes adding `Language` fields to `ExpandUnit` variants. We choose a **parallel language queue** instead — `ExpandUnit` stays unchanged, `TextExpand` tracks language in a parallel `VecDeque<Language>` and returns `(ExpandUnit, Language)` tuples. This means **zero changes to any `ExpandTask` implementation** (all 9 tasks across EN + VI).
---
## PR 1: Internal Plumbing (no behavior change, no new dependency)
All existing tests must pass identically. Single-language behavior is unchanged.
### 1.1 `text_expand.rs` — Struct & Constructor
```rust
pub struct TextExpand {
    tasks_by_lang: HashMap<Language, Vec<Box<dyn ExpandTask>>>,
    // Language detection (None = single-language, skip detection)
    detector: Option<Box<dyn LanguageDetector>>,
    current_language: Language,
    context_window: VecDeque<String>, // max 5 words
    // Parallel language tracking: `input_langs` mirrors `input_units` one-to-one;
    // `output_units` carries the language inside each tuple.
    input_units: VecDeque<ExpandUnit>,
    input_langs: VecDeque<Language>,
    output_units: VecDeque<(ExpandUnit, Language)>,
    // Tokenizer state (unchanged)
    buffer: String,
    buffer_is_number: bool,
}
```
> **Why `Option<Box<dyn LanguageDetector>>`?** `LanguageDetector` is a small internal trait (its signature is listed in section 3.2). Defining it in PR 1 lets the struct compile without `lingua`; PR 3 plugs in the real detector.
Constructors:
```rust
/// Single-language (backward compat). No detection overhead.
pub fn with_language(language: Language) -> Self {
    let mut tasks_by_lang = HashMap::new();
    tasks_by_lang.insert(language, get_tasks_for_language(language));
    Self {
        tasks_by_lang,
        detector: None,
        current_language: language,
        context_window: VecDeque::new(),
        input_units: VecDeque::new(),
        input_langs: VecDeque::new(),
        output_units: VecDeque::new(),
        buffer: String::new(),
        buffer_is_number: false,
    }
}

/// Multi-language with detection.
/// In PR 1 the detector is injected by the caller; section 3.5 shows the PR 3 form,
/// which drops this parameter and constructs `LinguaDetector` internally.
pub fn with_languages(
    languages: &[Language],
    default_language: Language,
    detector: Box<dyn LanguageDetector>,
) -> Self { ... }

/// Test-only: multi-language without detection (uses default for everything).
pub fn new(tasks: Vec<Box<dyn ExpandTask>>) -> Self { ... } // keep for tests
```
### 1.2 `push` / `finish` Signature Change
```rust
// Before:
pub fn push(&mut self, ch: char) -> Option<ExpandUnit>
pub fn finish(&mut self) -> Option<ExpandUnit>
// After:
pub fn push(&mut self, ch: char) -> Option<(ExpandUnit, Language)>
pub fn finish(&mut self) -> Option<(ExpandUnit, Language)>
```
### 1.3 `flush_buffer` — Language Assignment
```rust
fn flush_buffer(&mut self) {
    if self.buffer.is_empty() { return; }
    let content = std::mem::take(&mut self.buffer);
    let lang = if self.buffer_is_number {
        self.current_language // numbers inherit
    } else {
        self.detect_language(&content) // words get detected
    };
    let unit = if self.buffer_is_number {
        ExpandUnit::Number(content)
    } else {
        ExpandUnit::Word(content)
    };
    self.input_units.push_back(unit);
    self.input_langs.push_back(lang);
}
```
In `process_char`, marks also inherit:
```rust
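self.input_units.push_back(ExpandUnit::Mark(ch));
self.input_langs.push_back(self.current_language);
// Sentence boundaries clear the context window.
// (Assumption: '.', '!' and '?' are the sentence-ending marks.)
if matches!(ch, '.' | '!' | '?') {
    self.context_window.clear();
}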
```
**PR 1 stub:** `detect_language` just returns `self.current_language` (no actual detection yet).
### 1.4 `try_expand` — Task Routing + Language Inheritance
```rust
fn try_expand(&mut self, is_final: bool) {
    'outer: while !self.input_units.is_empty() {
        // Parallel-queue invariant: both queues always have the same length.
        debug_assert!(self.input_units.len() == self.input_langs.len());
        // Route to the task list of the front unit's language.
        let front_lang = self.input_langs[0];
        let tasks = self.tasks_by_lang.get(&front_lang)
            .map(|v| v.as_slice())
            .unwrap_or(&[]);
        for task in tasks {
            match task.expand(&self.input_units) {
                Some(ExpandResult::Maybe) => {
                    // The task needs more input; wait unless the stream has ended.
                    if !is_final { break 'outer; }
                }
                Some(ExpandResult::Replace(n, new_units)) => {
                    debug_assert!(n > 0);
                    // Pop n units + langs
                    for _ in 0..n {
                        self.input_units.pop_front();
                        self.input_langs.pop_front();
                    }
                    // Prepend replacements with INHERITED language
                    for unit in new_units.into_iter().rev() {
                        self.input_units.push_front(unit);
                        self.input_langs.push_front(front_lang);
                    }
                    continue 'outer;
                }
                None => {}
            }
        }
        // No task matched: emit the front unit together with its language.
        if let Some(unit) = self.input_units.pop_front() {
            let lang = self.input_langs.pop_front()
                .unwrap_or(self.current_language);
            self.output_units.push_back((unit, lang));
        }
    }
}
```
**Invariant:** `input_units.len() == input_langs.len()` must always hold.
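
One way to make this invariant hard to break is to route every mutation through a small wrapper that always touches both queues together. The sketch below is illustrative only: `LangQueue` and its methods are hypothetical names, and `Language` / `ExpandUnit` are simplified stand-ins for the crate's real types.

```rust
use std::collections::VecDeque;

// Simplified stand-ins for the crate's real types (assumptions for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    English,
    Vietnamese,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ExpandUnit {
    Word(String),
    Number(String),
    Mark(char),
}

/// Hypothetical wrapper: units and languages are only pushed or popped together.
#[derive(Default)]
pub struct LangQueue {
    units: VecDeque<ExpandUnit>,
    langs: VecDeque<Language>,
}

impl LangQueue {
    pub fn push_back(&mut self, unit: ExpandUnit, lang: Language) {
        self.units.push_back(unit);
        self.langs.push_back(lang);
        debug_assert_eq!(self.units.len(), self.langs.len());
    }

    pub fn pop_front(&mut self) -> Option<(ExpandUnit, Language)> {
        let unit = self.units.pop_front()?;
        let lang = self.langs.pop_front()?;
        Some((unit, lang))
    }

    /// Replace the first `n` units; replacements inherit the old front unit's language.
    pub fn replace_front(&mut self, n: usize, new_units: Vec<ExpandUnit>) {
        let Some(&front_lang) = self.langs.front() else { return };
        let n = n.min(self.units.len());
        self.units.drain(..n);
        self.langs.drain(..n);
        // Push in reverse so the replacements keep their original order.
        for unit in new_units.into_iter().rev() {
            self.units.push_front(unit);
            self.langs.push_front(front_lang);
        }
        debug_assert_eq!(self.units.len(), self.langs.len());
    }

    pub fn len(&self) -> usize {
        self.units.len()
    }
}
```

Whether the extra type is worth it compared with two bare `VecDeque` fields plus a `debug_assert!` is a judgment call; the wrapper mainly helps if more call sites start mutating the queues.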
### 1.5 `TextUnit::Word` Gets Language
```rust
// Before:
pub enum TextUnit {
    Word(String),
    Space,
    ClauseBoundary(char),
    Punctuation(char),
}
// After:
pub enum TextUnit {
    Word(String, Language),
    Space,
    ClauseBoundary(char),
    Punctuation(char),
}
```
Replace `From<ExpandUnit> for TextUnit` with:
```rust
impl TextUnit {
    pub fn from_expand_unit(unit: ExpandUnit, language: Language) -> Self {
        match unit {
            ExpandUnit::Word(s) | ExpandUnit::Number(s) => TextUnit::Word(s, language),
            ExpandUnit::Mark(c) if c.is_whitespace() => TextUnit::Space,
            ExpandUnit::Mark(c) if matches!(c, ',' | '.' | '!' | '?' | ';' | ':') => {
                TextUnit::ClauseBoundary(c)
            }
            ExpandUnit::Mark(c) => TextUnit::Punctuation(c),
        }
    }
}
```
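
To make the mapping concrete, the following self-contained sketch re-declares minimal stand-ins for `Language`, `ExpandUnit` and `TextUnit` and converts a batch of `(unit, language)` pairs the way a caller of `push`/`finish` would. `convert_all` is a hypothetical helper used only for illustration.

```rust
// Minimal stand-ins for the crate's types (assumptions for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    Vietnamese,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ExpandUnit {
    Word(String),
    Number(String),
    Mark(char),
}

#[derive(Debug, Clone, PartialEq)]
pub enum TextUnit {
    Word(String, Language),
    Space,
    ClauseBoundary(char),
    Punctuation(char),
}

impl TextUnit {
    pub fn from_expand_unit(unit: ExpandUnit, language: Language) -> Self {
        match unit {
            // Words and numbers both become language-tagged words.
            ExpandUnit::Word(s) | ExpandUnit::Number(s) => TextUnit::Word(s, language),
            ExpandUnit::Mark(c) if c.is_whitespace() => TextUnit::Space,
            ExpandUnit::Mark(c) if matches!(c, ',' | '.' | '!' | '?' | ';' | ':') => {
                TextUnit::ClauseBoundary(c)
            }
            // Any other mark keeps its character; its language tag is dropped.
            ExpandUnit::Mark(c) => TextUnit::Punctuation(c),
        }
    }
}

/// Hypothetical helper: convert pairs as `push`/`finish` would yield them.
pub fn convert_all(pairs: Vec<(ExpandUnit, Language)>) -> Vec<TextUnit> {
    pairs
        .into_iter()
        .map(|(unit, lang)| TextUnit::from_expand_unit(unit, lang))
        .collect()
}
```

Note that only `TextUnit::Word` retains the language; the tag carried alongside marks is discarded by this conversion.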
### 1.6 `semantic.rs` — Update Pattern Match
```rust
impl SentenceUnit {
    pub fn from_text_unit(
        unit: TextUnit,
        phonemizer: &WordPhonemizer,
    ) -> crate::error::Result<Self> {
        match unit {
            TextUnit::Word(word, _lang) => {
                // Caller picks the right phonemizer; we just destructure
                Ok(SentenceUnit::Word(phonemizer.phonemize_word(&word)?))
            }
            TextUnit::Space => Ok(SentenceUnit::Space),
            TextUnit::ClauseBoundary(ch) => Ok(SentenceUnit::ClauseBoundary(ch)),
            TextUnit::Punctuation(ch) => Ok(SentenceUnit::Punctuation(ch)),
        }
    }
}
```
### 1.7 Callers Update
**`g2p/full.rs`:**
```rust
// TextUnit::from(unit) → TextUnit::from_expand_unit(unit, lang)
if let Some((unit, lang)) = expander.push(ch) {
    let text_unit = TextUnit::from_expand_unit(unit, lang);
    let su = SentenceUnit::from_text_unit(text_unit, &self.word_phonemizer)?;
    sentence_units.push(su);
}
```
**`g2p/streaming.rs`:** Same pattern.
**`tests/common/mod.rs`:**
```rust
pub fn collect_units(text: &str) -> Vec<TextUnit> {
    let mut expander = TextExpand::new(vec![]);
    let mut units = Vec::new();
    for ch in text.chars() {
        if let Some((unit, lang)) = expander.push(ch) {
            units.push(TextUnit::from_expand_unit(unit, lang));
        }
    }
    // Drain whatever is still buffered once the input ends.
    while let Some((unit, lang)) = expander.finish() {
        units.push(TextUnit::from_expand_unit(unit, lang));
    }
    units
}
```
Tests that match `TextUnit::Word(s)` become `TextUnit::Word(s, _)`.
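
For tests that compare whole word lists rather than matching single units, a pair of small helpers can keep the migration mechanical. This is a sketch under assumptions: `word_texts` and `tagged_words` are hypothetical names, and the enums are minimal stand-ins for the real ones.

```rust
// Minimal stand-ins for the crate's types (assumptions for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    Vietnamese,
}

#[derive(Debug, Clone, PartialEq)]
pub enum TextUnit {
    Word(String, Language),
    Space,
    ClauseBoundary(char),
    Punctuation(char),
}

/// Hypothetical helper for existing tests: word texts only, language ignored.
pub fn word_texts(units: &[TextUnit]) -> Vec<&str> {
    units
        .iter()
        .filter_map(|u| match u {
            TextUnit::Word(s, _) => Some(s.as_str()),
            _ => None,
        })
        .collect()
}

/// Hypothetical helper for new tests that also check the language tag.
pub fn tagged_words(units: &[TextUnit]) -> Vec<(&str, Language)> {
    units
        .iter()
        .filter_map(|u| match u {
            TextUnit::Word(s, lang) => Some((s.as_str(), *lang)),
            _ => None,
        })
        .collect()
}
```

Existing assertions can keep their expected word lists by going through `word_texts`, while detection tests added in PR 3 could assert on `tagged_words` instead.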
### 1.8 Internal Tests Update
`text_expand.rs` unit tests (`run_test`, `test_text_expand_cases_en/vi`) need updating to handle `(ExpandUnit, Language)` return type. The expected values remain the same; just unwrap the tuple.
---
## PR 2: Multi-Language G2P Layer
### 2.1 `FullG2p`
```rust
pub struct FullG2p {
    phonemizers: HashMap<Language, WordPhonemizer>,
    sentence_upgrades: HashMap<Language, FullSentencePhonemeUpgrade>,
    default_language: Language,
    languages: Vec<Language>,
}

impl FullG2p {
    /// Single-language (backward compat)
    pub fn new(language: Language) -> Result<Self> {
        Self::with_languages(&[language], language)
    }

    /// Multi-language
    pub fn with_languages(languages: &[Language], default: Language) -> Result<Self> {
        let mut phonemizers = HashMap::new();
        let mut sentence_upgrades = HashMap::new();
        for &lang in languages {
            phonemizers.insert(lang, WordPhonemizer::new(lang)?);
            sentence_upgrades.insert(lang, FullSentencePhonemeUpgrade::new(lang)?);
        }
        Ok(Self {
            phonemizers,
            sentence_upgrades,
            default_language: default,
            languages: languages.to_vec(),
        })
    }
}
```
The `g2p` method picks the right phonemizer per word:
```rust
if let Some((unit, lang)) = expander.push(ch) {
    let text_unit = TextUnit::from_expand_unit(unit, lang);
    let phonemizer = &self.phonemizers[&lang];
    let su = SentenceUnit::from_text_unit(text_unit, phonemizer)?;
    sentence_units.push(su);
}
```
Use `default_language` for sentence upgrade (prosody). This is safe because:
- Vietnamese stress is handled per-word via `WordPhoneme.language` in the renderer
- English stress promotion only applies to English words
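
Indexing the map with `&self.phonemizers[&lang]` panics if a word is ever tagged with a language that was not registered. A possible non-panicking variant falls back to the default language, as sketched below. `Registry`, `Phonemizer` and `phonemizer_for` are hypothetical names; `Phonemizer` is only a stand-in for `WordPhonemizer`.

```rust
use std::collections::HashMap;

// Minimal stand-in for the crate's `Language` (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    English,
    Vietnamese,
}

/// Stand-in for `WordPhonemizer`; it only remembers its language.
#[derive(Debug, PartialEq)]
pub struct Phonemizer {
    pub language: Language,
}

pub struct Registry {
    phonemizers: HashMap<Language, Phonemizer>,
    default_language: Language,
}

impl Registry {
    pub fn new(languages: &[Language], default_language: Language) -> Self {
        let mut phonemizers = HashMap::new();
        for &lang in languages {
            phonemizers.insert(lang, Phonemizer { language: lang });
        }
        // Guarantee that the fallback entry exists even if the caller omitted it.
        phonemizers
            .entry(default_language)
            .or_insert(Phonemizer { language: default_language });
        Self { phonemizers, default_language }
    }

    /// Look up the phonemizer for `lang`, falling back to the default language.
    pub fn phonemizer_for(&self, lang: Language) -> &Phonemizer {
        self.phonemizers
            .get(&lang)
            .or_else(|| self.phonemizers.get(&self.default_language))
            .expect("default language is inserted in `new`")
    }
}
```

With detection stubbed out in PR 1 and PR 2, the fallback path should never trigger; it only matters once a detector can return a language outside the configured set.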
### 2.2 `StreamingG2P`
Same pattern: `HashMap<Language, WordPhonemizer>`, per-word lookup.
### 2.3 Renderer Fix (`sentence_upgrade/mod.rs`)
The `Renderer` currently uses `self.language` for Vietnamese-specific logic. Fix to use per-word language:
```rust
// Line ~302 — change:
if self.language == Language::Vietnamese {
// To:
if word.language == Language::Vietnamese {
// Line ~316 — change:
self.language == Language::English,
// To:
word.language == Language::English,
```
Also: `phdata.select_table_by_name(word.language.as_str())` must be called per word in the renderer, same pattern as `tests/common/mod.rs:57`.
### 2.4 Tests
- All existing single-language tests pass (regression)
- New test: `FullG2p::with_languages(&[EN, VI], EN)` phonemizes English text correctly
- New test: `FullG2p::with_languages(&[EN, VI], VI)` phonemizes Vietnamese text correctly
- No mixed-text test yet (detection isn't wired)
---
## PR 3: `lingua` Integration
### 3.1 Dependency
```toml
[dependencies]
lingua = { version = "1.6", default-features = false, features = ["english", "vietnamese"] }
```
### 3.2 Internal LanguageDetector Trait
```rust
// In text_expand.rs (or a new file src/lang_detect.rs)
pub(crate) trait LanguageDetector: Send + Sync {
    /// Returns the most likely language for `context` and its confidence, if any.
    fn detect(&self, context: &str) -> Option<(Language, f64)>;
}
```
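
Because the detector sits behind a trait, unit tests do not need `lingua` at all. A minimal sketch of a test double follows; `FixedDetector` is a hypothetical name, the trait is re-declared here (as `pub` instead of `pub(crate)`) so the example stands alone, and `Language` is a stand-in for the crate's enum.

```rust
// Minimal stand-in for the crate's `Language` (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    Vietnamese,
}

// Re-declared copy of the trait above so the example compiles on its own.
pub trait LanguageDetector: Send + Sync {
    fn detect(&self, context: &str) -> Option<(Language, f64)>;
}

/// Hypothetical test double: always reports the same language and confidence.
pub struct FixedDetector {
    pub language: Language,
    pub confidence: f64,
}

impl LanguageDetector for FixedDetector {
    fn detect(&self, context: &str) -> Option<(Language, f64)> {
        // Mirror a real detector's "no answer" case for blank input.
        if context.trim().is_empty() {
            None
        } else {
            Some((self.language, self.confidence))
        }
    }
}

/// Box the double the same way `TextExpand` would store a real detector.
pub fn boxed_fixed(language: Language, confidence: f64) -> Box<dyn LanguageDetector> {
    Box::new(FixedDetector { language, confidence })
}
```

A double like this would let the hysteresis rule be tested with exact confidence values instead of depending on what the statistical model returns for a given phrase.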
### 3.3 `lingua` Implementation
```rust
pub(crate) struct LinguaDetector {
    detector: lingua::LanguageDetector,
}

impl LinguaDetector {
    pub fn new(languages: &[Language]) -> Self {
        // Map our language enum onto lingua's.
        let lingua_langs: Vec<lingua::Language> = languages.iter()
            .map(|l| match l {
                Language::English => lingua::Language::English,
                Language::Vietnamese => lingua::Language::Vietnamese,
            })
            .collect();
        let detector = lingua::LanguageDetectorBuilder::from_languages(&lingua_langs)
            .with_minimum_relative_distance(0.25)
            .build();
        Self { detector }
    }
}
impl LanguageDetector for LinguaDetector {
    fn detect(&self, context: &str) -> Option<(Language, f64)> {
        // lingua returns `Vec<(lingua::Language, f64)>`, sorted by descending confidence,
        // so the first tuple is the top candidate.
        let confidences = self.detector.compute_language_confidence_values(context);
        confidences.into_iter().next().map(|(lingua_lang, confidence)| {
            let lang = match lingua_lang {
                lingua::Language::English => Language::English,
                lingua::Language::Vietnamese => Language::Vietnamese,
                _ => Language::English,
            };
            (lang, confidence)
        })
    }
}
```
### 3.4 Detection Algorithm in `TextExpand`
```rust
const CONTEXT_WINDOW_SIZE: usize = 5;
const HYSTERESIS_THRESHOLD: f64 = 0.20;

fn detect_language(&mut self, word: &str) -> Language {
    let detector = match &self.detector {
        Some(d) => d,
        None => return self.current_language, // single-language fast path
    };
    // Update context window
    self.context_window.push_back(word.to_string());
    if self.context_window.len() > CONTEXT_WINDOW_SIZE {
        self.context_window.pop_front();
    }
    let context: String = self.context_window.iter()
        .map(|s| s.as_str())
        .collect::<Vec<_>>()
        .join(" ");
    // Switch only when another language clearly beats the 50% mark (hysteresis).
    if let Some((top_lang, top_confidence)) = detector.detect(&context) {
        if top_lang != self.current_language
            && top_confidence > 0.5 + HYSTERESIS_THRESHOLD
        {
            self.current_language = top_lang;
        }
    }
    self.current_language
}
```
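
The switching rule can be exercised in isolation. The sketch below mirrors the window and threshold logic in a standalone `LanguageTracker` and drives it with a toy detector. Everything here is an assumption for illustration: `LanguageTracker`, `observe` and `toy_detect` are hypothetical, and `toy_detect` is a crude non-ASCII heuristic, not `lingua`.

```rust
use std::collections::VecDeque;

// Minimal stand-in for the crate's `Language` (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    Vietnamese,
}

const CONTEXT_WINDOW_SIZE: usize = 5;
const HYSTERESIS_THRESHOLD: f64 = 0.20;

/// Hypothetical standalone tracker that applies the same switching rule.
pub struct LanguageTracker<F: Fn(&str) -> Option<(Language, f64)>> {
    detect: F,
    current: Language,
    window: VecDeque<String>,
}

impl<F: Fn(&str) -> Option<(Language, f64)>> LanguageTracker<F> {
    pub fn new(default: Language, detect: F) -> Self {
        Self { detect, current: default, window: VecDeque::new() }
    }

    /// Feed one word and return the language assigned to it.
    pub fn observe(&mut self, word: &str) -> Language {
        self.window.push_back(word.to_string());
        if self.window.len() > CONTEXT_WINDOW_SIZE {
            self.window.pop_front();
        }
        let context = self.window.iter().map(String::as_str).collect::<Vec<_>>().join(" ");
        if let Some((top, confidence)) = (self.detect)(&context) {
            // Switch only when a different language clears 0.5 + threshold.
            if top != self.current && confidence > 0.5 + HYSTERESIS_THRESHOLD {
                self.current = top;
            }
        }
        self.current
    }
}

/// Toy detector (NOT lingua): confidence is the share of words that contain
/// non-ASCII characters, treated as "Vietnamese-looking".
pub fn toy_detect(context: &str) -> Option<(Language, f64)> {
    let words: Vec<&str> = context.split_whitespace().collect();
    if words.is_empty() {
        return None;
    }
    let non_ascii = words.iter().filter(|w| !w.is_ascii()).count() as f64;
    let vi_share = non_ascii / words.len() as f64;
    if vi_share >= 0.5 {
        Some((Language::Vietnamese, vi_share))
    } else {
        Some((Language::English, 1.0 - vi_share))
    }
}
```

Feeding `hello`, `chào`, `bạn`, `ơi` into a tracker that starts in English yields English for the first three words and Vietnamese only for the fourth, because the toy confidence reaches 0.75 only then. The real detector will behave differently, but the example shows the kind of switching lag that the window size and threshold introduce.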
### 3.5 Wire It Up
```rust
pub fn with_languages(languages: &[Language], default: Language) -> Self {
    let mut tasks_by_lang = HashMap::new();
    for &lang in languages {
        tasks_by_lang.insert(lang, get_tasks_for_language(lang));
    }
    let detector = LinguaDetector::new(languages);
    Self {
        tasks_by_lang,
        detector: Some(Box::new(detector)),
        current_language: default,
        // ...
    }
}
```
### 3.6 Tests
```rust
#[test]
fn detects_english_words() {
    let mut expander = TextExpand::with_languages(
        &[Language::English, Language::Vietnamese],
        Language::English,
    );
    // Push "hello world" → both words detected as English
}

#[test]
fn detects_vietnamese_words() {
    // Push "xin chào bạn" → all words detected as Vietnamese
}

#[test]
fn switches_language_mid_sentence() {
    // Push "Hello, tôi tên là John" → EN, VI, VI, VI, EN
}

#[test]
fn numbers_inherit_current_language() {
    // Push "giá 100 đồng" → 100 inherits Vietnamese
}

#[test]
fn sentence_boundary_resets_context() {
    // Push "Xin chào. Hello world" → VI then EN after boundary
}
```
### 3.7 E2E Fixtures
Add `tests/fixtures/mixed.jsonl` with mixed EN/VI sentences and expected phoneme output. Test with `FullG2p::with_languages`.
---
## PR 4: Polish
- **Feature gate:** `lingua` behind cargo feature `lang-detect` (optional)
- **Benchmarks:** Measure detection overhead per word
- **Threshold tuning:** Empirical testing with real mixed-text TTS inputs
- **Doc update:** Update `CLAUDE.md` architecture section, remove `docs/dynamic_language.md`
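
For the feature gate, the manifest side would presumably mark `lingua` as an optional dependency listed under a `lang-detect` feature. On the Rust side, one possible shape is sketched below. `detection_compiled_in` and `effective_languages` are hypothetical helpers, and degrading to the default language when the feature is off is an assumption, not a decision recorded in this plan.

```rust
// Minimal stand-in for the crate's `Language` (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    Vietnamese,
}

/// True only when the crate is built with `--features lang-detect`.
pub const fn detection_compiled_in() -> bool {
    cfg!(feature = "lang-detect")
}

/// Hypothetical helper: without the feature, serve only the default language,
/// so callers fall back to the single-language path.
pub fn effective_languages(requested: &[Language], default: Language) -> Vec<Language> {
    if detection_compiled_in() {
        requested.to_vec()
    } else {
        vec![default]
    }
}
```

An alternative is to remove the multi-language constructors entirely with `#[cfg(feature = "lang-detect")]`, which turns misuse into a compile error instead of a silent fallback.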
---
## Risks & Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| `lingua` binary size (~2-5 MB for 2 langs) | Larger binary | Feature-gate in PR 4 |
| Short ambiguous words ("a", "la") | Wrong language detection | Context window + hysteresis |
| `PhonemeData` table switching per word | Performance regression | Profile; tables are small lookups |
| `promote_clauses` with mixed phonemes | Corrupted stress bytes | Only promote English words (check `word.language`) |
| Breaking `push/finish` signature | All callers must update | PR 1 does this atomically |
| Parallel queue invariant violation | Panic/wrong language | `debug_assert!(input_units.len() == input_langs.len())` |
---
## File Change Summary
| File | PR 1 | PR 2 | PR 3 |
|------|------|------|------|
| `Cargo.toml` | — | — | add `lingua` |
| `src/text_expand.rs` | struct, push/finish, try_expand, flush_buffer, TextUnit | — | detect_language, with_languages |
| `src/semantic.rs` | TextUnit::Word pattern | — | — |
| `src/g2p/full.rs` | caller update | HashMap phonemizers, with_languages | — |
| `src/g2p/streaming.rs` | caller update | HashMap phonemizers, with_languages | — |
| `src/sentence_upgrade/mod.rs` | — | Renderer per-word language | — |
| `src/expand_tasks/**` | — | — | — |
| `tests/common/mod.rs` | caller update | — | — |
| `tests/e2e.rs` | — | — | mixed fixtures |