# Dynamic Language Detection & Expansion Plan
## 1. Overview
Currently, `TextExpand` is initialized with a static `Language` and processes text character-by-character using expansion tasks tailored exclusively for that language. To support dynamic language switching in a streaming context, we need to introduce a mechanism to heuristically determine the language of incoming text and apply the correct expansion tasks on a per-word basis.
## 2. Core Components
### 2.1. Language-Aware Tokens (`ExpandUnit`)
Every unit processed by the system must carry language metadata, allowing downstream tasks and outputs to unambiguously know which language context applies to that token.
```rust
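/// A token tagged with the language context that applies to it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExpandUnit {
    /// A word, e.g. "chat".
    Word(String, Language),
    /// A punctuation mark, e.g. '.'.
    Mark(char, Language),
    /// A numeric token, e.g. "12".
    Number(String, Language),
}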
```
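Because every variant carries the tag, downstream code will likely want to read it without matching on the variant each time. A minimal sketch of such an accessor follows; the method name `language` and the two-variant `Language` enum are assumptions made for illustration, standing in for whatever language type the crate actually uses.

```rust
/// Stand-in for the real `Language` type; the variants are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    English,
    Vietnamese,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExpandUnit {
    Word(String, Language),
    Mark(char, Language),
    Number(String, Language),
}

impl ExpandUnit {
    /// Hypothetical helper: returns the language tag regardless of the variant.
    pub fn language(&self) -> Language {
        match self {
            ExpandUnit::Word(_, lang)
            | ExpandUnit::Mark(_, lang)
            | ExpandUnit::Number(_, lang) => *lang,
        }
    }
}
```

With this in place, `ExpandUnit::Number("12".to_string(), Language::Vietnamese).language()` yields `Language::Vietnamese`, which is the value the router needs.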
### 2.2. Refactoring `TextExpand` Struct
Instead of holding a single list of tasks, the expanding state machine will be refactored to hold tasks for all supported languages provided during initialization. It will also maintain the necessary state for the `lingua::LanguageDetector`.
```rust
use std::collections::{HashMap, VecDeque};

pub struct TextExpand {
    // Tasks mapped by language, initialized from a list of supported languages
    tasks_by_lang: HashMap<Language, Vec<Box<dyn ExpandTask>>>,

    // Core queues
    input_units: VecDeque<ExpandUnit>,
    output_units: VecDeque<ExpandUnit>,

    // Language detection state
    detector: lingua::LanguageDetector,
    current_language: Language,
    // A sliding window of recent words to provide context to the detector
    context_window: VecDeque<String>,

    // Tokenizer buffer
    buffer: String,
    buffer_is_number: bool,
}

impl TextExpand {
    pub fn new(supported_languages: &[Language], default_language: Language) -> Self {
        // ... build lingua detector from supported_languages
        // ... build tasks_by_lang from supported_languages
        // ...
        todo!("construct TextExpand; the body is elided in this plan")
    }
}
```
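One way the `tasks_by_lang` table might be built is sketched below. The `ExpandTask` trait shown here is a reduced stand-in, and `tasks_for`, `NumberToWords`, and `build_tasks_by_lang` are hypothetical names. The sketch also assumes the default language should always have an entry, even when it is missing from the supported list, so that routing never finds an empty slot for it.

```rust
use std::collections::HashMap;

/// Stand-in for the real `Language` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    English,
    Vietnamese,
}

/// Reduced stand-in for the real trait; a name is enough for this sketch.
pub trait ExpandTask {
    fn name(&self) -> &'static str;
}

/// Hypothetical task, parameterized by the language it serves.
struct NumberToWords(Language);

impl ExpandTask for NumberToWords {
    fn name(&self) -> &'static str {
        match self.0 {
            Language::English => "number-to-words (en)",
            Language::Vietnamese => "number-to-words (vi)",
        }
    }
}

/// Hypothetical factory deciding which tasks a language receives.
fn tasks_for(lang: Language) -> Vec<Box<dyn ExpandTask>> {
    let task: Box<dyn ExpandTask> = Box::new(NumberToWords(lang));
    vec![task]
}

/// Builds the per-language table, adding the default language if it is absent.
pub fn build_tasks_by_lang(
    supported: &[Language],
    default_language: Language,
) -> HashMap<Language, Vec<Box<dyn ExpandTask>>> {
    let mut table = HashMap::new();
    for &lang in supported.iter().chain(std::iter::once(&default_language)) {
        // `entry` keeps the first task list if a language is listed twice.
        table.entry(lang).or_insert_with(|| tasks_for(lang));
    }
    table
}
```

Calling `build_tasks_by_lang(&[Language::English], Language::Vietnamese)` produces a table with two entries, one per language.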
## 3. The Detection Algorithm (Heuristic)
Single-word language detection is notoriously unreliable (e.g., "a" could be English or Spanish; "chat" could be English or French). To counter this, we use a **sliding window context** and a **hysteresis** threshold.
### 3.1. Buffering & Context
1. **Push to Context**: When a new word is flushed from the tokenizer buffer, append it to `context_window`. The window is capped at a small fixed size (3 to 5 words), and the oldest word is evicted once the cap is exceeded.
2. **Skip Numbers/Marks**: Numbers and punctuation (`Mark`) do not provide linguistic clues. They simply inherit the `current_language`.
3. **Sentence Boundaries**: Strong punctuation (e.g., `.`, `?`, `!`) should probably clear the `context_window`, as sentences often mark natural boundaries for language switches.
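The three rules above can be sketched as a small wrapper around a `VecDeque`. The type `ContextWindow`, its method names, and the cap of 4 words are assumptions chosen for illustration; the real state lives in `TextExpand::context_window`.

```rust
use std::collections::VecDeque;

/// Hypothetical cap; the plan suggests somewhere between 3 and 5 words.
const MAX_CONTEXT_WORDS: usize = 4;

#[derive(Debug, Default)]
pub struct ContextWindow {
    words: VecDeque<String>,
}

impl ContextWindow {
    /// Adds a word and evicts the oldest one once the cap is exceeded.
    pub fn push_word(&mut self, word: &str) {
        self.words.push_back(word.to_string());
        while self.words.len() > MAX_CONTEXT_WORDS {
            self.words.pop_front();
        }
    }

    /// Marks never enter the window; strong punctuation resets it.
    pub fn observe_mark(&mut self, mark: char) {
        if matches!(mark, '.' | '?' | '!') {
            self.words.clear();
        }
    }

    /// Joins the window into the string that would be handed to the detector.
    pub fn as_text(&self) -> String {
        self.words
            .iter()
            .map(String::as_str)
            .collect::<Vec<_>>()
            .join(" ")
    }
}
```

Numbers are simply never passed to `push_word`, so they leave the window untouched and inherit `current_language` as described in rule 2.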
### 3.2. Evaluation & Stickiness (Hysteresis)
When evaluating the context window string using `lingua::LanguageDetector::compute_language_confidence_values()`:
1. Compare the confidence score of `current_language` with that of the `top_detected_language`.
2. To prevent random fluttering on ambiguous words ("stickiness"), switch `current_language` to `top_detected_language` ONLY if both conditions hold:
- The languages are different.
- The confidence of `top_detected_language` exceeds that of `current_language` by more than a fixed `THRESHOLD` (e.g., 0.20 on the detector's 0.0 to 1.0 confidence scale).
3. Tag the newly emitted `ExpandUnit` with the resolved `current_language`.
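The switching rule can be sketched as a pure function, which also makes it easy to test without a real detector. This sketch assumes the detector's output is available as `(Language, f64)` pairs; the function name `resolve_language`, the stand-in `Language` enum, and the choice to treat a missing score for the current language as 0.0 are all assumptions.

```rust
/// Stand-in for the real `Language` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    English,
    French,
}

/// Example margin taken from the plan; it would need tuning on real text.
const THRESHOLD: f64 = 0.20;

/// Decides which language the next unit should be tagged with.
pub fn resolve_language(current: Language, confidences: &[(Language, f64)]) -> Language {
    // Pick the highest-scoring candidate without assuming the slice is sorted.
    let top = confidences
        .iter()
        .copied()
        .max_by(|a, b| a.1.total_cmp(&b.1));
    let (top_lang, top_conf) = match top {
        Some(pair) => pair,
        None => return current, // No evidence at all: stay put.
    };
    if top_lang == current {
        return current;
    }
    // Assumption: a language absent from the results scores 0.0.
    let current_conf = confidences
        .iter()
        .find(|(lang, _)| *lang == current)
        .map(|(_, conf)| *conf)
        .unwrap_or(0.0);
    // Hysteresis: only a clear margin justifies a switch.
    if top_conf - current_conf > THRESHOLD {
        top_lang
    } else {
        current
    }
}
```

For example, with English as the current language, scores of French 0.55 and English 0.45 keep English, while French 0.90 and English 0.10 trigger a switch to French.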
## 4. The Expansion Algorithm (`try_expand`)
The `try_expand` loop must be updated to route units to the correct language's expansion tasks.
### 4.1. Routing
For each iteration over `input_units`:
1. Look at the `Language` of the front unit.
2. Fetch the corresponding task list from `tasks_by_lang`.
3. Try expanding using only those tasks.
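A minimal sketch of these routing steps is shown below. The `ExpandTask` trait here is deliberately reduced to a text-in, text-out signature, and `route`, `TwelveEn`, and `TwelveVi` are hypothetical names; the real trait and its `ExpandResult` type will differ. The sketch also assumes the first task that produces a result wins.

```rust
use std::collections::HashMap;

/// Stand-in for the real `Language` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    English,
    Vietnamese,
}

/// Simplified stand-in: a task either rewrites the text or declines.
pub trait ExpandTask {
    fn expand(&self, text: &str) -> Option<String>;
}

/// Hypothetical English task that only knows the number 12.
pub struct TwelveEn;
impl ExpandTask for TwelveEn {
    fn expand(&self, text: &str) -> Option<String> {
        (text == "12").then(|| "twelve".to_string())
    }
}

/// Hypothetical Vietnamese task that only knows the number 12.
pub struct TwelveVi;
impl ExpandTask for TwelveVi {
    fn expand(&self, text: &str) -> Option<String> {
        (text == "12").then(|| "mười hai".to_string())
    }
}

/// Routes one unit's text to the task list registered for its language.
pub fn route(
    tasks_by_lang: &HashMap<Language, Vec<Box<dyn ExpandTask>>>,
    text: &str,
    lang: Language,
) -> Option<String> {
    // Steps 1 and 2: use the unit's language to fetch its task list.
    let tasks = tasks_by_lang.get(&lang)?;
    // Step 3: try only those tasks, stopping at the first that fires.
    tasks.iter().find_map(|task| task.expand(text))
}
```

With both tasks registered, routing "12" as English gives "twelve" and routing it as Vietnamese gives "mười hai"; a language with no registered tasks yields `None`, leaving the unit unchanged.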
### 4.2. Output Inheritance
When a task returns an `ExpandResult::Replace` (e.g., expanding "12" to "twelve" or "mười hai"):
- The task outputs new raw `ExpandUnit`s.
- The state machine must intercept these new units and tag them with the same `Language` that triggered the replacement.
- This ensures that a Vietnamese task expanding "12" produces tokens that stay tagged as Vietnamese when they circle back into the queue.
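The interception step might look like the sketch below. `with_language` and `requeue_replacement` are hypothetical helpers, and placing the replacement at the front of the queue in its original order is an assumption about how units "circle back"; the real state machine may requeue differently.

```rust
use std::collections::VecDeque;

/// Stand-in for the real `Language` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    English,
    Vietnamese,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExpandUnit {
    Word(String, Language),
    Mark(char, Language),
    Number(String, Language),
}

impl ExpandUnit {
    /// Hypothetical helper: returns the same unit with its tag replaced.
    pub fn with_language(self, lang: Language) -> Self {
        match self {
            ExpandUnit::Word(text, _) => ExpandUnit::Word(text, lang),
            ExpandUnit::Mark(mark, _) => ExpandUnit::Mark(mark, lang),
            ExpandUnit::Number(text, _) => ExpandUnit::Number(text, lang),
        }
    }
}

/// Re-tags replacement units with the language of the unit that triggered
/// them, then puts them at the front of the queue in their original order.
pub fn requeue_replacement(
    queue: &mut VecDeque<ExpandUnit>,
    replacement: Vec<ExpandUnit>,
    source_lang: Language,
) {
    // Pushing in reverse keeps the first replacement unit at the very front.
    for unit in replacement.into_iter().rev() {
        queue.push_front(unit.with_language(source_lang));
    }
}
```

Overwriting the tag unconditionally means a task does not need to know which language it was invoked for; whatever tag it emits is replaced by the triggering unit's language.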
## 5. Next Steps
1. Refactor `ExpandUnit` so that every variant carries a `Language` field.
2. Update the `ExpandTask` trait and all implementations to expect and produce language-tagged units.
3. Rebuild `TextExpand` around the `lingua` dependency, testing the sliding window logic in isolation.
4. Update `try_expand` to use task routing and language inheritance.