uniworld 0.2.0

Correct Unicode text handling for every script: bidi, line breaking, segmentation, normalization
# Script-specific usage and tests

This directory is for **per-script usage notes and tests** that complement the main API docs and the Unicode showcase file.

The primary “live” showcase document is:

- `docs/UniWorld_Unicode_Showcase_TEST_OUTPUT.md`

That file contains representative text samples for many writing systems. The notes here describe how to use those samples when testing UniWorld.

## 1. Tier 1 scripts and UniWorld

Tier 1 scripts, for which UniWorld aims to provide especially robust behavior, include:

- **Latin and extensions** (accents, ligatures, compatibility forms).
- **Greek and Cyrillic** (case mapping, final sigma).
- **Right-to-left scripts**: Arabic, Hebrew (bidi and cursor movement).
- **Indic and Brahmic scripts**: Devanagari, Bengali, Gurmukhi, Tamil, Sinhala, etc. (conjuncts, virama handling).
- **Southeast Asian no-space scripts**: Thai, Lao, Khmer, Myanmar (dictionary-based line breaking).
- **CJK**: Chinese, Japanese, Korean (full-width vs ASCII width, segmentation).
- **Emoji and symbols**: ZWJ sequences, flags, skin tones, box drawing.

## 2. How to test a script with UniWorld

For any script sample from the showcase file:

1. **Segmentation**  
   - Run grapheme, word, and sentence boundary functions on the sample.
   - Verify expected clustering (no broken emoji ZWJ sequences; Indic conjuncts intact; regional indicator pairs as single clusters).

2. **Line breaking**  
   - Apply `line_break_opportunities`, or `line_break_opportunities_with_dictionary` for scripts that need dictionary-based breaking.
   - Confirm that line breaks avoid splitting inside grapheme clusters and respect dictionary-based segmentation for Thai/Lao/Khmer/Myanmar.

3. **Normalization**  
   - Compare NFC vs NFD vs NFKC vs NFKD on samples with combining marks and compatibility characters (ligatures, fractions).
   - Ensure canonically equivalent strings compare equal once both are normalized to the same form.

4. **Width and truncation**  
   - Use `display_width` and `truncate_display_width` on CJK/emoji-rich strings to check visual truncation in terminal-like contexts.

5. **Cursor and selection**  
   - For mixed BiDi samples, test both logical and visual cursor movement and word selection.
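
The normalization check in step 3 can be sketched independently of UniWorld. This document does not show UniWorld's normalization signatures, so the sketch below uses Python's standard `unicodedata` module as a reference oracle. The helper names `normalization_report` and `canonically_equal` are hypothetical and exist only for illustration. The same expectations should hold for whatever normalization functions UniWorld exposes.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalization_report(sample: str) -> dict[str, str]:
    """Return the sample normalized under each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, sample) for form in FORMS}


def canonically_equal(a: str, b: str) -> bool:
    """True when two strings are equal after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Precomposed U+00E9 versus "e" followed by U+0301 COMBINING ACUTE ACCENT.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The raw strings differ, but they are canonically equivalent.
assert composed != decomposed
assert canonically_equal(composed, decomposed)

# Compatibility characters: U+FB01 (fi ligature) and U+00BD (one half).
report = normalization_report("\ufb01 \u00bd")

# Canonical forms leave compatibility characters untouched.
assert report["NFC"] == "\ufb01 \u00bd"
# Compatibility forms expand them; U+2044 is FRACTION SLASH.
assert report["NFKC"] == "fi 1\u20442"
```

Running the same samples through both the oracle and the library under test, then comparing the results form by form, shows quickly which normalization form a mismatch belongs to.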

These patterns apply across all Tier 1 scripts, with the showcase document providing the concrete strings to use.

## 3. Future per-script docs

If needed, additional markdown files can be added here, such as:

- `latin.md` — details on normalization and ligatures.
- `rtl.md` — bidi pitfalls and examples.
- `indic.md` — conjunct clusters, virama rules, and cursor behavior.
- `se_asian.md` — dictionary-based line breaking examples and expected breaks.

For Phase 3, the combination of `docs/UniWorld_Unicode_Showcase_TEST_OUTPUT.md` and this overview is sufficient to guide script-focused testing and documentation.