<p align="center">
<a href="https://crates.io/crates/normy">
<img src="https://img.shields.io/crates/v/normy?style=flat-square" alt="Crates.io">
</a>
<a href="https://docs.rs/normy">
<img src="https://img.shields.io/docsrs/normy?style=flat-square" alt="Docs.rs">
</a>
<a href="https://github.com/okayasl/normy/actions/workflows/ci.yml">
<img src="https://github.com/okayasl/normy/actions/workflows/ci.yml/badge.svg?branch=main" alt="Build Status">
</a>
<a href="LICENSE">
<img src="https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue?style=flat-square" alt="License: MIT OR Apache-2.0">
</a>
</p>
# 🦀 Normy
**Ultra-fast, zero-copy text normalization, built for Rust NLP pipelines & tokenizers**.
Flexible enough for any high-throughput multilingual text processing (search, logs, APIs, data pipelines, …)
Normy delivers **extreme performance** through automatic iterator fusion and precise early-exit checks, while respecting language-specific rules (e.g., Turkish dotted/dotless I, German ß folding).
- **Zero-copy**: Returns immediately, without allocating, when the input needs no changes.
- **Automatic fusion**: Fuses eligible stages (two or more fusable stages) into a single pass for better cache locality.
- **Locale-accurate**: Built-in rules for correctness across scripts.
- **Format-aware**: Cleans HTML/Markdown while preserving content.
## Why Normy?
Traditional normalizers allocate on every call, even for clean text. Normy eliminates this overhead:
- On **already-normalized text** (common in production streams): **up to 51× higher throughput** than HuggingFace `tokenizers` normalizers due to true zero-copy.
- On **text requiring transformation**: **3.7–4.1× faster** through fusion and optimized stages.
## Performance Comparison
Measured against HuggingFace `tokenizers` normalizers on 64 KiB inputs (200 samples each).
### Complex BERT-like Pipeline (Chinese + Strip + Whitespace + NFD + Diacritics + Lowercase)
- **Already normalized text**: 19.3× faster (100% zero-copy)
- **Needs transform**: 3.7× faster
### Simple Pipeline (French + Lowercase + Transliterate)
- **Already normalized text**: 51.3× faster (100% zero-copy)
- **With accents/diacritics**: 4.1× faster
Numbers represent geometric mean over 200 samples. Hardware, OS, and input distribution can affect results.
See **/benches/comparison_tokenizers_bench.rs** for reproducible results.
## 💾 Installation
Add Normy to your project:
```bash
cargo add normy
```
## ⚡ Quickstart
Normy uses a **fluent builder** pattern with automatic fusion detection.
```rust
use std::error::Error;

use normy::{
    COLLAPSE_WHITESPACE_UNICODE, CaseFold, DEU, FRA, JPN, LowerCase, Normy, RemoveDiacritics,
    SegmentWords, TUR, Transliterate, UnifyWidth, ZHO,
};

fn main() -> Result<(), Box<dyn Error>> {
    // ────────────────────────────────────────────────────────────────
    // TURKISH (Turkic) - famous for its dotted/dotless I distinction
    // ────────────────────────────────────────────────────────────────
    let tur = Normy::builder()
        .lang(TUR)
        .add_stage(LowerCase) // Critical: İ → i, I → ı
        .build();
    println!(
        "Turkish : {}",
        tur.normalize("KIZILIRMAK NEHRİ TÜRKİYE'NİN EN UZUN NEHRİDİR.")?
    );
    // → kızılırmak nehri türkiye'nin en uzun nehridir.

    // ────────────────────────────────────────────────────────────────
    // GERMAN (Germany/Austria/Switzerland) - ß and umlaut handling
    // ────────────────────────────────────────────────────────────────
    let deu = Normy::builder()
        .lang(DEU)
        .add_stage(CaseFold) // ß → ss
        .add_stage(Transliterate) // Ä → ae, Ö → oe, Ü → ue
        .build();
    println!(
        "German : {}",
        deu.normalize("Grüße aus München! Die Straße ist sehr schön.")?
    );
    // → gruesse aus muenchen! die strasse ist sehr schoen.

    // ────────────────────────────────────────────────────────────────
    // FRENCH (France/Belgium/Canada/etc.) - classic accented text
    // ────────────────────────────────────────────────────────────────
    let fra = Normy::builder()
        .lang(FRA)
        .add_stage(CaseFold)
        .add_stage(RemoveDiacritics) // é → e, ç → c, etc.
        .build();
    println!(
        "French : {}",
        fra.normalize("Bonjour ! J'adore le café et les croissants à Paris.")?
    );
    // → bonjour ! j'adore le cafe et les croissants a paris.

    // ────────────────────────────────────────────────────────────────
    // CHINESE (Simplified, China) - fullwidth & word segmentation
    // ────────────────────────────────────────────────────────────────
    let zho = Normy::builder()
        .lang(ZHO)
        .add_stage(UnifyWidth)
        .add_stage(COLLAPSE_WHITESPACE_UNICODE)
        .add_stage(SegmentWords) // boundary segmentation (CJK unigrams not enabled here)
        .build();
    println!(
        "Chinese : {}",
        zho.normalize("北京的秋天特别美丽，长城非常壮观！")?
    );
    // → 北京的秋天特别美丽 , 长城非常壮观 !

    // ────────────────────────────────────────────────────────────────
    // CHINESE (Simplified, China) - fullwidth & word segmentation with CJK unigrams
    // ────────────────────────────────────────────────────────────────
    let zho = Normy::builder()
        .lang(ZHO)
        .modify_lang(|le| le.set_unigram_cjk(true))
        .add_stage(UnifyWidth)
        .add_stage(COLLAPSE_WHITESPACE_UNICODE)
        .add_stage(SegmentWords) // unigram segmentation
        .build();
    println!(
        "Chinese(unigram cjk) : {}",
        zho.normalize("北京的秋天特别美丽，长城非常壮观！")?
    );
    // → 北 京 的 秋 天 特 别 美 丽 , 长 城 非 常 壮 观 !

    // ────────────────────────────────────────────────────────────────
    // JAPANESE (Japan) - script transitions + width unification
    // ────────────────────────────────────────────────────────────────
    let jpn = Normy::builder()
        .lang(JPN)
        .add_stage(UnifyWidth)
        .add_stage(COLLAPSE_WHITESPACE_UNICODE)
        .add_stage(SegmentWords) // script boundary segmentation
        .build();
    println!(
        "Japanese: {}",
        jpn.normalize("東京は本当に素晴らしい街です！桜がとてもきれい。")?
    );
    // → 東京は本当に素晴らしい街です ! 桜がとてもきれい 。

    Ok(())
}
```
When text is already normalized, Normy returns `Cow::Borrowed`: **zero allocation**.
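The borrow-when-clean pattern can be illustrated with the standard library alone. The sketch below is not Normy's implementation; `lowercase_ascii` is a hypothetical helper that only shows the general idea of an early-exit check followed by allocation on the slow path.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of Normy): lowercases ASCII letters,
/// borrowing the input when it is already lowercase.
fn lowercase_ascii(input: &str) -> Cow<'_, str> {
    // Early-exit check: no ASCII uppercase byte means nothing to change.
    if !input.bytes().any(|b| b.is_ascii_uppercase()) {
        return Cow::Borrowed(input);
    }
    // Slow path: allocate once and transform.
    Cow::Owned(input.to_ascii_lowercase())
}

fn main() {
    let clean = lowercase_ascii("already clean");
    let dirty = lowercase_ascii("Needs Work");

    // Clean input is borrowed (no allocation); changed input is owned.
    assert!(matches!(clean, Cow::Borrowed(_)));
    assert!(matches!(dirty, Cow::Owned(_)));
    println!("{clean} | {dirty}");
}
```

Callers that only read the result can treat both variants as `&str`, so the borrowed fast path needs no special handling at the call site.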
## ✨ Features
| Feature | Description |
| --- | --- |
| **Zero-Copy** | No allocation on clean input |
| **Iterator Fusion** | Automatic speedup on 2+ fusable stages (monomorphized char iterators) |
| **Locale-Accurate** | Turkish İ/i, German ß→ss, Dutch Ĳ→ij, Arabic/Hebrew diacritics, etc. |
| **Format-Aware** | Safe HTML/Markdown stripping (preserves `<code>`, fences, attributes) |
| **Composable Pipelines** | Fluent builder + dynamic runtime stages |
| **Segmentation** | Word boundaries for CJK, Indic, Thai, Khmer, etc. (ZWSP insertion) |
| **Extensible** | Implement custom transformation stages |
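To make the locale-accuracy point concrete, here is a minimal sketch of why Turkish needs its own lowercasing rules. It uses only the standard library; `lowercase_turkish` is a hypothetical helper written for illustration and is not Normy's API or implementation.

```rust
/// Hypothetical helper (not Normy's API): Turkish-aware lowercasing.
fn lowercase_turkish(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            'I' => out.push('\u{0131}'),        // I → ı (dotless)
            '\u{0130}' => out.push('i'),        // İ → i (dotted)
            _ => out.extend(c.to_lowercase()),  // default Unicode mapping
        }
    }
    out
}

fn main() {
    // Locale-agnostic lowercasing maps 'I' to 'i', which is wrong for Turkish.
    assert_eq!("KIZILIRMAK".to_lowercase(), "kizilirmak");
    // The Turkish-aware version keeps the dotless/dotted distinction.
    assert_eq!(lowercase_turkish("KIZILIRMAK"), "kızılırmak");
    assert_eq!(lowercase_turkish("TÜRKİYE"), "türkiye");
}
```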
## Available Normalization Stages
Normy provides a rich set of composable, high-performance normalization stages.
Most stages support **static iterator fusion** for maximum speed (single-pass, zero-copy when possible).
| Stage | Description | Fusion |
| --- | --- | --- |
| `CaseFold` | Locale-aware case folding (German ß→ss, etc.) | Yes |
| `LowerCase` | Locale-aware lowercasing (Turkish İ→i) | Yes |
| `RemoveDiacritics` | Removes combining/spacing diacritics (accents, tone marks, etc.) | Yes |
| `Transliterate` | Language-specific character substitutions (Ä→ae, Ю→ju, etc.) | Yes |
| `NormalizePunctuation` | Normalizes dashes, quotes, ellipsis, bullets, etc. to standard forms | Yes |
| `UnifyWidth` | Converts fullwidth → halfwidth (critical for CJK compatibility) | Yes |
| `SegmentWords` | Inserts spaces at word/script boundaries (CJK unigram, Indic virama, etc.) | Yes |
| `StripControlChars` | Removes all control characters (Unicode Cc category) | Yes |
| `StripFormatControls` | Removes directional marks, joiners, ZWSP, invisible operators, etc. | Yes |
| **Whitespace Variants** | | |
| • `COLLAPSE_WHITESPACE` | Collapse consecutive ASCII whitespace → single space | Yes |
| • `COLLAPSE_WHITESPACE_UNICODE` | Collapse all Unicode whitespace → single space | Yes |
| • `NORMALIZE_WHITESPACE_FULL` | Normalize + collapse + trim all Unicode whitespace | Yes |
| • `TRIM_WHITESPACE` | Trim leading/trailing ASCII whitespace only | Yes |
| • `TRIM_WHITESPACE_UNICODE` | Trim leading/trailing Unicode whitespace | Yes |
| **Normalization Forms** | | |
| • `NFC` | Unicode canonical composed form (most compact, W3C recommended) | **No** |
| • `NFD` | Unicode canonical decomposed form | **No** |
| • `NFKC` | Unicode compatibility composed (lossy, e.g. ﬁ→fi, ℃→°C) | **No** |
| • `NFKD` | Unicode compatibility decomposed | **No** |
| `StripHtml` | Strips HTML tags and decodes entities (format-aware) | **No** |
| `StripMarkdown` | Removes Markdown formatting while preserving content | **No** |
Key notes
- **Fusion** = static single-pass iterator fusion (zero-copy + minimal allocation when conditions met)
- Non-fusable stages (`NFC`/`NFD`/`NFKC`/`NFKD`, `StripHtml`, `StripMarkdown`) use optimized batch processing and should usually be placed early in the pipeline
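The idea behind single-pass fusion can be sketched with plain standard-library iterators. This is an illustration of the general technique only, not Normy's internals; `staged` and `fused` are hypothetical names, and the two stages (lowercasing and control-character removal) are stand-ins chosen for brevity.

```rust
/// Unfused: each stage walks the text and allocates its own intermediate String.
fn staged(input: &str) -> String {
    let lowered: String = input.chars().flat_map(char::to_lowercase).collect();
    let stripped: String = lowered.chars().filter(|c| !c.is_control()).collect();
    stripped
}

/// Fused: the same two stages chained as iterator adapters, so the text is
/// walked once and only the final String is allocated.
fn fused(input: &str) -> String {
    input
        .chars()
        .flat_map(char::to_lowercase)
        .filter(|c| !c.is_control())
        .collect()
}

fn main() {
    let text = "Hello\u{0007} World";
    // Both versions produce the same text; only the number of passes differs.
    assert_eq!(staged(text), fused(text));
    println!("{}", fused(text)); // prints "hello world"
}
```

Because the adapter chain is built from concrete types, the compiler can monomorphize and inline it, which is where the cache-locality benefit of a fused pass comes from.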
## Supported Languages
| Language | Code | Features |
| --- | --- | --- |
| **European** | | |
| Turkish | `TUR` | Custom case rules (İ/i, I/ı) |
| German | `DEU` | ß folding, umlaut transliteration |
| Dutch | `NLD` | IJ digraph folding |
| Danish | `DAN` | Å/Æ/Ø transliteration |
| Norwegian | `NOR` | Å/Æ/Ø transliteration |
| Swedish | `SWE` | Å/Ä/Ö transliteration |
| Icelandic | `ISL` | Þ/Ð/Æ transliteration |
| French | `FRA` | Œ/Æ ligatures, accent handling |
| Spanish | `SPA` | Accent normalization |
| Portuguese | `POR` | Comprehensive diacritics |
| Italian | `ITA` | Grave/acute accents |
| Catalan | `CAT` | Ç transliteration |
| Czech | `CES` | Háček preservation, selective stripping |
| Slovak | `SLK` | Caron handling |
| Polish | `POL` | Ogonek & acute accents |
| Croatian | `HRV` | Digraph normalization |
| Serbian | `SRP` | Cyrillic diacritics |
| Lithuanian | `LIT` | Dot-above vowels |
| Greek | `ELL` | Polytonic diacritics (6 types) |
| Russian | `RUS` | Cyrillic→Latin transliteration |
| **Middle Eastern** | | |
| Arabic | `ARA` | 15 diacritic types (tashkeel) |
| Hebrew | `HEB` | 20 vowel points (nikud) |
| **Asian** | | |
| Vietnamese | `VIE` | Tone marks (5 tones × vowels) |
| Chinese | `ZHO` | Word segmentation (optional CJK unigram) |
| Japanese | `JPN` | Word segmentation |
| Korean | `KOR` | Word segmentation |
| Thai | `THA` | Tone marks, word segmentation |
| Lao | `LAO` | 15 combining marks, segmentation |
| Khmer | `KHM` | 30+ combining marks, segmentation |
| Myanmar | `MYA` | 17 combining marks, segmentation |
| **South Asian** | | |
| Hindi | `HIN` | Devanagari diacritics, segmentation |
| Bengali | `BEN` | Bengali diacritics, segmentation |
| Tamil | `TAM` | Tamil diacritics, segmentation |
| **Other** | | |
| English | `ENG` | Default/baseline |
**Features Key:**
- **Word Segmentation**: Automatic boundary detection for non-space-delimited scripts
- **CJK Unigram**: Optional character-level tokenization for Chinese ideographs
- **Transliteration**: Script→Latin conversion (e.g., Cyrillic, ligatures)
- **Diacritics**: Intelligent spacing/combining mark handling
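As a rough illustration of what CJK unigram segmentation means, the sketch below separates adjacent ideographs with a plain space. It is not Normy's implementation: `is_cjk_ideograph` and `unigram_cjk` are hypothetical helpers, the range check covers only the main CJK Unified Ideographs block, and the choice of a plain space as separator is an assumption made for readability.

```rust
/// Rough check for the main CJK Unified Ideographs block (U+4E00..=U+9FFF).
/// Assumption: extension blocks are ignored to keep the sketch short.
fn is_cjk_ideograph(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hypothetical helper (not Normy's API): inserts a space between adjacent ideographs.
fn unigram_cjk(input: &str) -> String {
    let mut out = String::with_capacity(input.len() * 2);
    let mut prev_was_cjk = false;
    for c in input.chars() {
        let cjk = is_cjk_ideograph(c);
        // Only split when two ideographs are directly adjacent.
        if cjk && prev_was_cjk {
            out.push(' ');
        }
        out.push(c);
        prev_was_cjk = cjk;
    }
    out
}

fn main() {
    assert_eq!(unigram_cjk("北京的秋天"), "北 京 的 秋 天");
    // Non-CJK text passes through unchanged.
    assert_eq!(unigram_cjk("hello"), "hello");
}
```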
## Documentation
- Full API docs: [docs.rs/normy](https://docs.rs/normy)
- Linguistic rules: `LINGUISTIC_POLICY.md`
- Pipeline guidelines: `PIPELINE_GUIDELINES.md`
- Examples are in the `examples/` directory
- Generate local docs:
```bash
cargo doc --open
```
---
## 🤝 Contributing
Contributions are very welcome! See `CONTRIBUTING.md`.
---
## License
Dual-licensed under **MIT** or **Apache-2.0**, at your option.
See [LICENSE-MIT](LICENSE-MIT) and [LICENSE-APACHE](LICENSE-APACHE).
---
**Normy**: ultra-fast, linguistically correct normalization, the next-generation layer for Rust NLP & tokenizers