๐งน newline-normalizer ๐งน
Rust crate for normalizing text into Unix (\n) or DOS (\r\n) newline formats, using fast SIMD search and zero-copy when possible.
โจ Features
- Adds extension traits to
strโ call.to_unix_newlines()and.to_dos_newlines()directly. - Preserves input with
Cow<str>โ skips allocation if no changes are needed. - Converts
\rand\r\ninto consistent Unix (\n) or DOS (\r\n) newlines. - Unicode-safe โ preserves all characters without loss.
- Fast scanning with memchr and SIMD.
๐ Examples
use ;
let unix = "line1\r\nline2\rline3".to_unix_newlines;
assert_eq!;
let dos = "line1\nline2\nline3".to_dos_newlines;
assert_eq!;
๐ Benchmark
Benchmarks are in the /benches folder.
Run them using:
cargo bench --bench to_unix
cargo bench --bench to_dos
All suggestions on how to improve the benchmarks are welcome.
๐ Results
Hardware: AMD Ryzen 9 9900X 12-Core Processor with 64 GB RAM.
Normalizing to DOS newlines (\r\n):
| Case | newline-converter |
This crate (newline_normalizer) |
|---|---|---|
| Small Unicode paragraph | ~685.46 ns | ~88.789 ns ๐ |
| Small Unicode paragraph pre-normalized | ~151.39 ns | ~58.350 ns ๐ |
| The Adventures of Sherlock Holmes (608kb) | ~345.27 ยตs | ~138.26 ยตs ๐ |
| The Adventures of Sherlock Holmes (608kb) pre-normalized | ~342.91 ยตs | ~137.54 ยตs ๐ |
Note: Pre-normalized means the input already has correct line endings and does not require changes.
Normalizing to Unix newlines (\n):
| Case | newline-converter |
std replace chain |
This crate (newline_normalizer) |
|---|---|---|---|
| Small Unicode paragraph | ~1.1009 ยตs | ~140.72 ns | ~24.464 ns ๐ |
| Small Unicode paragraph pre-normalized | ~203.24 ns | ~109.17 ns | ~4.6608 ns ๐ |
| The Adventures of Sherlock Holmes (608kb) | ~779.06 ยตs | ~213.23 ยตs | ~89.150 ยตs ๐ |
| The Adventures of Sherlock Holmes (608kb) pre-normalized | ~365.45 ยตs | ~137.74 ยตs | ~2.7538 ยตs ๐ |
Benchmark result notes
- Pre-normalized means the input text already uses the correct line endings.
- In such cases,
newline_normalizercan skip allocations and return a borrowed reference. - Extremely low latency (e.g., ~4.66 ns) is achieved by using
Cow::Borrowed, avoiding an allocation of a new string when the input does not change.
๐ค Unicode behavior
This crate does not alter Unicode content. It only rewrites newline boundaries.
All valid UTF-8 sequences are preserved, including:
- Combining characters stay attached
- Emoji and multi-codepoint sequences remain valid
- Right-to-left (RTL) markers are unaffected
โ ๏ธ Limitations
This crate does not currently normalize U+2028 (LINE SEPARATOR) or U+2029 (PARA SEP). Only ASCII newline formats are converted.
๐ Licensed under MIT
This project is licensed under the MIT License.