normalized_path/lib.rs
//! Opinionated cross-platform, optionally case-insensitive path normalization.
//!
//! This crate provides [`PathElementCS`] (case-sensitive), [`PathElementCI`]
//! (case-insensitive), and [`PathElement`] (runtime-selected) -- types that take a
//! raw path element name, validate it, normalize it to a canonical form, and compute
//! an OS-compatible presentation form.
//!
//! # Design goals and non-goals
//!
//! **Goals:**
//!
//! - The normalization procedure is identical on every platform -- the same input
//!   always produces the same normalized bytes regardless of the host OS.
//! - If any supported OS considers two names equivalent (e.g. NFC vs NFD on macOS),
//!   they must normalize to the same value.
//! - The normalized form is always in NFC (Unicode Normalization Form C), the
//!   most widely used and compact canonical form.
//! - Normalization is idempotent: normalizing an already-normalized name
//!   returns it unchanged.
//! - The OS-compatible form of a name, when normalized again, produces the same
//!   normalized value as the original input (round-trip stability).
//! - Every valid name is representable on every supported OS. Characters that
//!   would be rejected or silently altered (Windows forbidden characters, C0 controls)
//!   are mapped to visually similar safe alternatives.
//! - If the OS automatically transforms a name (e.g. NFC↔NFD conversion,
//!   truncation at null bytes), normalizing the transformed name produces the
//!   same result as normalizing the original.
//! - In case-insensitive mode, names differing only in case normalize identically,
//!   with correct handling of edge cases like Turkish dotted/dotless I.
//!
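//! The idempotence and round-trip goals above can be phrased as executable
//! properties. The following is a minimal sketch using only `std`; the helper
//! names (`is_idempotent`, `round_trips`) and the toy functions `toy_normalize`
//! and `toy_os_compatible` are hypothetical stand-ins and not part of this crate.
//!
//! ```rust
//! /// True if normalizing an already-normalized value leaves it unchanged.
//! fn is_idempotent(normalize: impl Fn(&str) -> String, input: &str) -> bool {
//!     let once = normalize(input);
//!     normalize(&once) == once
//! }
//!
//! /// True if normalizing the OS-compatible form reproduces the normalized value.
//! fn round_trips(
//!     normalize: impl Fn(&str) -> String,
//!     os_compatible: impl Fn(&str) -> String,
//!     input: &str,
//! ) -> bool {
//!     let normalized = normalize(input);
//!     normalize(&os_compatible(&normalized)) == normalized
//! }
//!
//! /// Toy normalizer (assumption): trims whitespace and maps fullwidth colon to `:`.
//! fn toy_normalize(name: &str) -> String {
//!     name.trim()
//!         .chars()
//!         .map(|c| if c == '\u{FF1A}' { ':' } else { c })
//!         .collect()
//! }
//!
//! /// Toy OS mapping (assumption): replaces `:` with fullwidth colon U+FF1A.
//! fn toy_os_compatible(normalized: &str) -> String {
//!     normalized
//!         .chars()
//!         .map(|c| if c == ':' { '\u{FF1A}' } else { c })
//!         .collect()
//! }
//! ```
//!
//! Both properties hold for the toy pair on an input such as `" 10:30 "`; the
//! same two checks could be pointed at any normalizer under test.
//!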
//! **Non-goals:**
//!
//! - Not every name that a particular OS accepts is considered valid. Non-UTF-8
//!   byte sequences, names that normalize to empty (e.g. whitespace-only), and
//!   names that normalize to `.` or `..` (e.g. `" .. "`) are always rejected.
//! - A name taken directly from the OS may produce a different OS-compatible form
//!   after normalization. For example, a file named `" hello.txt"` (leading space)
//!   will have the space trimmed, so its OS-compatible form is `"hello.txt"`.
//! - The OS-compatible form is not guaranteed to be accepted by the OS. For
//!   example, it may exceed the OS's path element length limit, or on Apple
//!   platforms the filesystem may require names in Unicode Stream-Safe Text Format
//!   which the OS-compatible form does not enforce.
//! - Windows 8.3 short file names (e.g. `PROGRA~1`) are not handled.
//! - Visually similar names are not necessarily considered equal. For example,
//!   a regular space (U+0020) and a non-breaking space (U+00A0) produce different
//!   normalized forms despite looking identical, and the ligature `ﬁ` (U+FB01) is
//!   distinct from the two-character sequence `fi`.
//! - Fullwidth and ASCII variants of the same character (e.g. `Ａ` (U+FF21) vs
//!   `A` (U+0041)) are deliberately normalized to the same form. Users who need
//!   to distinguish them cannot use this crate.
//! - In case-insensitive mode, Turkish İ (U+0130), dotless ı (U+0131), and
//!   ASCII I/i are all deliberately normalized to the same form. Users who
//!   need to distinguish them cannot use case-insensitive mode.
//! - Path separators and multi-component paths are not handled. This crate
//!   operates on a single path element (one name between separators). Support
//!   for full paths may be added in a future version.
//! - Android versions before 6 (API level 23) are not supported. Earlier
//!   versions used Java Modified UTF-8 for filesystem paths, encoding
//!   supplementary characters as CESU-8 surrogate pairs.
//!
//! # Normalization pipeline
//!
//! Every path element name goes through the following steps during construction:
//!
//! 0. **Byte decoding** (only for `from_bytes`/`from_os_str`) --
//!    [`String::from_utf8_lossy()`] is applied, replacing invalid byte sequences
//!    with U+FFFD. Invalid bytes can be encountered on Unix filesystems, which
//!    allow arbitrary bytes except `/` and `\0` in names, and on Windows, where
//!    filenames are WTF-16 and may contain unpaired surrogates.
//!
//! 1. **NFD decomposition** -- canonical decomposition to reorder combining marks.
//!    This is needed because macOS stores filenames in a form close to NFD, so an
//!    NFD input and an NFC input must produce the same result. Decomposing first
//!    ensures combining marks are in canonical order before subsequent steps.
//!
//! 2. **Whitespace trimming** -- strips leading and trailing characters with the Unicode
//!    `White_Space` property, plus the BOM (U+FEFF) and Control Pictures that correspond
//!    to whitespace control characters (U+2409--U+240D: HT, LF, VT, FF, CR).
//!    Many applications strip leading/trailing whitespace silently, and macOS
//!    automatically strips leading BOMs. Control Pictures are
//!    included because they are the mapped form of whitespace control characters
//!    (see step 4), so trimming must be consistent before and after mapping.
//!
//! 3. **Fullwidth-to-ASCII mapping** -- maps fullwidth forms (U+FF01--U+FF5E) to their
//!    ASCII equivalents (U+0021--U+007E). The Windows OS-compatibility step (see below)
//!    maps certain ASCII characters to fullwidth to avoid Windows restrictions. This
//!    step ensures that the OS-compatible form normalizes back to the same value.
//!
//! 4. **Control character mapping** -- maps C0 controls (U+0001--U+001F) and DEL (U+007F)
//!    to their Unicode Control Picture equivalents (U+2401--U+241F, U+2421). Control
//!    characters are invisible, can break terminals and tools, and some OSes reject
//!    or silently drop them. Mapping to visible Control Pictures preserves the
//!    information while making the name safe. (Null bytes are excluded; see step 5.)
//!
//! 5. **Validation** -- rejects empty strings, `.`, `..`, names containing `/`, and
//!    names containing null bytes (`\0`). These are universally special on all OSes
//!    and cannot be used as regular names.
//!
//! 6. **NFC composition** -- canonical composition to produce the shortest equivalent
//!    form.
//!
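//! Steps 0 and 2 through 5 can be approximated with `std` alone. The following
//! is a rough sketch, not the crate's implementation: all function names are
//! hypothetical, and steps 1 and 6 (NFD/NFC) are omitted because `std` has no
//! Unicode normalization.
//!
//! ```rust
//! /// Step 0: lossy UTF-8 decoding (invalid sequences become U+FFFD).
//! fn decode_lossy(bytes: &[u8]) -> String {
//!     String::from_utf8_lossy(bytes).into_owned()
//! }
//!
//! /// Step 2: characters trimmed from both ends of the name.
//! fn is_trimmable(c: char) -> bool {
//!     // `char::is_whitespace` follows the Unicode `White_Space` property.
//!     c.is_whitespace() || c == '\u{FEFF}' || ('\u{2409}'..='\u{240D}').contains(&c)
//! }
//!
//! /// Step 3: fullwidth forms U+FF01..=U+FF5E map to ASCII U+0021..=U+007E.
//! fn map_fullwidth_char(c: char) -> char {
//!     match c {
//!         '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
//!         _ => c,
//!     }
//! }
//!
//! /// Step 4: C0 controls U+0001..=U+001F map to U+2401..=U+241F, DEL to U+2421.
//! /// NUL is left alone so that step 5 can reject it.
//! fn map_control_char(c: char) -> char {
//!     match c {
//!         '\u{0001}'..='\u{001F}' => char::from_u32(c as u32 + 0x2400).unwrap_or(c),
//!         '\u{007F}' => '\u{2421}',
//!         _ => c,
//!     }
//! }
//!
//! /// Steps 2-5 combined; `None` marks names the real pipeline would reject.
//! fn normalize_sketch(name: &str) -> Option<String> {
//!     let trimmed = name.trim_matches(is_trimmable);
//!     let mapped: String = trimmed
//!         .chars()
//!         .map(map_fullwidth_char)
//!         .map(map_control_char)
//!         .collect();
//!     // Step 5: reject empty names, `.`, `..`, and names containing `/` or NUL.
//!     let invalid = mapped.is_empty()
//!         || mapped == "."
//!         || mapped == ".."
//!         || mapped.chars().any(|c| c == '/' || c == '\0');
//!     if invalid { None } else { Some(mapped) }
//! }
//! ```
//!
//! Under these assumptions, `normalize_sketch(" .. ")` and `normalize_sketch("a/b")`
//! return `None`, while a BEL character inside a name is kept as U+2407.
//!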
//! In **case-insensitive** mode, four additional steps are applied after the above:
//!
//! 7. **NFD decomposition** (again, on the NFC result). Steps 7, 8, and 10
//!    implement the Unicode canonical caseless matching algorithm (Definition D145):
//!    *"A string X is a canonical caseless match for a string Y if and only if:
//!    NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))"*. Step 10 composes to NFC
//!    instead of the definition's outer NFD; this yields the same matches, because
//!    two strings have equal NFC forms exactly when they have equal NFD forms.
//!    Step 9 extends this with a post-case-fold fixup for Turkish/Azerbaijani and
//!    Lithuanian casing.
//!
//! 8. **Unicode `toCasefold()`** -- locale-independent full case folding.
//!
//! 9. **Post-case-fold fixup** -- maps U+0131 (ı) to ASCII i, and strips
//!    U+0307 COMBINING DOT ABOVE when it follows a `Soft_Dotted` character
//!    (e.g. i, j, Cyrillic і/ј) with no intervening starter or combining mark of
//!    class 230 (Above), matching the Unicode `After_Soft_Dotted` condition.
//!    This neutralizes two locale-specific casing inconsistencies that
//!    `toCasefold()` alone misses:
//!    - **Turkish/Azerbaijani:** `toCasefold()` treats ı as distinct from i
//!      (ı folds to itself), yet `toUppercase(ı)` = I even without locale
//!      tailoring, and I folds back to i -- creating a collision.
//!    - **Lithuanian:** lowercase adds U+0307 after capital I/J/Į when more
//!      accents are above, and upper/titlecase removes U+0307 after soft-dotted
//!      characters; stripping it ensures stability under Lithuanian casing.
//!
//! 10. **NFC composition** (final) -- recompose after case folding to produce the
//!     canonical NFC output.
//!
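//! The Turkish collision described in step 9 can be reproduced with `std` alone.
//! The sketch below is a deliberately simplified illustration, not the crate's
//! algorithm: `char::to_lowercase` stands in for `toCasefold()` (the two differ
//! for some characters, such as ß), only ASCII `i`/`j` are treated as
//! soft-dotted, and U+0307 is stripped only when it immediately follows one of
//! them. The function name `fold_and_fixup_sketch` is hypothetical.
//!
//! ```rust
//! /// Lowercases `input`, maps dotless ı (U+0131) to `i`, and drops a
//! /// U+0307 COMBINING DOT ABOVE that directly follows `i` or `j`.
//! fn fold_and_fixup_sketch(input: &str) -> String {
//!     let mut out = String::new();
//!     let mut prev_soft_dotted = false;
//!     for c in input.chars().flat_map(char::to_lowercase) {
//!         match c {
//!             '\u{0131}' => {
//!                 out.push('i');
//!                 prev_soft_dotted = true;
//!             }
//!             // Drop the dot only when it directly follows i or j.
//!             '\u{0307}' if prev_soft_dotted => prev_soft_dotted = false,
//!             _ => {
//!                 out.push(c);
//!                 prev_soft_dotted = matches!(c, 'i' | 'j');
//!             }
//!         }
//!     }
//!     out
//! }
//! ```
//!
//! With this sketch, `"\u{0130}"`, `"\u{0131}"`, `"I"`, and `"i"` all map to
//! `"i"`. Without the fixup, lowercasing İ yields `i` followed by U+0307, and
//! ı stays distinct from `i` even though `"\u{0131}".to_uppercase()` is `"I"`.
//!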
//! # OS compatibility mapping
//!
//! Each `PathElementGeneric` also computes an **OS-compatible** form suitable for
//! use as an actual path element name on the host operating system. It is derived
//! from the case-sensitive normalized form, by applying the following additional
//! steps:
//!
//! - **Windows**: the characters and patterns listed in the Windows
//!   [naming conventions](https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions)
//!   are handled by mapping them to visually similar fullwidth Unicode equivalents:
//!   forbidden characters (`< > : " \ | ? *`), the final trailing dot, and the first
//!   character of reserved device names (CON, PRN, AUX, NUL, COM0--COM9, LPT0--LPT9,
//!   and their superscript-digit variants).
//! - **Apple (macOS/iOS)**: converted using [`CFStringGetFileSystemRepresentation`](https://developer.apple.com/documentation/corefoundation/cfstringgetfilesystemrepresentation(_:_:_:))
//!   as recommended by Apple's documentation (produces a representation similar to NFD).
//! - **Other platforms**: the OS-compatible form is identical to the case-sensitive
//!   normalized form.
//!
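//! The Windows character mapping can be sketched with `std` alone. This is an
//! illustration under stated assumptions, not the crate's implementation: the
//! function name `windows_compatible_sketch` is hypothetical, the trailing dot
//! is assumed to map to U+FF0E FULLWIDTH FULL STOP, and reserved device names
//! are deliberately not handled.
//!
//! ```rust
//! /// Maps Windows-forbidden characters to their fullwidth counterparts and
//! /// replaces a final trailing dot with U+FF0E (assumed target code point).
//! fn windows_compatible_sketch(normalized: &str) -> String {
//!     let mut out: String = normalized
//!         .chars()
//!         .map(|c| match c {
//!             // Each fullwidth form sits exactly 0xFEE0 above its ASCII original.
//!             '<' | '>' | ':' | '"' | '\\' | '|' | '?' | '*' => {
//!                 char::from_u32(c as u32 + 0xFEE0).unwrap_or(c)
//!             }
//!             _ => c,
//!         })
//!         .collect();
//!     // Only the final dot is replaced; earlier dots are left untouched.
//!     if out.ends_with('.') {
//!         out.pop();
//!         out.push('\u{FF0E}');
//!     }
//!     out
//! }
//! ```
//!
//! Because every replacement lies in U+FF01--U+FF5E, step 3 of the normalization
//! pipeline maps it back to the original ASCII character, which is what makes
//! the round trip stable.
//!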
//! # Types
//!
//! The core type is [`PathElementGeneric<'a, S>`], parameterized by a case-sensitivity
//! marker `S`:
//!
//! - [`PathElementCS`] = `PathElementGeneric<'a, CaseSensitive>` -- compile-time
//!   case-sensitive path element.
//! - [`PathElementCI`] = `PathElementGeneric<'a, CaseInsensitive>` -- compile-time
//!   case-insensitive path element.
//! - [`PathElement`] = `PathElementGeneric<'a, CaseSensitivity>` -- runtime-selected
//!   case sensitivity via the [`CaseSensitivity`] enum.
//!
//! Use the typed aliases ([`PathElementCS`], [`PathElementCI`]) when the case sensitivity
//! is known at compile time. These implement [`Hash`](core::hash::Hash), which the
//! runtime-dynamic [`PathElement`] does not (since hashing elements with different
//! sensitivities into the same map would violate hash/eq consistency).
//!
//! The zero-sized marker structs [`CaseSensitive`] and [`CaseInsensitive`] are used as
//! type parameters, while the [`CaseSensitivity`] enum provides the same choice at runtime.
//! All three types implement `Into<CaseSensitivity>`.
//!
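//! The marker-plus-enum pattern described above can be illustrated in isolation.
//! The following toy sketch uses hypothetical names (`ToySensitivity`,
//! `ToySensitive`, `ToyInsensitive`, `toy_key`) and ASCII lowercasing in place of
//! real case folding; it only shows how one function can accept either a
//! zero-sized marker or the runtime enum through `Into`.
//!
//! ```rust
//! /// Runtime choice between the two modes.
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! enum ToySensitivity {
//!     Sensitive,
//!     Insensitive,
//! }
//!
//! /// Zero-sized compile-time markers.
//! #[derive(Clone, Copy, Debug, Default)]
//! struct ToySensitive;
//! #[derive(Clone, Copy, Debug, Default)]
//! struct ToyInsensitive;
//!
//! impl From<ToySensitive> for ToySensitivity {
//!     fn from(_: ToySensitive) -> Self {
//!         ToySensitivity::Sensitive
//!     }
//! }
//!
//! impl From<ToyInsensitive> for ToySensitivity {
//!     fn from(_: ToyInsensitive) -> Self {
//!         ToySensitivity::Insensitive
//!     }
//! }
//!
//! /// Accepts a marker struct or the enum itself (via the reflexive `Into`).
//! fn toy_key(name: &str, sensitivity: impl Into<ToySensitivity>) -> String {
//!     match sensitivity.into() {
//!         ToySensitivity::Sensitive => name.to_owned(),
//!         // ASCII-only lowercasing stands in for real case folding.
//!         ToySensitivity::Insensitive => name.to_ascii_lowercase(),
//!     }
//! }
//! ```
//!
//! Here `toy_key("ReadMe", ToyInsensitive)` and
//! `toy_key("ReadMe", ToySensitivity::Insensitive)` both yield `"readme"`.
//!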
//! # Examples
//!
//! ```
//! # use normalized_path::{PathElementCS, PathElementCI};
//! // NFD input (e + combining acute) composes to NFC (é), whitespace is trimmed
//! let pe = PathElementCS::new(" cafe\u{0301}.txt ")?;
//! assert_eq!(pe.original(), " cafe\u{0301}.txt ");
//! assert_eq!(pe.normalized(), "caf\u{00E9}.txt");
//!
//! // Case-insensitive: German ß case-folds to "ss"
//! let pe = PathElementCI::new("Stra\u{00DF}e.txt")?;
//! assert_eq!(pe.original(), "Stra\u{00DF}e.txt");
//! assert_eq!(pe.normalized(), "strasse.txt");
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The OS-compatible form adapts names for the host filesystem. On Windows,
//! forbidden characters and reserved device names are mapped to safe alternatives;
//! on Apple, names are converted to a form close to NFD:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // A name with a Windows-forbidden character and an accented letter
//! let pe = PathElementCS::new("caf\u{00E9} 10:30")?;
//! assert_eq!(pe.normalized(), "caf\u{00E9} 10:30");
//!
//! #[cfg(target_os = "windows")]
//! assert_eq!(pe.os_compatible(), "caf\u{00E9} 10\u{FF1A}30"); // : → fullwidth :
//!
//! #[cfg(target_vendor = "apple")]
//! assert_eq!(pe.os_compatible(), "cafe\u{0301} 10:30"); // NFC → NFD
//!
//! #[cfg(not(any(target_os = "windows", target_vendor = "apple")))]
//! assert_eq!(pe.os_compatible(), pe.normalized()); // unchanged
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! Equality is based on the normalized form, so different originals can compare equal:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // NFD (e + combining acute) and NFC (é) normalize to the same form
//! let a = PathElementCS::new("cafe\u{0301}.txt")?;
//! let b = PathElementCS::new("caf\u{00E9}.txt")?;
//! assert_eq!(a, b);
//! assert_ne!(a.original(), b.original());
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The typed variants implement [`Hash`](core::hash::Hash), so they work in
//! both hash-based and ordered collections:
//!
//! ```
//! # use std::collections::{BTreeSet, HashSet};
//! # use normalized_path::PathElementCI;
//! // Turkish İ, dotless ı, ASCII I, and ASCII i all normalize to the same CI form
//! let names = ["\u{0130}.txt", "\u{0131}.txt", "I.txt", "i.txt"];
//! let set: HashSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(set.len(), 1);
//!
//! let tree: BTreeSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(tree.len(), 1);
//! ```
//!
//! The runtime-dynamic [`PathElement`] works in ordered collections too, but
//! comparing or ordering elements with **different** case sensitivities will panic:
//!
//! ```
//! # use std::collections::BTreeSet;
//! # use normalized_path::{PathElement, CaseSensitive, CaseInsensitive};
//! // "ss", "SS", "sS", "Ss", sharp s (ß), capital sharp s (ẞ)
//! let names = ["ss", "SS", "sS", "Ss", "\u{00DF}", "\u{1E9E}"];
//!
//! let cs: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseSensitive).unwrap())
//!     .collect();
//! assert_eq!(cs.len(), 6); // case-sensitive: all distinct
//!
//! let ci: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseInsensitive).unwrap())
//!     .collect();
//! assert_eq!(ci.len(), 1); // case-insensitive: all normalize to "ss"
//! ```
//!
//! # Unicode version
//!
//! All Unicode operations (NFC, NFD, case folding, property lookups) use
//! **Unicode 17.0.0**. The Unicode version is considered part of the crate's
//! stability contract: it will only be updated in a semver-breaking release to
//! ensure that normalization results are consistent across all compatible versions.
//!
//! # `no_std` support
//!
//! This crate supports `no_std` environments. Disable the default `std` feature:
//!
//! ```toml
//! [dependencies]
//! normalized-path = { version = "...", default-features = false }
//! ```
//!
//! The `std` feature enables `from_os_str` constructors and
//! `os_str`/`into_os_str` accessors. The `alloc` crate is always required.

#![cfg_attr(not(feature = "std"), no_std)]
#![cfg_attr(docsrs, feature(doc_cfg))]
#![warn(clippy::all, clippy::pedantic)]

extern crate alloc;

mod case_sensitivity;
mod error;
mod normalize;
mod os;
mod path_element;
mod unicode;
mod utils;

pub use case_sensitivity::{CaseInsensitive, CaseSensitive, CaseSensitivity};
pub use error::{Error, ErrorKind, Result};
pub use path_element::{PathElement, PathElementCI, PathElementCS, PathElementGeneric};

#[cfg(any(feature = "__test", test))]
pub mod test_helpers {
    pub use crate::error::ResultKind;
    pub use crate::normalize::{
        fixup_case_fold, is_whitespace_like, map_control_chars, map_fullwidth,
        normalize_ci_from_normalized_cs, normalize_cs, trim_whitespace_like, validate_path_element,
    };
    pub use crate::os::{
        apple_compatible_from_normalized_cs, apple_compatible_from_normalized_cs_fallback,
        is_reserved_on_windows, windows_compatible_from_normalized_cs,
    };
    pub use crate::unicode::{case_fold, is_starter, is_whitespace, nfc, nfd};
}