//! Opinionated cross-platform, optionally case-insensitive path normalization.
//!
//! This crate provides [`PathElementCS`] (case-sensitive), [`PathElementCI`]
//! (case-insensitive), and [`PathElement`] (runtime-selected) -- types that take a
//! raw path element name, validate it, normalize it to a canonical form, and compute
//! an OS-compatible presentation form.
//!
//! # Design goals and non-goals
//!
//! **Goals:**
//!
//! - The normalization procedure is identical on every platform -- the same input
//!   always produces the same normalized bytes regardless of the host OS.
//! - If any supported OS considers two names equivalent (e.g. NFC vs NFD on macOS),
//!   they must normalize to the same value.
//! - The normalized form is always in NFC (Unicode Normalization Form C), the
//!   most widely used and compact canonical form.
//! - Normalization is idempotent: normalizing an already-normalized name always
//!   produces the same name unchanged.
//! - The OS-compatible form of a name, when normalized again, produces the same
//!   normalized value as the original input (round-trip stability).
//! - Every valid name is representable on every supported OS.  Characters that
//!   would be rejected or silently altered (Windows forbidden characters) are
//!   mapped to visually similar safe alternatives.
//! - If the OS automatically transforms a name (e.g. NFC↔NFD conversion,
//!   truncation at null bytes), normalizing the transformed name produces the
//!   same result as normalizing the original.
//! - In case-insensitive mode, names differing only in case normalize identically,
//!   including edge cases from Turkish/Azerbaijani and Lithuanian casing rules
//!   (see step 8 below).
//!
//! **Non-goals:**
//!
//! - Not every name that a particular OS accepts is considered valid.  Non-UTF-8
//!   byte sequences, names that normalize to empty (e.g. whitespace-only), and
//!   names that normalize to `.` or `..` (e.g. `" .. "`) are always rejected.
//! - A name taken directly from the OS may produce a different OS-compatible form
//!   after normalization.  For example, a file named `" hello.txt"` (leading space)
//!   will have the space trimmed, so its OS-compatible form is `"hello.txt"`.
//! - The OS-compatible form is not guaranteed to be accepted by the OS.  For
//!   example, it may exceed the OS's path element length limit, or on Apple
//!   platforms the filesystem may require names in Unicode Stream-Safe Text Format
//!   which the OS-compatible form does not enforce.
//! - Windows 8.3 short file names (e.g. `PROGRA~1`) are not handled.
//! - Visually similar names are not necessarily considered equal.  For example,
//!   a regular space (U+0020) and a non-breaking space (U+00A0) produce different
//!   normalized forms despite looking identical, and the ligature `ﬁ` (U+FB01) is
//!   distinct from the two-character sequence `fi`.
//! - Fullwidth and ASCII variants of the same character (e.g. `Ａ` (U+FF21) vs `A`)
//!   are deliberately normalized to the same form.  Users who need to distinguish
//!   them cannot use this crate.
//! - In case-insensitive mode, Turkish İ (U+0130), dotless ı (U+0131), and
//!   ASCII I/i are all deliberately normalized to the same form.  Users who
//!   need to distinguish them cannot use case-insensitive mode.
//! - Invalid UTF-8 byte sequences in `from_bytes`/`from_os_str` are rejected.
//! - Names containing unassigned Unicode code points are rejected.  This makes it
//!   much more likely that normalization results for accepted names remain stable
//!   when upgrading to a future Unicode version (see [Unicode version](#unicode-version)).
//! - Path separators and multi-component paths are not handled.  This crate
//!   operates on a single path element (one name between separators).  Support
//!   for full paths may be added in a future version.
//! - Android versions before 6 (API level 23) are not supported.  Earlier
//!   versions used Java Modified UTF-8 for filesystem paths, encoding
//!   supplementary characters as CESU-8 surrogate pairs.
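//!
//! To illustrate the last point, here is a minimal sketch (standard library only,
//! not part of this crate's API) of why CESU-8 names cannot pass UTF-8 validation.
//! The helpers `cesu8_encode_supplementary` and `is_valid_utf8` are hypothetical
//! names introduced for this illustration.
//!
//! ```rust
//! /// Encodes a supplementary-plane character the way CESU-8 does: each UTF-16
//! /// surrogate becomes its own three-byte sequence.  Returns `None` for BMP
//! /// characters, which CESU-8 encodes exactly like UTF-8.
//! fn cesu8_encode_supplementary(c: char) -> Option<[u8; 6]> {
//!     let mut units = [0u16; 2];
//!     if c.encode_utf16(&mut units).len() != 2 {
//!         return None;
//!     }
//!     let mut out = [0u8; 6];
//!     for (i, &unit) in units.iter().enumerate() {
//!         // Three-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
//!         out[i * 3] = 0xE0 | (unit >> 12) as u8;
//!         out[i * 3 + 1] = 0x80 | ((unit >> 6) & 0x3F) as u8;
//!         out[i * 3 + 2] = 0x80 | (unit & 0x3F) as u8;
//!     }
//!     Some(out)
//! }
//!
//! /// Accepts only valid UTF-8, which excludes encoded surrogates.
//! fn is_valid_utf8(bytes: &[u8]) -> bool {
//!     core::str::from_utf8(bytes).is_ok()
//! }
//! ```
//!
//! For U+1F600 the sketch yields the bytes `ED A0 BD ED B8 80`, which
//! `core::str::from_utf8` rejects, whereas the regular UTF-8 encoding
//! `F0 9F 98 80` is accepted.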
//!
//! # Normalization pipeline
//!
//! Every path element name goes through the following steps during construction:
//!
//! 0. **UTF-8 validation** (only for `from_bytes`/`from_os_str`) --
//!    the input must be valid UTF-8; invalid byte sequences are rejected with
//!    [`ErrorKind::InvalidUtf8`].
//!
//! 1. **NFD decomposition** -- canonical decomposition to reorder combining marks.
//!    This is needed because macOS stores filenames in a form close to NFD, so an
//!    NFD input and an NFC input must produce the same result.  Decomposing first
//!    ensures combining marks are in canonical order before subsequent steps.
//!
//! 2. **Whitespace trimming** -- strips leading and trailing characters with the Unicode
//!    `White_Space` property (excluding control characters, which are rejected in step 4).
//!    Many applications strip leading/trailing whitespace silently.
//!
//! 3. **Fullwidth-to-ASCII mapping** -- maps fullwidth forms (U+FF01--U+FF5E) to their
//!    ASCII equivalents (U+0021--U+007E).  The Windows OS-compatibility step (see below)
//!    maps certain ASCII characters to fullwidth to avoid Windows restrictions.  This
//!    step ensures that the OS-compatible form normalizes back to the same value.
//!
//! 4. **Validation** -- rejects empty strings, `.`, `..`, names containing `/`,
//!    null bytes (`\0`), characters with the Unicode `Control` general category,
//!    BOM (U+FEFF), and unassigned Unicode characters.  Empty strings, `.`, `..`,
//!    `/`, and null bytes are special on every OS and cannot be used as regular
//!    names.  Control characters are invisible, can break
//!    terminals and tools, and some OSes reject or silently drop them.  Unassigned
//!    characters are rejected to ensure normalization stability across Unicode
//!    versions (see [Unicode version](#unicode-version)).
//!
//! 5. **NFC composition** -- canonical composition to produce the shortest equivalent
//!    form.
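//!
//! Steps 2 to 4 can be approximated with the standard library alone.  The sketch
//! below is an illustration, not this crate's implementation: the names
//! `map_fullwidth_sketch` and `normalize_sketch` are hypothetical, and NFD, NFC,
//! and the unassigned-character check are left out because they need Unicode data
//! tables that the standard library does not provide.
//!
//! ```rust
//! /// Step 3: fullwidth forms U+FF01..=U+FF5E map to ASCII U+0021..=U+007E
//! /// (a fixed offset of 0xFEE0); every other character is returned unchanged.
//! fn map_fullwidth_sketch(c: char) -> char {
//!     match c {
//!         '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
//!         _ => c,
//!     }
//! }
//!
//! /// Steps 2 to 4 only; the error strings are placeholders.
//! fn normalize_sketch(input: &str) -> Result<String, &'static str> {
//!     // Step 2: trim White_Space characters, but leave control characters in
//!     // place so that validation below can reject them.
//!     let trimmed = input.trim_matches(|c: char| c.is_whitespace() && !c.is_control());
//!     // Step 3: fullwidth-to-ASCII mapping.
//!     let mapped: String = trimmed.chars().map(map_fullwidth_sketch).collect();
//!     // Step 4: validation (without the unassigned-character check).
//!     if mapped.is_empty() {
//!         return Err("empty");
//!     }
//!     if mapped == "." || mapped == ".." {
//!         return Err("dot or dot-dot");
//!     }
//!     if mapped.chars().any(|c| c == '/' || c == '\u{FEFF}' || c.is_control()) {
//!         return Err("forbidden character");
//!     }
//!     Ok(mapped)
//! }
//! ```
//!
//! Under these assumptions, `normalize_sketch(" .. ")` is an error because the
//! trimmed name is `..`, and a leading tab is not trimmed but rejected as a
//! control character.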
//!
//! In **case-insensitive** mode, four additional steps are applied after the above:
//!
//! 6. **NFD decomposition** (again, on the NFC result).  Steps 6, 7, and 9
//!    implement the Unicode canonical caseless matching algorithm (Definition D145):
//!    *"A string X is a canonical caseless match for a string Y if and only if:
//!    NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))"*.  Step 8 extends this
//!    with a post-case-fold fixup for Turkish/Azerbaijani and Lithuanian casing.
//!
//! 7. **Unicode `toCasefold()`** -- locale-independent full case folding.
//!
//! 8. **Post-case-fold fixup** -- maps U+0131 (ı) to ASCII i, and strips
//!    U+0307 COMBINING DOT ABOVE after any `Soft_Dotted`
//!    character (e.g. i, j, Cyrillic і/ј), blocked by intervening starters or
//!    CCC=230 Above combiners (matching the Unicode `After_Soft_Dotted` condition).
//!    This neutralizes casing inconsistencies that `toCasefold()` alone misses:
//!    - **Dotless ı (U+0131):** `toCasefold()` treats ı as distinct from i
//!      (ı folds to itself), yet `toUppercase(ı)` = I even without locale
//!      tailoring, and I folds back to i -- creating a collision.
//!    - **Lithuanian casing rules:** when lowercasing I/J/Į with additional accents above,
//!      Lithuanian rules insert U+0307 to retain the visual dot (e.g.
//!      `lt_lowercase("J\u{0301}")` = `j\u{0307}\u{0301}`).  Conversely,
//!      Lithuanian upper/titlecase removes U+0307 after soft-dotted characters
//!      (e.g. `lt_uppercase("j\u{0307}")` = `J`).  Stripping U+0307 after
//!      soft-dotted characters ensures stability under both directions.
//!
//! 9. **NFC composition** (final) -- recompose after case folding to produce the
//!    canonical NFC output.
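//!
//! The step 8 fixup can be sketched as follows.  This is a simplified illustration
//! rather than the crate's code: all function names are hypothetical, the
//! `Soft_Dotted` set and the list of non-blocking combining marks are tiny samples
//! of the real Unicode data, and `str::to_lowercase` merely stands in for
//! `toCasefold()` (the two differ, for example on ß).
//!
//! ```rust
//! /// A small sample of `Soft_Dotted` characters; the real property has more members.
//! fn is_soft_dotted_sample(c: char) -> bool {
//!     matches!(c, 'i' | 'j' | '\u{0456}' | '\u{0458}')
//! }
//!
//! /// A small sample of combining marks whose combining class is neither 0 nor 230,
//! /// so they do not block `After_Soft_Dotted`: U+0323 DOT BELOW (CCC 220) and
//! /// U+0328 OGONEK (CCC 202).  Every other character is treated as blocking here.
//! fn is_transparent_mark_sample(c: char) -> bool {
//!     matches!(c, '\u{0323}' | '\u{0328}')
//! }
//!
//! /// Sketch of the fixup, applied to already case-folded, decomposed text.
//! fn fixup_sketch(folded: &str) -> String {
//!     let mut out = String::with_capacity(folded.len());
//!     let mut after_soft_dotted = false;
//!     for c in folded.chars() {
//!         match c {
//!             // Dotless i becomes ASCII i.  Treating it as soft-dotted from here
//!             // on is an assumption, made so that applying the fixup twice
//!             // changes nothing.
//!             '\u{0131}' => {
//!                 out.push('i');
//!                 after_soft_dotted = true;
//!             }
//!             // Drop COMBINING DOT ABOVE while still "after" a soft-dotted character.
//!             '\u{0307}' if after_soft_dotted => {}
//!             _ => {
//!                 after_soft_dotted = is_soft_dotted_sample(c)
//!                     || (after_soft_dotted && is_transparent_mark_sample(c));
//!                 out.push(c);
//!             }
//!         }
//!     }
//!     out
//! }
//!
//! /// Lowercase (as a stand-in for case folding), then apply the fixup.
//! fn ci_sketch(name: &str) -> String {
//!     fixup_sketch(&name.to_lowercase())
//! }
//! ```
//!
//! With this sketch, `ci_sketch` maps İ (U+0130), ı (U+0131), `I`, and `i` all to
//! `i`, and `fixup_sketch("j\u{0307}\u{0301}")` gives `j\u{0301}`, the same value
//! as `ci_sketch("J\u{0301}")`.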
//!
//! # OS compatibility mapping
//!
//! Each `PathElementGeneric` also computes an **OS-compatible** form suitable for
//! use as an actual path element name on the host operating system. It is derived
//! from the case-sensitive normalized form, by applying the following additional
//! steps:
//!
//! - **Windows**: the characters and patterns listed in the Windows
//!   [naming conventions](https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions)
//!   are handled by mapping them to visually similar fullwidth Unicode equivalents:
//!   forbidden characters (`< > : " \ | ? *`), the final trailing dot, and the first
//!   character of reserved device names (CON, PRN, AUX, NUL, COM0--COM9, LPT0--LPT9,
//!   and their superscript-digit variants).
//! - **Apple (macOS/iOS)**: converted using [`CFStringGetFileSystemRepresentation`](https://developer.apple.com/documentation/corefoundation/cfstringgetfilesystemrepresentation(_:_:_:))
//!   as recommended by Apple's documentation (produces a representation similar to NFD).
//! - **Other platforms**: the OS-compatible form is identical to the case-sensitive
//!   normalized form.
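//!
//! The Windows mapping and its round trip through the fullwidth-to-ASCII step can
//! be sketched as below.  All names are hypothetical, and the reserved-name test is
//! an assumption made for illustration: it compares the part before the first dot,
//! ignoring ASCII case, and leaves out the superscript-digit variants.
//!
//! ```rust
//! /// Shifts a printable ASCII character into the fullwidth block (offset 0xFEE0).
//! fn to_fullwidth(c: char) -> char {
//!     match c {
//!         '!'..='~' => char::from_u32(c as u32 + 0xFEE0).unwrap_or(c),
//!         _ => c,
//!     }
//! }
//!
//! /// Inverse of `to_fullwidth`, mirroring the fullwidth-to-ASCII pipeline step.
//! fn to_ascii(c: char) -> char {
//!     match c {
//!         '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
//!         _ => c,
//!     }
//! }
//!
//! /// Assumed reserved-name check: CON, PRN, AUX, NUL, COM0-COM9, LPT0-LPT9.
//! fn is_reserved_sample(stem: &str) -> bool {
//!     let upper = stem.to_ascii_uppercase();
//!     match upper.as_str() {
//!         "CON" | "PRN" | "AUX" | "NUL" => true,
//!         s => {
//!             s.len() == 4
//!                 && (s.starts_with("COM") || s.starts_with("LPT"))
//!                 && s.as_bytes()[3].is_ascii_digit()
//!         }
//!     }
//! }
//!
//! fn windows_compatible_sketch(name: &str) -> String {
//!     // Forbidden characters become their fullwidth counterparts.
//!     let mut chars: Vec<char> = name
//!         .chars()
//!         .map(|c| if r#"<>:"\|?*"#.contains(c) { to_fullwidth(c) } else { c })
//!         .collect();
//!     // Only the final trailing dot is replaced.
//!     if let Some(last) = chars.last_mut() {
//!         if *last == '.' {
//!             *last = to_fullwidth('.');
//!         }
//!     }
//!     // For reserved device names, only the first character is replaced.
//!     if is_reserved_sample(name.split('.').next().unwrap_or("")) {
//!         if let Some(first) = chars.first_mut() {
//!             *first = to_fullwidth(*first);
//!         }
//!     }
//!     chars.into_iter().collect()
//! }
//! ```
//!
//! For `"caf\u{00E9} 10:30"` the sketch produces `"caf\u{00E9} 10\u{FF1A}30"`, and
//! mapping every character of the result through `to_ascii` restores the input,
//! which is the round-trip property the pipeline relies on.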
//!
//! # Types
//!
//! The core type is [`PathElementGeneric<'a, S>`], parameterized by a case-sensitivity
//! marker `S`:
//!
//! - [`PathElementCS`] = `PathElementGeneric<'a, CaseSensitive>` -- compile-time
//!   case-sensitive path element.
//! - [`PathElementCI`] = `PathElementGeneric<'a, CaseInsensitive>` -- compile-time
//!   case-insensitive path element.
//! - [`PathElement`] = `PathElementGeneric<'a, CaseSensitivity>` -- runtime-selected
//!   case sensitivity via the [`CaseSensitivity`] enum.
//!
//! Use the typed aliases ([`PathElementCS`], [`PathElementCI`]) when the case sensitivity
//! is known at compile time. These implement [`Hash`](core::hash::Hash), which the
//! runtime-dynamic [`PathElement`] does not (since hashing elements with different
//! sensitivities into the same map would violate hash/eq consistency).
//!
//! The zero-sized marker structs [`CaseSensitive`] and [`CaseInsensitive`] are used as
//! type parameters, while the [`CaseSensitivity`] enum provides the same choice at runtime.
//! All three types implement `Into<CaseSensitivity>`.
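//!
//! The marker-plus-enum pattern can be sketched independently of this crate.  In
//! the illustration below, `Mode`, `Cs`, `Ci`, and `key` are hypothetical stand-ins
//! (not the crate's types), and `str::to_lowercase` replaces the real
//! case-insensitive normalization.
//!
//! ```rust
//! /// Runtime choice between the two behaviours.
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! enum Mode {
//!     Sensitive,
//!     Insensitive,
//! }
//!
//! /// Zero-sized compile-time markers.
//! #[derive(Clone, Copy, Debug, Default)]
//! struct Cs;
//! #[derive(Clone, Copy, Debug, Default)]
//! struct Ci;
//!
//! impl From<Cs> for Mode {
//!     fn from(_: Cs) -> Self {
//!         Mode::Sensitive
//!     }
//! }
//!
//! impl From<Ci> for Mode {
//!     fn from(_: Ci) -> Self {
//!         Mode::Insensitive
//!     }
//! }
//!
//! /// Comparison key for a name; accepts a marker or the runtime enum alike.
//! fn key(name: &str, mode: impl Into<Mode>) -> String {
//!     match mode.into() {
//!         Mode::Sensitive => name.to_owned(),
//!         Mode::Insensitive => name.to_lowercase(),
//!     }
//! }
//! ```
//!
//! Here `key("A.txt", Ci)` and `key("a.txt", Ci)` agree while the `Cs` keys differ,
//! so a single hash map keyed by a mix of both modes could not keep hashing
//! consistent with equality.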
//!
//! # Unicode version
//!
//! All Unicode operations (NFC, NFD, case folding, property lookups) use
//! **Unicode 17.0.0**. Updating to a newer Unicode version is not considered a
//! semver-breaking change as long as the normalization pipeline produces identical
//! results for all strings consisting of characters assigned in the previous version.
//! If a new Unicode version were to change normalization results for previously
//! assigned characters, that update would be a semver-breaking change.
//!
//! This is unlikely (though not formally guaranteed) due to the following
//! [Character Encoding Stability Policies](https://www.unicode.org/policies/stability_policy.html):
//! - If a string contains only characters from a given version of Unicode, and it
//!   is put into a normalized form in accordance with that version of Unicode, then
//!   the results will be identical to the results of putting that string into a
//!   normalized form in accordance with any subsequent version of Unicode.
//! - Once a character is assigned, its canonical combining class will not change.
//! - Once a character is encoded, its properties may still be changed, but not in
//!   such a way as to change the fundamental identity of the character.
//! - For each string S containing only assigned characters in a given Unicode
//!   version, `toCasefold(toNFKC(S))` under that version is identical to
//!   `toCasefold(toNFKC(S))` under any later version of Unicode.
//!
//! # `no_std` support
//!
//! This crate supports `no_std` environments. Disable the default `std` feature:
//!
//! ```toml
//! [dependencies]
//! normalized-path = { version = "...", default-features = false }
//! ```
//!
//! The `std` feature enables `from_os_str` constructors and
//! `os_str`/`into_os_str` accessors. The `alloc` crate is always required.
//!
//! # Examples
//!
//! ```
//! # use normalized_path::{PathElementCS, PathElementCI};
//! // NFD input (e + combining acute) composes to NFC (é), whitespace is trimmed
//! let pe = PathElementCS::new("  cafe\u{0301}.txt  ")?;
//! assert_eq!(pe.original(), "  cafe\u{0301}.txt  ");
//! assert_eq!(pe.normalized(), "caf\u{00E9}.txt");
//!
//! // Case-insensitive: German ß case-folds to "ss"
//! let pe = PathElementCI::new("Stra\u{00DF}e.txt")?;
//! assert_eq!(pe.original(), "Stra\u{00DF}e.txt");
//! assert_eq!(pe.normalized(), "strasse.txt");
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The OS-compatible form adapts names for the host filesystem.  On Windows,
//! forbidden characters and reserved device names are mapped to safe alternatives;
//! on Apple, names are converted to a form close to NFD:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // A name with a Windows-forbidden character and an accented letter
//! let pe = PathElementCS::new("caf\u{00E9} 10:30")?;
//! assert_eq!(pe.normalized(), "caf\u{00E9} 10:30");
//!
//! #[cfg(target_os = "windows")]
//! assert_eq!(pe.os_compatible(), "caf\u{00E9} 10\u{FF1A}30"); // : → fullwidth ：
//!
//! #[cfg(target_vendor = "apple")]
//! assert_eq!(pe.os_compatible(), "cafe\u{0301} 10:30"); // NFC → NFD
//!
//! #[cfg(not(any(target_os = "windows", target_vendor = "apple")))]
//! assert_eq!(pe.os_compatible(), pe.normalized()); // unchanged
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! Equality is based on the normalized form, so different originals can compare equal:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // NFD (e + combining acute) and NFC (é) normalize to the same form
//! let a = PathElementCS::new("cafe\u{0301}.txt")?;
//! let b = PathElementCS::new("caf\u{00E9}.txt")?;
//! assert_eq!(a, b);
//! assert_ne!(a.original(), b.original());
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The typed variants implement [`Hash`](core::hash::Hash) and [`Ord`], so they
//! work in both hash-based and ordered collections:
//!
//! ```
//! # use std::collections::{BTreeSet, HashSet};
//! # use normalized_path::PathElementCI;
//! // Turkish İ, dotless ı, ASCII I, and ASCII i all normalize to the same CI form
//! let names = ["\u{0130}.txt", "\u{0131}.txt", "I.txt", "i.txt"];
//! let set: HashSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(set.len(), 1);
//!
//! let tree: BTreeSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(tree.len(), 1);
//! ```
//!
//! The runtime-dynamic [`PathElement`] works in ordered collections too, but
//! comparing or ordering elements with **different** case sensitivities will panic:
//!
//! ```
//! # use std::collections::BTreeSet;
//! # use normalized_path::{PathElement, CaseSensitive, CaseInsensitive};
//! // "ss", "SS", "sS", "Ss", sharp s (ß), capital sharp s (ẞ)
//! let names = ["ss", "SS", "sS", "Ss", "\u{00DF}", "\u{1E9E}"];
//!
//! let cs: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseSensitive).unwrap())
//!     .collect();
//! assert_eq!(cs.len(), 6); // case-sensitive: all distinct
//!
//! let ci: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseInsensitive).unwrap())
//!     .collect();
//! assert_eq!(ci.len(), 1); // case-insensitive: all normalize to "ss"
//! ```

#![cfg_attr(not(feature = "std"), no_std)]
#![cfg_attr(docsrs, feature(doc_cfg))]
#![warn(clippy::all, clippy::pedantic)]

extern crate alloc;

mod case_sensitivity;
mod error;
mod normalize;
mod os;
mod path_element;
mod unicode;
mod utils;

pub use case_sensitivity::{CaseInsensitive, CaseSensitive, CaseSensitivity};
pub use error::{Error, ErrorKind, Result};
pub use path_element::{PathElement, PathElementCI, PathElementCS, PathElementGeneric};

#[cfg(any(feature = "__test", test))]
pub mod test_helpers {
    pub use crate::error::ResultKind;
    pub use crate::normalize::{
        fixup_case_fold, map_fullwidth, normalize_ci_from_normalized_cs, normalize_cs,
        validate_path_element,
    };
    pub use crate::os::{
        apple_compatible_from_normalized_cs, apple_compatible_from_normalized_cs_fallback,
        is_reserved_on_windows, windows_compatible_from_normalized_cs,
    };
    pub use crate::unicode::{case_fold, is_starter, is_whitespace, nfc, nfd};
}
310        apple_compatible_from_normalized_cs, apple_compatible_from_normalized_cs_fallback,
311        is_reserved_on_windows, windows_compatible_from_normalized_cs,
312    };
313    pub use crate::unicode::{case_fold, is_starter, is_whitespace, nfc, nfd};
314}