//! Opinionated cross-platform, optionally case-insensitive path normalization.
//!
//! This crate provides [`PathElementGeneric`], a type that takes a raw path element name,
//! validates it, normalizes it to a canonical form, and computes an OS-compatible
//! presentation form.
//!
//! # Design goals and non-goals
//!
//! **Goals:**
//!
//! - The normalization procedure is identical on every platform -- the same input
//!   always produces the same normalized bytes regardless of the host OS.
//! - If any supported OS considers two names equivalent (e.g. NFC vs NFD on macOS),
//!   they must normalize to the same value.
//! - The normalized form is always in NFC (Unicode Normalization Form C), the
//!   most widely used and compact canonical form.
//! - Normalization is idempotent: normalizing an already-normalized name always
//!   produces the same name unchanged.
//! - The OS-compatible form of a name, when normalized again, produces the same
//!   normalized value as the original input (round-trip stability).
//! - Every valid name is representable on every supported OS.  Characters that
//!   would be rejected or silently altered (Windows forbidden characters, C0 controls)
//!   are mapped to visually similar safe alternatives.
//! - If the OS automatically transforms a name (e.g. NFC↔NFD conversion,
//!   truncation at null bytes), normalizing the transformed name produces the
//!   same result as normalizing the original.
//! - In case-insensitive mode, names differing only in case normalize identically,
//!   with correct handling of edge cases like Turkish dotted/dotless I.
//!
//! **Non-goals:**
//!
//! - Not every name that a particular OS accepts is considered valid.  Non-UTF-8
//!   byte sequences, names that normalize to empty (e.g. whitespace-only), and
//!   names that normalize to `.` or `..` (e.g. `" .. "`) are always rejected.
//! - A name taken directly from the OS may produce a different OS-compatible form
//!   after normalization.  For example, a file named `" hello.txt"` (leading space)
//!   will have the space trimmed, so its OS-compatible form is `"hello.txt"`.
//! - The OS-compatible form is not guaranteed to be accepted by the OS.  For
//!   example, it may exceed the OS's path element length limit, or on Apple
//!   platforms the filesystem may require names in Unicode Stream-Safe Text Format,
//!   which the OS-compatible form does not enforce.
//! - Windows 8.3 short file names (e.g. `PROGRA~1`) are not handled.
//! - Visually similar names are not necessarily considered equal.  For example,
//!   a regular space (U+0020) and a non-breaking space (U+00A0) produce different
//!   normalized forms despite looking identical.
//! - Fullwidth and ASCII variants of the same character (e.g. `Ａ` (U+FF21) vs `A`)
//!   are deliberately normalized to the same form.  Users who need to distinguish
//!   them cannot use this crate.
//! - Path separators and multi-component paths are not handled.  This crate
//!   operates on a single path element (one name between separators).  Support
//!   for full paths may be added in a future version.
//! - Android versions before 6 (API level 23) are not supported.  Earlier
//!   versions used Java Modified UTF-8 for filesystem paths, encoding
//!   supplementary characters as CESU-8 surrogate pairs.
//!
//! # Normalization pipeline
//!
//! Every path element name goes through the following steps during construction:
//!
//! 0. **Byte decoding** (only for `from_bytes`/`from_os_str`) --
//!    [`String::from_utf8_lossy()`] is applied, replacing invalid byte sequences
//!    with U+FFFD.  Invalid bytes can be encountered on Unix filesystems, which
//!    allow arbitrary bytes except `/` and `\0` in names, and on Windows, where
//!    filenames are WTF-16 and may contain unpaired surrogates.
//!
//! 1. **NFD decomposition** -- canonical decomposition to reorder combining marks.
//!    This is needed because macOS stores filenames in a form close to NFD, so an
//!    NFD input and an NFC input must produce the same result.  Decomposing first
//!    ensures combining marks are in canonical order before subsequent steps.
//!
//! 2. **Whitespace trimming** -- strips leading and trailing characters with the Unicode
//!    `White_Space` property, plus the BOM (U+FEFF) and Control Pictures that correspond
//!    to whitespace control characters (U+2409--U+240D: HT, LF, VT, FF, CR).
//!    Many applications strip leading/trailing whitespace silently, and macOS
//!    automatically strips leading BOMs.  Control Pictures are
//!    included because they are the mapped form of whitespace control characters
//!    (see step 4), so trimming must be consistent before and after mapping.
//!
//! 3. **Fullwidth-to-ASCII mapping** -- maps fullwidth forms (U+FF01--U+FF5E) to their
//!    ASCII equivalents (U+0021--U+007E).  The Windows OS-compatibility step (see below)
//!    maps certain ASCII characters to fullwidth to avoid Windows restrictions.  This
//!    step ensures that the OS-compatible form normalizes back to the same value.
//!
//! 4. **Control character mapping** -- maps C0 controls (U+0001--U+001F) and DEL (U+007F)
//!    to their Unicode Control Picture equivalents (U+2401--U+241F, U+2421).  Control
//!    characters are invisible, can break terminals and tools, and some OSes reject
//!    or silently drop them.  Mapping to visible Control Pictures preserves the
//!    information while making the name safe.  (Null bytes are excluded -- see step 5.)
//!
//! 5. **Validation** -- rejects empty strings, `.`, `..`, names containing `/`, and
//!    names containing null bytes (`\0`).  These are universally special on all OSes
//!    and cannot be used as regular names.
//!
//! 6. **NFC composition** -- canonical composition to produce the shortest equivalent
//!    form.
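//!
//! The steps above can be approximated with the standard library alone. The
//! following is a minimal sketch of steps 0 and 2--5 only: it omits the NFD and
//! NFC steps (1 and 6), which need Unicode normalization tables that `std` does
//! not provide. The names `is_trimmable`, `fullwidth_to_ascii`,
//! `control_to_picture`, and `sketch_normalize` are hypothetical and are not part
//! of this crate's API.
//!
//! ```rust
//! /// Step 2 predicate: `White_Space`, the BOM, or a Control Picture for HT..CR.
//! fn is_trimmable(c: char) -> bool {
//!     c.is_whitespace() || c == '\u{FEFF}' || ('\u{2409}'..='\u{240D}').contains(&c)
//! }
//!
//! /// Step 3: fullwidth forms U+FF01..=U+FF5E sit exactly 0xFEE0 above ASCII.
//! fn fullwidth_to_ascii(c: char) -> char {
//!     match c {
//!         '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
//!         _ => c,
//!     }
//! }
//!
//! /// Step 4: C0 controls map to U+2401..=U+241F, DEL maps to U+2421.
//! fn control_to_picture(c: char) -> char {
//!     match c {
//!         '\u{0001}'..='\u{001F}' => char::from_u32(c as u32 + 0x2400).unwrap_or(c),
//!         '\u{007F}' => '\u{2421}',
//!         _ => c,
//!     }
//! }
//!
//! /// Hypothetical driver; returns `None` where step 5 would reject the name.
//! fn sketch_normalize(raw: &[u8]) -> Option<String> {
//!     let decoded = String::from_utf8_lossy(raw); // step 0
//!     let mapped: String = decoded
//!         .trim_matches(is_trimmable) // step 2
//!         .chars()
//!         .map(fullwidth_to_ascii) // step 3
//!         .map(control_to_picture) // step 4
//!         .collect();
//!     // Step 5: empty, `.`, `..`, `/`, and NUL are never valid names.
//!     let invalid = mapped.is_empty()
//!         || mapped == "."
//!         || mapped == ".."
//!         || mapped.contains('/')
//!         || mapped.contains('\0');
//!     if invalid { None } else { Some(mapped) }
//! }
//! ```
//!
//! Under these assumptions, `sketch_normalize(" .. ".as_bytes())` is rejected
//! because trimming leaves `..`, and a fullwidth solidus (U+FF0F) is rejected
//! because step 3 turns it into `/` before validation runs.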
//!
//! In **case-insensitive** mode, four additional steps are applied after the above:
//!
//! 7. **NFD decomposition** (again, on the NFC result).  Steps 7, 8, and 10 implement
//!    the Unicode canonical caseless matching algorithm (Definition D145): *"A string
//!    X is a canonical caseless match for a string Y if and only if:
//!    NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))"*, with an additional
//!    Turkish I/i fixup in step 9.
//!
//! 8. **Unicode `toCasefold()`** -- locale-independent case folding.
//!
//! 9. **Turkish I/i mapping** -- maps U+0130 (İ) and U+0131 (ı) to ASCII I and i
//!    respectively, and strips U+0307 COMBINING DOT ABOVE after I/i (with intervening
//!    non-starter combiners allowed).  Unicode `toCasefold()` is locale-independent
//!    and treats ı as distinct from i, yet locale-aware uppercasing/lowercasing can
//!    map them to ASCII I/i, creating collisions that `toCasefold()` alone misses.
//!    This post-folding fixup neutralizes those inconsistencies.
//!
//! 10. **NFC composition** (final) -- recompose after case folding to produce the
//!     canonical NFC output.
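//!
//! As a rough illustration of step 9, the sketch below applies the two character
//! mappings and drops a U+0307 that directly follows I or i. It is a
//! simplification under stated assumptions: it does not look past intervening
//! non-starter combining marks (that needs canonical combining class data, which
//! `std` lacks), and `sketch_turkish_i` is a hypothetical name, not this crate's
//! implementation.
//!
//! ```rust
//! /// Simplified step 9: only handles a U+0307 immediately after I/i.
//! fn sketch_turkish_i(input: &str) -> String {
//!     let mut out = String::with_capacity(input.len());
//!     let mut after_i = false; // true when the last kept char is `I` or `i`
//!     for c in input.chars() {
//!         let mapped = match c {
//!             '\u{0130}' => 'I', // LATIN CAPITAL LETTER I WITH DOT ABOVE
//!             '\u{0131}' => 'i', // LATIN SMALL LETTER DOTLESS I
//!             other => other,
//!         };
//!         if mapped == '\u{0307}' && after_i {
//!             continue; // drop COMBINING DOT ABOVE after I/i
//!         }
//!         after_i = matches!(mapped, 'I' | 'i');
//!         out.push(mapped);
//!     }
//!     out
//! }
//! ```
//!
//! Full case folding turns U+0130 into `i` followed by U+0307, so after step 8 the
//! dotted capital typically arrives here as that two-character sequence, which
//! the sketch reduces to a plain `i`.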
//!
//! # OS compatibility mapping
//!
//! Each `PathElementGeneric` also computes an **OS-compatible** form suitable for
//! use as an actual path element name on the host operating system. It is derived
//! from the case-sensitive normalized form, by applying the following additional
//! steps:
//!
//! - **Windows**: the characters and patterns listed in the Windows
//!   [naming conventions](https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions)
//!   are handled by mapping them to visually similar fullwidth Unicode equivalents:
//!   forbidden characters (`< > : " \ | ? *`), the final trailing dot, and the first
//!   character of reserved device names (CON, PRN, AUX, NUL, COM0--COM9, LPT0--LPT9,
//!   and their superscript-digit variants).
//! - **Apple (macOS/iOS)**: converted using [`CFStringGetFileSystemRepresentation`](https://developer.apple.com/documentation/corefoundation/cfstringgetfilesystemrepresentation(_:_:_:))
//!   as recommended by Apple's documentation (produces a representation similar to NFD).
//! - **Other platforms**: the OS-compatible form is identical to the case-sensitive
//!   normalized form.
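//!
//! To make the Windows mapping concrete, here is a minimal sketch that handles
//! only the forbidden characters and a final trailing dot. It assumes each
//! character maps to the fullwidth form at its code point plus 0xFEE0 (the
//! inverse of pipeline step 3), and it leaves out reserved device names entirely.
//! `sketch_windows_compatible` is a hypothetical helper, not this crate's API.
//!
//! ```rust
//! /// Maps Windows-forbidden ASCII characters and a final trailing dot to
//! /// fullwidth forms. Reserved device names are NOT handled in this sketch.
//! fn sketch_windows_compatible(normalized: &str) -> String {
//!     const FORBIDDEN: [char; 8] = ['<', '>', ':', '"', '\\', '|', '?', '*'];
//!     let to_fullwidth = |c: char| char::from_u32(c as u32 + 0xFEE0).unwrap_or(c);
//!     // Index of the last character, used to detect a trailing dot.
//!     let last = normalized.chars().count().saturating_sub(1);
//!     normalized
//!         .chars()
//!         .enumerate()
//!         .map(|(i, c)| {
//!             if FORBIDDEN.contains(&c) || (i == last && c == '.') {
//!                 to_fullwidth(c)
//!             } else {
//!                 c
//!             }
//!         })
//!         .collect()
//! }
//! ```
//!
//! Because every replacement lies in U+FF01--U+FF5E, normalizing the result again
//! maps it back to ASCII in step 3, which is what gives the round-trip stability
//! listed among the design goals.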
//!
//! # Types
//!
//! The core type is [`PathElementGeneric<'a, S>`], parameterized by a case-sensitivity
//! marker `S`:
//!
//! - [`PathElementCS`] = `PathElementGeneric<'a, CaseSensitive>` -- compile-time
//!   case-sensitive path element.
//! - [`PathElementCI`] = `PathElementGeneric<'a, CaseInsensitive>` -- compile-time
//!   case-insensitive path element.
//! - [`PathElement`] = `PathElementGeneric<'a, CaseSensitivity>` -- runtime-selected
//!   case sensitivity via the [`CaseSensitivity`] enum.
//!
//! Use the typed aliases ([`PathElementCS`], [`PathElementCI`]) when the case sensitivity
//! is known at compile time. These implement [`Hash`](core::hash::Hash), which the
//! runtime-dynamic [`PathElement`] does not (since hashing elements with different
//! sensitivities into the same map would violate hash/eq consistency).
//!
//! The zero-sized marker structs [`CaseSensitive`] and [`CaseInsensitive`] are used as
//! type parameters, while the [`CaseSensitivity`] enum provides the same choice at runtime.
//! All three types implement `Into<CaseSensitivity>`.
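//!
//! The marker-plus-enum arrangement described above can be sketched with
//! stand-in types. Everything in this block (`Sensitivity`, `MarkerSensitive`,
//! `MarkerInsensitive`, `resolve`, and the variant names) is hypothetical and
//! chosen for illustration; the crate's real definitions may differ.
//!
//! ```rust
//! /// Hypothetical runtime choice, standing in for a sensitivity enum.
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! enum Sensitivity {
//!     Sensitive,
//!     Insensitive,
//! }
//!
//! /// Hypothetical zero-sized compile-time markers.
//! #[derive(Clone, Copy, Debug, Default)]
//! struct MarkerSensitive;
//! #[derive(Clone, Copy, Debug, Default)]
//! struct MarkerInsensitive;
//!
//! impl From<MarkerSensitive> for Sensitivity {
//!     fn from(_: MarkerSensitive) -> Self {
//!         Sensitivity::Sensitive
//!     }
//! }
//!
//! impl From<MarkerInsensitive> for Sensitivity {
//!     fn from(_: MarkerInsensitive) -> Self {
//!         Sensitivity::Insensitive
//!     }
//! }
//!
//! /// Accepts a marker or the enum itself (`From<T> for T` is reflexive).
//! fn resolve(s: impl Into<Sensitivity>) -> Sensitivity {
//!     s.into()
//! }
//! ```
//!
//! With this shape, one constructor bound on `Into<Sensitivity>` can serve both
//! the compile-time and the runtime-selected variants, and the markers add no
//! storage because they are zero-sized.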
//!
//! # Examples
//!
//! ```
//! # use normalized_path::{PathElementCS, PathElementCI};
//! // Case-sensitive: whitespace trimmed, fullwidth mapped to ASCII, NFC composed
//! let pe = PathElementCS::new("  \u{FF21}bc.txt  ")?;
//! assert_eq!(pe.normalized(), "Abc.txt");
//!
//! // Case-insensitive: additionally case-folded
//! let pe = PathElementCI::new("Hello.TXT")?;
//! assert_eq!(pe.normalized(), "hello.txt");
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! Equality is based on the normalized form, so different originals can compare equal:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! let a = PathElementCS::new("  hello.txt  ")?;
//! let b = PathElementCS::new("hello.txt")?;
//! assert_eq!(a, b);
//! assert_ne!(a.original(), b.original());
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The typed variants implement both [`Hash`](core::hash::Hash) and [`Ord`], so they
//! work in hash-based and ordered collections alike:
//!
//! ```
//! # use std::collections::{BTreeSet, HashSet};
//! # use normalized_path::PathElementCI;
//! let names = ["README.md", "readme.md", "Readme.MD"];
//! let set: HashSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(set.len(), 1);
//!
//! let tree: BTreeSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(tree.len(), 1);
//! ```
//!
//! The runtime-dynamic [`PathElement`] works in ordered collections too, but
//! comparing or ordering elements with **different** case sensitivities will panic:
//!
//! ```
//! # use std::collections::BTreeSet;
//! # use normalized_path::{PathElement, CaseSensitive};
//! let names = ["README.md", "readme.md", "Readme.MD"];
//! let tree: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseSensitive).unwrap())
//!     .collect();
//! assert_eq!(tree.len(), 3); // case-sensitive: all distinct
//! ```
//!
//! # Unicode version
//!
//! All Unicode operations (NFC, NFD, case folding, property lookups) use
//! **Unicode 17.0.0**. The Unicode version is considered part of the crate's
//! stability contract: it will only be updated in a semver-breaking release to
//! ensure that normalization results are consistent across all compatible versions.
//!
//! # `no_std` support
//!
//! This crate supports `no_std` environments. Disable the default `std` feature:
//!
//! ```toml
//! [dependencies]
//! normalized-path = { version = "...", default-features = false }
//! ```
//!
//! The `std` feature enables `from_os_str` constructors and
//! `os_str`/`into_os_str` accessors. The `alloc` crate is always required.

#![cfg_attr(not(feature = "std"), no_std)]
#![cfg_attr(docsrs, feature(doc_cfg))]
#![warn(clippy::all, clippy::pedantic)]

extern crate alloc;

mod case_sensitivity;
mod error;
mod normalize;
mod os;
mod path_element;
mod unicode;
mod utils;

pub use case_sensitivity::{CaseInsensitive, CaseSensitive, CaseSensitivity};
pub use error::{Error, ErrorKind, Result};
pub use path_element::{PathElement, PathElementCI, PathElementCS, PathElementGeneric};

#[cfg(any(feature = "__test", test))]
pub mod test_helpers {
    pub use crate::error::ResultKind;
    pub use crate::normalize::{
        is_whitespace_like, map_control_chars, map_fullwidth, map_turkish_i,
        normalize_ci_from_normalized_cs, normalize_cs, trim_whitespace_like, validate_path_element,
    };
    pub use crate::os::{
        apple_compatible_from_normalized_cs, apple_compatible_from_normalized_cs_fallback,
        is_reserved_on_windows, windows_compatible_from_normalized_cs,
    };
    pub use crate::unicode::{case_fold, is_starter, is_whitespace, nfc, nfd};
}