//! Opinionated cross-platform, optionally case-insensitive path normalization.
//!
//! This crate provides [`PathElementCS`] (case-sensitive), [`PathElementCI`]
//! (case-insensitive), and [`PathElement`] (runtime-selected) -- types that take a
//! raw path element name, validate it, normalize it to a canonical form, and compute
//! an OS-compatible presentation form.
//!
//! # Design goals and non-goals
//!
//! **Goals:**
//!
//! - The normalization procedure is identical on every platform -- the same input
//!   always produces the same normalized bytes regardless of the host OS.
//! - If any supported OS considers two names equivalent (e.g. NFC vs NFD on macOS),
//!   they must normalize to the same value.
//! - The normalized form is always in NFC (Unicode Normalization Form C), the
//!   most widely used and compact canonical form.
//! - Normalization is idempotent: normalizing an already-normalized name
//!   returns it unchanged.
//! - The OS-compatible form of a name, when normalized again, produces the same
//!   normalized value as the original input (round-trip stability).
//! - Every valid name is representable on every supported OS.  Characters that
//!   would be rejected or silently altered (Windows forbidden characters, C0 controls)
//!   are mapped to visually similar safe alternatives.
//! - If the OS automatically transforms a name (e.g. NFC↔NFD conversion,
//!   truncation at null bytes), normalizing the transformed name produces the
//!   same result as normalizing the original.
//! - In case-insensitive mode, names differing only in case normalize identically,
//!   with correct handling of edge cases like Turkish dotted/dotless I.
//!
//! **Non-goals:**
//!
//! - Not every name that a particular OS accepts is considered valid.  Non-UTF-8
//!   byte sequences, names that normalize to empty (e.g. whitespace-only), and
//!   names that normalize to `.` or `..` (e.g. `" .. "`) are always rejected.
//! - A name taken directly from the OS may produce a different OS-compatible form
//!   after normalization.  For example, a file named `" hello.txt"` (leading space)
//!   will have the space trimmed, so its OS-compatible form is `"hello.txt"`.
//! - The OS-compatible form is not guaranteed to be accepted by the OS.  For
//!   example, it may exceed the OS's path element length limit, or on Apple
//!   platforms the filesystem may require names in Unicode Stream-Safe Text Format
//!   which the OS-compatible form does not enforce.
//! - Windows 8.3 short file names (e.g. `PROGRA~1`) are not handled.
//! - Visually similar names are not necessarily considered equal.  For example,
//!   a regular space (U+0020) and a non-breaking space (U+00A0) produce different
//!   normalized forms despite looking identical.
//! - Fullwidth and ASCII variants of the same character (e.g. `Ａ` vs `A`) are
//!   deliberately normalized to the same form.  Users who need to distinguish
//!   them cannot use this crate.
//! - Path separators and multi-component paths are not handled.  This crate
//!   operates on a single path element (one name between separators).  Support
//!   for full paths may be added in a future version.
//! - Android versions before 6 (API level 23) are not supported.  Earlier
//!   versions used Java Modified UTF-8 for filesystem paths, encoding
//!   supplementary characters as CESU-8 surrogate pairs.
//!
//! # Normalization pipeline
//!
//! Every path element name goes through the following steps during construction:
//!
//! 0. **Byte decoding** (only for `from_bytes`/`from_os_str`) --
//!    [`String::from_utf8_lossy()`] is applied, replacing invalid byte sequences
//!    with U+FFFD.  Invalid bytes can be encountered on Unix filesystems, which
//!    allow arbitrary bytes except `/` and `\0` in names, and on Windows, where
//!    filenames are WTF-16 and may contain unpaired surrogates.
//!
//! 1. **NFD decomposition** -- canonical decomposition to reorder combining marks.
//!    This is needed because macOS stores filenames in a form close to NFD, so an
//!    NFD input and an NFC input must produce the same result.  Decomposing first
//!    ensures combining marks are in canonical order before subsequent steps.
//!
//! 2. **Whitespace trimming** -- strips leading and trailing characters with the Unicode
//!    `White_Space` property, plus the BOM (U+FEFF) and Control Pictures that correspond
//!    to whitespace control characters (U+2409--U+240D: HT, LF, VT, FF, CR).
//!    Many applications strip leading/trailing whitespace silently, and macOS
//!    automatically strips leading BOMs.  Control Pictures are
//!    included because they are the mapped form of whitespace control characters
//!    (see step 4), so trimming must be consistent before and after mapping.
//!
//! 3. **Fullwidth-to-ASCII mapping** -- maps fullwidth forms (U+FF01--U+FF5E) to their
//!    ASCII equivalents (U+0021--U+007E).  The Windows OS-compatibility step (see below)
//!    maps certain ASCII characters to fullwidth to avoid Windows restrictions.  This
//!    step ensures that the OS-compatible form normalizes back to the same value.
//!
//! 4. **Control character mapping** -- maps C0 controls (U+0001--U+001F) and DEL (U+007F)
//!    to their Unicode Control Picture equivalents (U+2401--U+241F, U+2421).  Control
//!    characters are invisible, can break terminals and tools, and some OSes reject
//!    or silently drop them.  Mapping to visible Control Pictures preserves the
//!    information while making the name safe.  (Null bytes are excluded; see step 5.)
//!
//! 5. **Validation** -- rejects empty strings, `.`, `..`, names containing `/`, and
//!    names containing null bytes (`\0`).  These are special on every supported OS
//!    and cannot be used as regular names.
//!
//! 6. **NFC composition** -- canonical composition to produce the shortest equivalent
//!    form.
//!
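//! The steps that need no Unicode tables can be sketched with the standard library
//! alone.  This is a minimal illustration, not the crate's implementation: it covers
//! step 0 and steps 2--5 only, omits NFD/NFC (steps 1 and 6), and every name ending
//! in `_sketch` is hypothetical.
//!
//! ```rust
//! /// Step 0: lossy UTF-8 decoding; invalid sequences become U+FFFD.
//! fn decode_lossy_sketch(bytes: &[u8]) -> String {
//!     String::from_utf8_lossy(bytes).into_owned()
//! }
//!
//! /// Step 2: characters trimmed from both ends of the name.
//! fn trims_like_whitespace_sketch(c: char) -> bool {
//!     // `char::is_whitespace` tests the Unicode `White_Space` property.
//!     c.is_whitespace() || c == '\u{FEFF}' || ('\u{2409}'..='\u{240D}').contains(&c)
//! }
//!
//! /// Steps 3 and 4 applied to a single character.
//! fn map_char_sketch(c: char) -> char {
//!     match c {
//!         // Fullwidth forms sit 0xFEE0 above their ASCII equivalents.
//!         '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
//!         // C0 controls map to Control Pictures 0x2400 above them.
//!         '\u{0001}'..='\u{001F}' => char::from_u32(c as u32 + 0x2400).unwrap_or(c),
//!         // DEL maps to U+2421 SYMBOL FOR DELETE.
//!         '\u{007F}' => '\u{2421}',
//!         other => other,
//!     }
//! }
//!
//! /// Steps 2--5 in order; `None` stands in for the crate's error value.
//! fn normalize_sketch(input: &str) -> Option<String> {
//!     let trimmed = input.trim_matches(trims_like_whitespace_sketch);
//!     let mapped: String = trimmed.chars().map(map_char_sketch).collect();
//!     let invalid = mapped.is_empty()
//!         || mapped == "."
//!         || mapped == ".."
//!         || mapped.contains('/')
//!         || mapped.contains('\0');
//!     if invalid { None } else { Some(mapped) }
//! }
//! ```
//!
//! Under these assumptions, `normalize_sketch("  a\u{FF1A}b ")` yields `Some("a:b")`,
//! and `normalize_sketch(" .. ")` yields `None`.
//!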
//! In **case-insensitive** mode, four additional steps are applied after the above:
//!
//! 7. **NFD decomposition** (again, on the NFC result).  Steps 7, 8, and 10 implement
//!    the Unicode canonical caseless matching algorithm (Definition D145): *"A string
//!    X is a canonical caseless match for a string Y if and only if:
//!    NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))"*, with an additional
//!    Turkish I/i fixup in step 9.
//!
//! 8. **Unicode `toCasefold()`** -- locale-independent case folding.
//!
//! 9. **Turkish I/i mapping** -- maps U+0130 (İ) and U+0131 (ı) to ASCII I and i
//!    respectively, and strips U+0307 COMBINING DOT ABOVE after I/i (with intervening
//!    non-starter combiners allowed).  Unicode `toCasefold()` is locale-independent
//!    and treats ı as distinct from i (ı folds to itself), yet `toUppercase(ı)` = I
//!    even without locale tailoring, and I folds back to i -- creating a collision
//!    that `toCasefold()` alone misses.
//!    This post-folding fixup neutralizes those inconsistencies.
//!
//! 10. **NFC composition** (final) -- recompose after case folding to produce the
//!     canonical NFC output.
//!
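//! The fixup in step 9 can be sketched as a single pass over the folded, decomposed
//! string.  This is an illustration under stated assumptions, not the crate's code:
//! `turkish_i_fixup_sketch` is a hypothetical name, and the `is_starter` predicate
//! (true for characters with canonical combining class 0) is supplied by the caller
//! because the standard library has no such lookup.
//!
//! ```rust
//! fn turkish_i_fixup_sketch(folded: &str, is_starter: impl Fn(char) -> bool) -> String {
//!     let mut out = String::with_capacity(folded.len());
//!     // True while we are inside the combining sequence that follows an I or i.
//!     let mut after_i = false;
//!     for c in folded.chars() {
//!         // Map dotted capital I and dotless small i to their ASCII counterparts.
//!         let c = match c {
//!             '\u{0130}' => 'I',
//!             '\u{0131}' => 'i',
//!             other => other,
//!         };
//!         if c == 'I' || c == 'i' {
//!             after_i = true;
//!         } else if after_i && c == '\u{0307}' {
//!             // Drop COMBINING DOT ABOVE attached to the preceding I/i.
//!             continue;
//!         } else if is_starter(c) {
//!             // Any other starter ends the combining sequence.
//!             after_i = false;
//!         }
//!         out.push(c);
//!     }
//!     out
//! }
//! ```
//!
//! With a predicate that reports combining marks as non-starters, `"i\u{0307}"`
//! becomes `"i"` while `"a\u{0307}"` is left untouched.
//!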
//! # OS compatibility mapping
//!
//! Each `PathElementGeneric` also computes an **OS-compatible** form suitable for
//! use as an actual path element name on the host operating system. It is derived
//! from the case-sensitive normalized form, by applying the following additional
//! steps:
//!
//! - **Windows**: the characters and patterns listed in the Windows
//!   [naming conventions](https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions)
//!   are handled by mapping them to visually similar fullwidth Unicode equivalents:
//!   forbidden characters (`< > : " \ | ? *`), the final trailing dot, and the first
//!   character of reserved device names (CON, PRN, AUX, NUL, COM0--COM9, LPT0--LPT9,
//!   and their superscript-digit variants).
//! - **Apple (macOS/iOS)**: converted using [`CFStringGetFileSystemRepresentation`](https://developer.apple.com/documentation/corefoundation/cfstringgetfilesystemrepresentation(_:_:_:))
//!   as recommended by Apple's documentation (produces a representation similar to NFD).
//! - **Other platforms**: the OS-compatible form is identical to the case-sensitive
//!   normalized form.
//!
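//! The character-level part of the Windows mapping can be sketched as follows.  This
//! is a hedged illustration rather than the crate's implementation: the `_sketch`
//! names are hypothetical, "fullwidth equivalent" is assumed to mean the code point
//! 0xFEE0 above the ASCII one, and reserved device names are not handled here.
//!
//! ```rust
//! /// Assumed mapping: ASCII U+0021..=U+007E to fullwidth U+FF01..=U+FF5E.
//! fn to_fullwidth_sketch(c: char) -> char {
//!     char::from_u32(c as u32 + 0xFEE0).unwrap_or(c)
//! }
//!
//! fn windows_compatible_sketch(normalized: &str) -> String {
//!     // Replace each Windows-forbidden character with its fullwidth form.
//!     let mut out: String = normalized
//!         .chars()
//!         .map(|c| match c {
//!             '<' | '>' | ':' | '"' | '\\' | '|' | '?' | '*' => to_fullwidth_sketch(c),
//!             other => other,
//!         })
//!         .collect();
//!     // Only the final trailing dot is replaced; earlier dots are kept.
//!     if out.ends_with('.') {
//!         out.pop();
//!         out.push(to_fullwidth_sketch('.'));
//!     }
//!     out
//! }
//! ```
//!
//! Because normalization step 3 maps fullwidth forms back to ASCII, feeding the output
//! of such a mapping through normalization again recovers the original normalized name.
//!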
//! # Types
//!
//! The core type is [`PathElementGeneric<'a, S>`], parameterized by a case-sensitivity
//! marker `S`:
//!
//! - [`PathElementCS`] = `PathElementGeneric<'a, CaseSensitive>` -- compile-time
//!   case-sensitive path element.
//! - [`PathElementCI`] = `PathElementGeneric<'a, CaseInsensitive>` -- compile-time
//!   case-insensitive path element.
//! - [`PathElement`] = `PathElementGeneric<'a, CaseSensitivity>` -- runtime-selected
//!   case sensitivity via the [`CaseSensitivity`] enum.
//!
//! Use the typed aliases ([`PathElementCS`], [`PathElementCI`]) when the case sensitivity
//! is known at compile time. These implement [`Hash`](core::hash::Hash), which the
//! runtime-dynamic [`PathElement`] does not (since hashing elements with different
//! sensitivities into the same map would violate hash/eq consistency).
//!
//! The zero-sized marker structs [`CaseSensitive`] and [`CaseInsensitive`] are used as
//! type parameters, while the [`CaseSensitivity`] enum provides the same choice at runtime.
//! All three types implement `Into<CaseSensitivity>`.
//!
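//! The marker/enum pattern described above can be sketched in isolation.  The types
//! below are hypothetical stand-ins, not the crate's definitions: the names and the
//! enum variants are invented for illustration, and only the `Into` relationship is
//! taken from the description above.
//!
//! ```rust
//! /// Runtime choice (stand-in for the crate's enum).
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! enum SensitivitySketch {
//!     Sensitive,
//!     Insensitive,
//! }
//!
//! /// Zero-sized compile-time markers (stand-ins for the crate's marker structs).
//! #[derive(Clone, Copy, Debug, Default)]
//! struct SensitiveSketch;
//! #[derive(Clone, Copy, Debug, Default)]
//! struct InsensitiveSketch;
//!
//! impl From<SensitiveSketch> for SensitivitySketch {
//!     fn from(_: SensitiveSketch) -> Self {
//!         SensitivitySketch::Sensitive
//!     }
//! }
//!
//! impl From<InsensitiveSketch> for SensitivitySketch {
//!     fn from(_: InsensitiveSketch) -> Self {
//!         SensitivitySketch::Insensitive
//!     }
//! }
//!
//! /// Accepts a marker or the enum itself; the enum converts via the blanket
//! /// `From<T> for T` implementation.
//! fn resolve_sketch(choice: impl Into<SensitivitySketch>) -> SensitivitySketch {
//!     choice.into()
//! }
//! ```
//!
//! A constructor written against `impl Into<...>` in this style accepts either a
//! marker or a runtime value, at no storage cost for the markers.
//!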
//! # Examples
//!
//! ```
//! # use normalized_path::{PathElementCS, PathElementCI};
//! // NFD input (e + combining acute) composes to NFC (é), whitespace is trimmed
//! let pe = PathElementCS::new("  cafe\u{0301}.txt  ")?;
//! assert_eq!(pe.original(), "  cafe\u{0301}.txt  ");
//! assert_eq!(pe.normalized(), "caf\u{00E9}.txt");
//!
//! // Case-insensitive: German ß case-folds to "ss"
//! let pe = PathElementCI::new("Stra\u{00DF}e.txt")?;
//! assert_eq!(pe.original(), "Stra\u{00DF}e.txt");
//! assert_eq!(pe.normalized(), "strasse.txt");
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The OS-compatible form adapts names for the host filesystem.  On Windows,
//! forbidden characters and reserved device names are mapped to safe alternatives;
//! on Apple, names are converted to a form close to NFD:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // A name with a Windows-forbidden character and an accented letter
//! let pe = PathElementCS::new("caf\u{00E9} 10:30")?;
//! assert_eq!(pe.normalized(), "caf\u{00E9} 10:30");
//!
//! #[cfg(target_os = "windows")]
//! assert_eq!(pe.os_compatible(), "caf\u{00E9} 10\u{FF1A}30"); // : → fullwidth :
//!
//! #[cfg(target_vendor = "apple")]
//! assert_eq!(pe.os_compatible(), "cafe\u{0301} 10:30"); // NFC → NFD
//!
//! #[cfg(not(any(target_os = "windows", target_vendor = "apple")))]
//! assert_eq!(pe.os_compatible(), pe.normalized()); // unchanged
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! Equality is based on the normalized form, so different originals can compare equal:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // NFD (e + combining acute) and NFC (é) normalize to the same form
//! let a = PathElementCS::new("cafe\u{0301}.txt")?;
//! let b = PathElementCS::new("caf\u{00E9}.txt")?;
//! assert_eq!(a, b);
//! assert_ne!(a.original(), b.original());
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The typed variants implement [`Hash`](core::hash::Hash), so they work in
//! both hash-based and ordered collections:
//!
//! ```
//! # use std::collections::{BTreeSet, HashSet};
//! # use normalized_path::PathElementCI;
//! // Turkish İ, dotless ı, ASCII I, and ASCII i all normalize to the same CI form
//! let names = ["\u{0130}.txt", "\u{0131}.txt", "I.txt", "i.txt"];
//! let set: HashSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(set.len(), 1);
//!
//! let tree: BTreeSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(tree.len(), 1);
//! ```
//!
//! The runtime-dynamic [`PathElement`] works in ordered collections too, but
//! comparing or ordering elements with **different** case sensitivities will panic:
//!
//! ```
//! # use std::collections::BTreeSet;
//! # use normalized_path::{PathElement, CaseSensitive};
//! let names = ["README.md", "readme.md", "Readme.MD"];
//! let tree: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseSensitive).unwrap())
//!     .collect();
//! assert_eq!(tree.len(), 3); // case-sensitive: all distinct
//! ```
//!
//! # Unicode version
//!
//! All Unicode operations (NFC, NFD, case folding, property lookups) use
//! **Unicode 17.0.0**. The Unicode version is considered part of the crate's
//! stability contract: it will only be updated in a semver-breaking release to
//! ensure that normalization results are consistent across all compatible versions.
//!
//! # `no_std` support
//!
//! This crate supports `no_std` environments. Disable the default `std` feature:
//!
//! ```toml
//! [dependencies]
//! normalized-path = { version = "...", default-features = false }
//! ```
//!
//! The `std` feature enables `from_os_str` constructors and
//! `os_str`/`into_os_str` accessors. The `alloc` crate is always required.

#![cfg_attr(not(feature = "std"), no_std)]
#![cfg_attr(docsrs, feature(doc_cfg))]
#![warn(clippy::all, clippy::pedantic)]

extern crate alloc;

mod case_sensitivity;
mod error;
mod normalize;
mod os;
mod path_element;
mod unicode;
mod utils;

pub use case_sensitivity::{CaseInsensitive, CaseSensitive, CaseSensitivity};
pub use error::{Error, ErrorKind, Result};
pub use path_element::{PathElement, PathElementCI, PathElementCS, PathElementGeneric};

#[cfg(any(feature = "__test", test))]
pub mod test_helpers {
    pub use crate::error::ResultKind;
    pub use crate::normalize::{
        is_whitespace_like, map_control_chars, map_fullwidth, map_turkish_i,
        normalize_ci_from_normalized_cs, normalize_cs, trim_whitespace_like, validate_path_element,
    };
    pub use crate::os::{
        apple_compatible_from_normalized_cs, apple_compatible_from_normalized_cs_fallback,
        is_reserved_on_windows, windows_compatible_from_normalized_cs,
    };
    pub use crate::unicode::{case_fold, is_starter, is_whitespace, nfc, nfd};
}
280        apple_compatible_from_normalized_cs, apple_compatible_from_normalized_cs_fallback,
281        is_reserved_on_windows, windows_compatible_from_normalized_cs,
282    };
283    pub use crate::unicode::{case_fold, is_starter, is_whitespace, nfc, nfd};
284}