Crate normalized_path

Opinionated cross-platform, optionally case-insensitive path normalization.

This crate provides PathElementCS (case-sensitive), PathElementCI (case-insensitive), and PathElement (runtime-selected) – types that take a raw path element name, validate it, normalize it to a canonical form, and compute an OS-compatible presentation form.

§Design goals and non-goals

Goals:

The normalization procedure is identical on every platform – the same input always produces the same normalized bytes regardless of the host OS.
If any supported OS considers two names equivalent (e.g. NFC vs NFD on macOS), they must normalize to the same value.
The normalized form is always in NFC (Unicode Normalization Form C), the most widely used and compact canonical form.
Normalization is idempotent: normalizing an already-normalized name always produces the same name unchanged.
The OS-compatible form of a name, when normalized again, produces the same normalized value as the original input (round-trip stability).
Every valid name is representable on every supported OS. Characters that would be rejected or silently altered (Windows forbidden characters, C0 controls) are mapped to visually similar safe alternatives.
If the OS automatically transforms a name (e.g. NFC↔NFD conversion, truncation at null bytes), normalizing the transformed name produces the same result as normalizing the original.
In case-insensitive mode, names differing only in case normalize identically, with correct handling of edge cases like Turkish dotted/dotless I.

Non-goals:

Not every name that a particular OS accepts is considered valid. Non-UTF-8 byte sequences, names that normalize to empty (e.g. whitespace-only), and names that normalize to . or .. (e.g. " .. ") are always rejected.
A name taken directly from the OS may produce a different OS-compatible form after normalization. For example, a file named " hello.txt" (leading space) will have the space trimmed, so its OS-compatible form is "hello.txt".
The OS-compatible form is not guaranteed to be accepted by the OS. For example, it may exceed the OS’s path element length limit, or, on Apple platforms, the filesystem may require names in Unicode Stream-Safe Text Format, which the OS-compatible form does not enforce.
Windows 8.3 short file names (e.g. PROGRA~1) are not handled.
Visually similar names are not necessarily considered equal. For example, a regular space (U+0020) and a non-breaking space (U+00A0) produce different normalized forms despite looking identical, and the ligature ﬁ (U+FB01) is distinct from the two-character sequence fi.
Fullwidth and ASCII variants of the same character (e.g. Ａ vs A) are deliberately normalized to the same form. Users who need to distinguish them cannot use this crate.
In case-insensitive mode, Turkish İ (U+0130), dotless ı (U+0131), and ASCII I/i are all deliberately normalized to the same form. Users who need to distinguish them cannot use case-insensitive mode.
Path separators and multi-component paths are not handled. This crate operates on a single path element (one name between separators). Support for full paths may be added in a future version.
Android versions before 6 (API level 23) are not supported. Earlier versions used Java Modified UTF-8 for filesystem paths, encoding supplementary characters as CESU-8 surrogate pairs.
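
To make the last point concrete, the following sketch encodes a supplementary character the CESU-8 way (each UTF-16 surrogate written as its own three-byte sequence) and shows that the result is not valid UTF-8. The helper cesu8_encode is hypothetical and not part of this crate; it also ignores the Modified UTF-8 special case for U+0000.

```rust
/// Hypothetical helper (not part of this crate): encode one character as
/// CESU-8, i.e. encode each UTF-16 code unit separately as if it were a
/// standalone code point.
fn cesu8_encode(c: char) -> Vec<u8> {
    let mut units = [0u16; 2];
    let mut out = Vec::new();
    for &u in c.encode_utf16(&mut units).iter() {
        match u {
            // One byte for ASCII.
            0x0000..=0x007F => out.push(u as u8),
            // Two bytes up to U+07FF.
            0x0080..=0x07FF => {
                out.push(0xC0 | (u >> 6) as u8);
                out.push(0x80 | (u & 0x3F) as u8);
            }
            // Three bytes for everything else, including surrogates.
            _ => {
                out.push(0xE0 | (u >> 12) as u8);
                out.push(0x80 | ((u >> 6) & 0x3F) as u8);
                out.push(0x80 | (u & 0x3F) as u8);
            }
        }
    }
    out
}
```

For U+1F600 this yields the six bytes ED A0 BD ED B8 80, which std::str::from_utf8 rejects, whereas UTF-8 proper uses the four bytes F0 9F 98 80. Characters in the Basic Multilingual Plane encode identically in both schemes.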

§Normalization pipeline

Every path element name goes through the following steps during construction:

0. Byte decoding (only for from_bytes/from_os_str) – String::from_utf8_lossy() is applied, replacing invalid byte sequences with U+FFFD. Invalid bytes can be encountered on Unix filesystems, which allow arbitrary bytes except / and \0 in names, and on Windows, where filenames are WTF-16 and may contain unpaired surrogates.
1. NFD decomposition – canonical decomposition to reorder combining marks. This is needed because macOS stores filenames in a form close to NFD, so an NFD input and an NFC input must produce the same result. Decomposing first ensures combining marks are in canonical order before subsequent steps.
2. Whitespace trimming – strips leading and trailing characters with the Unicode White_Space property, plus the BOM (U+FEFF) and Control Pictures that correspond to whitespace control characters (U+2409–U+240D: HT, LF, VT, FF, CR). Many applications strip leading/trailing whitespace silently, and macOS automatically strips leading BOMs. Control Pictures are included because they are the mapped form of whitespace control characters (see step 4), so trimming must be consistent before and after mapping.
3. Fullwidth-to-ASCII mapping – maps fullwidth forms (U+FF01–U+FF5E) to their ASCII equivalents (U+0021–U+007E). The Windows OS-compatibility step (see below) maps certain ASCII characters to fullwidth to avoid Windows restrictions. The fullwidth-to-ASCII step ensures that the OS-compatible form normalizes back to the same value.
4. Control character mapping – maps C0 controls (U+0001–U+001F) and DEL (U+007F) to their Unicode Control Picture equivalents (U+2401–U+241F, U+2421). Control characters are invisible, can break terminals and tools, and some OSes reject or silently drop them. Mapping to visible Control Pictures preserves the information while making the name safe. (Null bytes are excluded; see step 5.)
5. Validation – rejects empty strings, ., .., names containing /, and names containing null bytes (\0). These are special on every supported OS and cannot be used as regular names.
6. NFC composition – canonical composition to produce the shortest equivalent form.
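
The std-only sketch below mirrors the byte decoding, whitespace trimming, fullwidth-to-ASCII mapping, control character mapping, and validation steps described above. It deliberately omits NFD decomposition and NFC composition, which need Unicode tables that the standard library does not provide. All function names are hypothetical and are not this crate's API.

```rust
/// Map one fullwidth form (U+FF01..=U+FF5E) to its ASCII equivalent.
fn fullwidth_to_ascii(c: char) -> char {
    match c {
        '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
        _ => c,
    }
}

/// Map C0 controls (U+0001..=U+001F) and DEL to Control Pictures.
/// U+0000 is left alone so that validation can reject it.
fn control_to_picture(c: char) -> char {
    match c {
        '\u{0001}'..='\u{001F}' => char::from_u32(0x2400 + c as u32).unwrap_or(c),
        '\u{007F}' => '\u{2421}',
        _ => c,
    }
}

/// Characters stripped from both ends: White_Space, the BOM, and the
/// Control Pictures for HT, LF, VT, FF, CR.
fn is_trimmed(c: char) -> bool {
    c.is_whitespace() || c == '\u{FEFF}' || matches!(c, '\u{2409}'..='\u{240D}')
}

/// Simplified pipeline (assumption: no NFD/NFC). Returns None for names
/// that fail validation.
fn sketch_normalize(raw: &[u8]) -> Option<String> {
    // Byte decoding: invalid sequences become U+FFFD.
    let decoded = String::from_utf8_lossy(raw);
    // Whitespace trimming.
    let trimmed = decoded.trim_matches(is_trimmed);
    // Fullwidth-to-ASCII mapping, then control character mapping.
    let mapped: String = trimmed
        .chars()
        .map(fullwidth_to_ascii)
        .map(control_to_picture)
        .collect();
    // Validation.
    let invalid = mapped.is_empty()
        || mapped == "."
        || mapped == ".."
        || mapped.contains('/')
        || mapped.contains('\0');
    if invalid { None } else { Some(mapped) }
}
```

Under these assumptions, sketch_normalize(b"  hello.txt  ") returns Some("hello.txt"), the input " .. " is rejected, and feeding any accepted output back into the function returns it unchanged.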

In case-insensitive mode, four additional steps are applied after the above:

7. NFD decomposition (again, on the NFC result). Steps 7, 8, and 10 implement the Unicode canonical caseless matching algorithm (Definition D145): “A string X is a canonical caseless match for a string Y if and only if: NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))”. Step 9 extends this with a post-case-fold fixup for Turkish/Azerbaijani and Lithuanian casing.
8. Unicode toCasefold() – locale-independent full case folding.
9. Post-case-fold fixup – maps U+0131 (ı) to ASCII i, and strips U+0307 COMBINING DOT ABOVE after any Soft_Dotted character (e.g. i, j, Cyrillic і/ј), blocked by intervening starters or CCC=230 Above combiners (matching the Unicode After_Soft_Dotted condition). This neutralizes two locale-specific casing inconsistencies that toCasefold() alone misses:
   - Turkish/Azerbaijani: toCasefold() treats ı as distinct from i (ı folds to itself), yet toUppercase(ı) = I even without locale tailoring, and I folds back to i – creating a collision.
   - Lithuanian: lowercase adds U+0307 after capital I/J/Į when more accents are above, and upper/titlecase removes U+0307 after soft-dotted characters; stripping it ensures stability under Lithuanian casing.
10. NFC composition (final) – recompose after case folding to produce the canonical NFC output.
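
The post-case-fold fixup can be illustrated with a much-simplified, std-only sketch. It makes two loud assumptions: str::to_lowercase() stands in for toCasefold() (the two differ, for example on ß), and U+0307 is stripped only when it immediately follows a soft-dotted character, because the standard library exposes no combining-class data. The function names are hypothetical.

```rust
/// Simplified fixup: map dotless ı to i and drop a COMBINING DOT ABOVE
/// that directly follows i, j, Cyrillic і or ј. Any other character is
/// treated as blocking, which is stricter than After_Soft_Dotted.
fn fixup_sketch(folded: &str) -> String {
    let mut out = String::with_capacity(folded.len());
    let mut after_soft_dotted = false;
    for c in folded.chars() {
        match c {
            '\u{0131}' => {
                out.push('i');
                // Treat the result like a plain i so a second pass is a no-op.
                after_soft_dotted = true;
            }
            '\u{0307}' if after_soft_dotted => {
                // Dropped; a further U+0307 would be kept.
                after_soft_dotted = false;
            }
            _ => {
                after_soft_dotted = matches!(c, 'i' | 'j' | '\u{0456}' | '\u{0458}');
                out.push(c);
            }
        }
    }
    out
}

/// Stand-in for "case fold, then fix up" (assumption: lowercase ~ fold).
fn fold_sketch(name: &str) -> String {
    fixup_sketch(&name.to_lowercase())
}
```

With this sketch, "\u{0130}.txt", "\u{0131}.txt", "I.txt", and "i.txt" all come out as "i.txt". The collision the fixup guards against is also visible in the standard library alone: "\u{0131}".to_uppercase() is "I", which lowercases to "i".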

§OS compatibility mapping

Each PathElementGeneric also computes an OS-compatible form suitable for use as an actual path element name on the host operating system. It is derived from the case-sensitive normalized form by applying the following platform-specific steps:

Windows: the characters and patterns listed in the Windows naming conventions are handled by mapping them to visually similar fullwidth Unicode equivalents: forbidden characters (< > : " \ | ? *), the final trailing dot, and the first character of reserved device names (CON, PRN, AUX, NUL, COM0–COM9, LPT0–LPT9, and their superscript-digit variants).
Apple (macOS/iOS): converted using CFStringGetFileSystemRepresentation as recommended by Apple’s documentation (produces a representation similar to NFD).
Other platforms: the OS-compatible form is identical to the case-sensitive normalized form.
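
As a rough illustration of the Windows mapping, the sketch below shifts forbidden characters, a final trailing dot, and the first letter of a reserved device name into the fullwidth block by adding 0xFEE0. It is not the crate's implementation: the function names are hypothetical, superscript-digit variants are ignored, and the rule "the part before the first dot decides whether a name is reserved" is an assumption of this sketch.

```rust
/// Shift a printable ASCII character to its fullwidth form.
fn to_fullwidth(c: char) -> char {
    match c {
        '!'..='~' => char::from_u32(c as u32 + 0xFEE0).unwrap_or(c),
        _ => c,
    }
}

/// Inverse mapping, as performed by the fullwidth-to-ASCII normalization step.
fn from_fullwidth(c: char) -> char {
    match c {
        '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
        _ => c,
    }
}

/// Assumption: only the part before the first dot is checked.
fn is_reserved_device_name(name: &str) -> bool {
    let stem = name.split('.').next().unwrap_or("").to_ascii_uppercase();
    match stem.as_str() {
        "CON" | "PRN" | "AUX" | "NUL" => true,
        _ => {
            stem.len() == 4
                && (stem.starts_with("COM") || stem.starts_with("LPT"))
                && stem.as_bytes()[3].is_ascii_digit()
        }
    }
}

fn windows_compatible_sketch(normalized: &str) -> String {
    // Forbidden characters become their fullwidth lookalikes.
    let mut out: String = normalized
        .chars()
        .map(|c| match c {
            '<' | '>' | ':' | '"' | '\\' | '|' | '?' | '*' => to_fullwidth(c),
            other => other,
        })
        .collect();
    // Only the final trailing dot is replaced.
    if out.ends_with('.') {
        out.pop();
        out.push('\u{FF0E}');
    }
    // Reserved device names get a fullwidth first letter.
    if is_reserved_device_name(normalized) {
        let first = out.remove(0);
        out.insert(0, to_fullwidth(first));
    }
    out
}
```

For example, windows_compatible_sketch("caf\u{00E9} 10:30") gives "caf\u{00E9} 10\u{FF1A}30", and mapping every character of the result through from_fullwidth restores the input, which is the round-trip property the normalization pipeline relies on.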

§Types

The core type is PathElementGeneric<'a, S>, parameterized by a case-sensitivity marker S:

PathElementCS = PathElementGeneric<'a, CaseSensitive> – compile-time case-sensitive path element.
PathElementCI = PathElementGeneric<'a, CaseInsensitive> – compile-time case-insensitive path element.
PathElement = PathElementGeneric<'a, CaseSensitivity> – runtime-selected case sensitivity via the CaseSensitivity enum.

Use the typed aliases (PathElementCS, PathElementCI) when the case sensitivity is known at compile time. These implement Hash, which the runtime-dynamic PathElement does not (since hashing elements with different sensitivities into the same map would violate hash/eq consistency).

The zero-sized marker structs CaseSensitive and CaseInsensitive are used as type parameters, while the CaseSensitivity enum provides the same choice at runtime. All three types implement Into<CaseSensitivity>.
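
The marker-type pattern can be sketched with standalone toy types. Everything below (Mode, SensitiveMarker, InsensitiveMarker, StaticMode, Element) is a hypothetical stand-in rather than this crate's definitions, and str::to_lowercase() is used as a crude substitute for real case folding. The point is only to show how Hash can be offered for the compile-time variants while being withheld from the runtime-selected one.

```rust
use std::hash::{Hash, Hasher};

/// Runtime choice (variant names are assumptions of this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Mode { Sensitive, Insensitive }

/// Zero-sized compile-time markers.
#[derive(Clone, Copy, Debug)]
struct SensitiveMarker;
#[derive(Clone, Copy, Debug)]
struct InsensitiveMarker;

impl From<SensitiveMarker> for Mode {
    fn from(_: SensitiveMarker) -> Self { Mode::Sensitive }
}
impl From<InsensitiveMarker> for Mode {
    fn from(_: InsensitiveMarker) -> Self { Mode::Insensitive }
}

/// Implemented only by the markers, never by `Mode`.
trait StaticMode: Copy + Into<Mode> {}
impl StaticMode for SensitiveMarker {}
impl StaticMode for InsensitiveMarker {}

#[derive(Debug)]
struct Element<S> { normalized: String, mode: S }

impl<S: Copy + Into<Mode>> Element<S> {
    fn new(name: &str, mode: S) -> Self {
        let normalized = match mode.into() {
            Mode::Sensitive => name.to_owned(),
            // Stand-in for real case folding.
            Mode::Insensitive => name.to_lowercase(),
        };
        Element { normalized, mode }
    }
}

impl<S: Copy + Into<Mode>> PartialEq for Element<S> {
    fn eq(&self, other: &Self) -> bool {
        let (a, b): (Mode, Mode) = (self.mode.into(), other.mode.into());
        // Mixed sensitivities have no meaningful equality, so panic.
        assert_eq!(a, b, "cannot compare elements with different case sensitivities");
        self.normalized == other.normalized
    }
}
impl<S: Copy + Into<Mode>> Eq for Element<S> {}

/// Hash exists only when the sensitivity is fixed by the type.
impl<S: StaticMode> Hash for Element<S> {
    fn hash<H: Hasher>(&self, state: &mut H) { self.normalized.hash(state); }
}
```

In this sketch Element<InsensitiveMarker> can be stored in a HashSet, while Element<Mode> cannot, because a single map could otherwise receive values whose equality depends on a mode that the hash does not see.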

§Examples

use normalized_path::{PathElementCI, PathElementCS};

// NFD input (e + combining acute) composes to NFC (é), whitespace is trimmed
let pe = PathElementCS::new("  cafe\u{0301}.txt  ")?;
assert_eq!(pe.original(), "  cafe\u{0301}.txt  ");
assert_eq!(pe.normalized(), "caf\u{00E9}.txt");

// Case-insensitive: German ß case-folds to "ss"
let pe = PathElementCI::new("Stra\u{00DF}e.txt")?;
assert_eq!(pe.original(), "Stra\u{00DF}e.txt");
assert_eq!(pe.normalized(), "strasse.txt");

The OS-compatible form adapts names for the host filesystem. On Windows, forbidden characters and reserved device names are mapped to safe alternatives; on Apple, names are converted to a form close to NFD:

// A name with a Windows-forbidden character and an accented letter
let pe = PathElementCS::new("caf\u{00E9} 10:30")?;
assert_eq!(pe.normalized(), "caf\u{00E9} 10:30");

#[cfg(target_os = "windows")]
assert_eq!(pe.os_compatible(), "caf\u{00E9} 10\u{FF1A}30"); // : → fullwidth ：

#[cfg(target_vendor = "apple")]
assert_eq!(pe.os_compatible(), "cafe\u{0301} 10:30"); // NFC → NFD

#[cfg(not(any(target_os = "windows", target_vendor = "apple")))]
assert_eq!(pe.os_compatible(), pe.normalized()); // unchanged

Equality is based on the normalized form, so different originals can compare equal:

// NFD (e + combining acute) and NFC (é) normalize to the same form
let a = PathElementCS::new("cafe\u{0301}.txt")?;
let b = PathElementCS::new("caf\u{00E9}.txt")?;
assert_eq!(a, b);
assert_ne!(a.original(), b.original());

The typed variants implement both Hash and Ord, so they work in hash-based and ordered collections alike:

use std::collections::{BTreeSet, HashSet};
use normalized_path::PathElementCI;

// Turkish İ, dotless ı, ASCII I, and ASCII i all normalize to the same CI form
let names = ["\u{0130}.txt", "\u{0131}.txt", "I.txt", "i.txt"];
let set: HashSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
assert_eq!(set.len(), 1);

let tree: BTreeSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
assert_eq!(tree.len(), 1);

The runtime-dynamic PathElement works in ordered collections too, but comparing or ordering elements with different case sensitivities will panic:

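use std::collections::BTreeSet;
use normalized_path::{CaseInsensitive, CaseSensitive, PathElement};

// "ss", "SS", "sS", "Ss", sharp s (ß), capital sharp s (ẞ)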
let names = ["ss", "SS", "sS", "Ss", "\u{00DF}", "\u{1E9E}"];

let cs: BTreeSet<_> = names.iter()
    .map(|n| PathElement::new(*n, CaseSensitive).unwrap())
    .collect();
assert_eq!(cs.len(), 6); // case-sensitive: all distinct

let ci: BTreeSet<_> = names.iter()
    .map(|n| PathElement::new(*n, CaseInsensitive).unwrap())
    .collect();
assert_eq!(ci.len(), 1); // case-insensitive: all normalize to "ss"

§Unicode version

All Unicode operations (NFC, NFD, case folding, property lookups) use Unicode 17.0.0. The Unicode version is considered part of the crate’s stability contract: it will only be updated in a semver-breaking release to ensure that normalization results are consistent across all compatible versions.

§no_std support

This crate supports no_std environments. Disable the default std feature:

[dependencies]
normalized-path = { version = "...", default-features = false }

The std feature enables from_os_str constructors and os_str/into_os_str accessors. The alloc crate is always required.

Structs§

CaseInsensitive: Zero-sized type-level marker for case-insensitive normalization.
CaseSensitive: Zero-sized type-level marker for case-sensitive normalization.
Error: An error that occurred during path element normalization or validation.
PathElementGeneric: A validated, normalized single path element.

Enums§

CaseSensitivity: Controls whether path normalization is case-sensitive or case-insensitive.
ErrorKind: The kind of error that occurred during path element normalization or validation.

Type Aliases§

PathElement: A validated, normalized single path element with runtime-selected case sensitivity.
PathElementCI: A validated, normalized, case-insensitive single path element.
PathElementCS: A validated, normalized, case-sensitive single path element.
Result: A Result type alias using this crate’s Error.