//! Opinionated cross-platform, optionally case-insensitive path normalization.
//!
//! This crate provides [`PathElementCS`] (case-sensitive), [`PathElementCI`]
//! (case-insensitive), and [`PathElement`] (runtime-selected) -- types that take a
//! raw path element name, validate it, normalize it to a canonical form, and compute
//! an OS-compatible presentation form.
//!
//! # Design goals and non-goals
//!
//! **Goals:**
//!
//! - The normalization procedure is identical on every platform -- the same input
//! always produces the same normalized bytes regardless of the host OS.
//! - If any supported OS considers two names equivalent (e.g. NFC vs NFD on macOS),
//! they must normalize to the same value.
//! - The normalized form is always in NFC (Unicode Normalization Form C), the
//! most widely used and compact canonical form.
//! - Normalization is idempotent: normalizing an already-normalized name
//! returns it unchanged.
//! - The OS-compatible form of a name, when normalized again, produces the same
//! normalized value as the original input (round-trip stability).
//! - Every valid name is representable on every supported OS. Characters that
//! would be rejected or silently altered (Windows forbidden characters) are
//! mapped to visually similar safe alternatives.
//! - If the OS automatically transforms a name (e.g. NFC↔NFD conversion,
//! truncation at null bytes), normalizing the transformed name produces the
//! same result as normalizing the original.
//! - In case-insensitive mode, names differing only in case normalize identically,
//! including edge cases from Turkish/Azerbaijani and Lithuanian casing rules
//! (see step 8 below).
//!
//! **Non-goals:**
//!
//! - Not every name that a particular OS accepts is considered valid. Non-UTF-8
//! byte sequences, names that normalize to empty (e.g. whitespace-only), and
//! names that normalize to `.` or `..` (e.g. `" .. "`) are always rejected.
//! - A name taken directly from the OS may produce a different OS-compatible form
//! after normalization. For example, a file named `" hello.txt"` (leading space)
//! will have the space trimmed, so its OS-compatible form is `"hello.txt"`.
//! - The OS-compatible form is not guaranteed to be accepted by the OS. For
//! example, it may exceed the OS's path element length limit, or on Apple
//! platforms the filesystem may require names in Unicode Stream-Safe Text Format
//! which the OS-compatible form does not enforce.
//! - Windows 8.3 short file names (e.g. `PROGRA~1`) are not handled.
//! - Visually similar names are not necessarily considered equal. For example,
//! a regular space (U+0020) and a non-breaking space (U+00A0) produce different
//! normalized forms despite looking identical, and the ligature `ﬁ` (U+FB01) is
//! distinct from the two-character sequence `fi`.
//! - Fullwidth and ASCII variants of the same character (e.g. `Ａ` (U+FF21) vs `A`) are
//! deliberately normalized to the same form. Users who need to distinguish
//! them cannot use this crate.
//! - In case-insensitive mode, Turkish İ (U+0130), dotless ı (U+0131), and
//! ASCII I/i are all deliberately normalized to the same form. Users who
//! need to distinguish them cannot use case-insensitive mode.
//! - Invalid UTF-8 byte sequences in `from_bytes`/`from_os_str` are rejected.
//! - Names containing unassigned Unicode code points are rejected. This makes it
//! much more likely that normalization results for accepted names remain stable
//! when upgrading to a future Unicode version (see [Unicode version](#unicode-version)).
//! - Path separators and multi-component paths are not handled. This crate
//! operates on a single path element (one name between separators). Support
//! for full paths may be added in a future version.
//! - Android versions before 6 (API level 23) are not supported. Earlier
//! versions used Java Modified UTF-8 for filesystem paths, encoding
//! supplementary characters as CESU-8 surrogate pairs.
//!
//! # Normalization pipeline
//!
//! Every path element name goes through the following steps during construction:
//!
//! 0. **UTF-8 validation** (only for `from_bytes`/`from_os_str`) --
//! the input must be valid UTF-8; invalid byte sequences are rejected with
//! [`ErrorKind::InvalidUtf8`].
//!
//! 1. **NFD decomposition** -- canonical decomposition to reorder combining marks.
//! This is needed because macOS stores filenames in a form close to NFD, so an
//! NFD input and an NFC input must produce the same result. Decomposing first
//! ensures combining marks are in canonical order before subsequent steps.
//!
//! 2. **Whitespace trimming** -- strips leading and trailing characters with the Unicode
//! `White_Space` property (excluding control characters, which are rejected in step 4).
//! Many applications strip leading/trailing whitespace silently.
//!
//! 3. **Fullwidth-to-ASCII mapping** -- maps fullwidth forms (U+FF01--U+FF5E) to their
//! ASCII equivalents (U+0021--U+007E). The Windows OS-compatibility step (see below)
//! maps certain ASCII characters to fullwidth to avoid Windows restrictions. This
//! step ensures that the OS-compatible form normalizes back to the same value.
//!
//! 4. **Validation** -- rejects empty strings, `.`, `..`, names containing `/`,
//! null bytes (`\0`), characters with the Unicode `Control` general category,
//! BOM (U+FEFF), and unassigned Unicode characters. Empty names, `.`, `..`, `/`,
//! and null bytes are special on every OS and cannot be used as regular names.
//! Control characters are invisible, can break
//! terminals and tools, and some OSes reject or silently drop them. Unassigned
//! characters are rejected to ensure normalization stability across Unicode
//! versions (see [Unicode version](#unicode-version)).
//!
//! 5. **NFC composition** -- canonical composition to produce the shortest equivalent
//! form.
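//!
//! As a rough illustration, the parts of steps 2--4 that need nothing beyond
//! `std` can be sketched as follows. The function names are hypothetical and are
//! not part of this crate's API, and the sketch leaves out the NFD/NFC passes and
//! the unassigned-character check, which need Unicode data tables.
//!
//! ```rust
//! /// Step 2 (sketch): trim `White_Space` characters that are not controls.
//! fn trim_non_control_whitespace(s: &str) -> &str {
//!     s.trim_matches(|c: char| c.is_whitespace() && !c.is_control())
//! }
//!
//! /// Step 3 (sketch): map fullwidth forms U+FF01..=U+FF5E to ASCII U+0021..=U+007E.
//! fn fullwidth_to_ascii(s: &str) -> String {
//!     s.chars()
//!         .map(|c| match c {
//!             // The fullwidth block sits exactly 0xFEE0 above its ASCII range.
//!             '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
//!             _ => c,
//!         })
//!         .collect()
//! }
//!
//! /// Step 4 (sketch, partial): only the checks that `std` can express.
//! /// The unassigned-character check is deliberately omitted.
//! fn is_acceptable(s: &str) -> bool {
//!     !matches!(s, "" | "." | "..")
//!         // `is_control` covers the null byte as well as other Cc characters.
//!         && !s.chars().any(|c| c == '/' || c == '\u{FEFF}' || c.is_control())
//! }
//! ```
//!
//! Because the mapping runs before validation in this sketch, a name made of two
//! fullwidth full stops (U+FF0E U+FF0E) becomes `..` and is rejected.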
//!
//! In **case-insensitive** mode, four additional steps are applied after the above:
//!
//! 6. **NFD decomposition** (again, on the NFC result). Steps 6, 7, and 9
//! implement the Unicode canonical caseless matching algorithm (Definition D145):
//! *"A string X is a canonical caseless match for a string Y if and only if:
//! NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))"*. Step 8 extends this
//! with a post-case-fold fixup for Turkish/Azerbaijani and Lithuanian casing.
//!
//! 7. **Unicode `toCasefold()`** -- locale-independent full case folding.
//!
//! 8. **Post-case-fold fixup** -- maps U+0131 (ı) to ASCII i, and strips
//!    U+0307 COMBINING DOT ABOVE after any `Soft_Dotted` character (e.g. i, j,
//!    Cyrillic і/ј) unless an intervening starter or CCC=230 (Above) combiner
//!    blocks it (matching the Unicode `After_Soft_Dotted` condition).
//!    This neutralizes casing inconsistencies that `toCasefold()` alone misses:
//!    - **Dotless ı (U+0131):** `toCasefold()` treats ı as distinct from i
//!      (ı folds to itself), yet `toUppercase(ı)` = I even without locale
//!      tailoring, and I folds back to i -- creating a collision.
//!    - **Lithuanian casing rules:** when lowercasing I/J/Į with additional accents above,
//!      Lithuanian rules insert U+0307 to retain the visual dot (e.g.
//!      `lt_lowercase("J\u{0301}")` = `j\u{0307}\u{0301}`). Conversely,
//!      Lithuanian upper/titlecase removes U+0307 after soft-dotted characters
//!      (e.g. `lt_uppercase("j\u{0307}")` = `J`). Stripping U+0307 after
//!      soft-dotted characters ensures stability under both directions.
//!
//! 9. **NFC composition** (final) -- recompose after case folding to produce the
//! canonical NFC output.
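//!
//! A simplified sketch of the step 8 fixup, assuming input that has already been
//! through steps 6 and 7. The function names are hypothetical, and the two helper
//! predicates are stand-ins that cover only a handful of characters; a full
//! implementation would consult the complete `Soft_Dotted` property and canonical
//! combining classes. Dropping every consecutive U+0307 is also an assumption of
//! this sketch, chosen so that the function is idempotent.
//!
//! ```rust
//! /// Hypothetical, incomplete stand-in for the Unicode `Soft_Dotted` property.
//! fn is_soft_dotted(c: char) -> bool {
//!     matches!(c, 'i' | 'j' | '\u{0456}' | '\u{0458}')
//! }
//!
//! /// Hypothetical, incomplete list of combining marks whose canonical
//! /// combining class is neither 0 nor 230, so they do not block the rule
//! /// (dot below, ogonek, macron below).
//! fn is_transparent_mark(c: char) -> bool {
//!     matches!(c, '\u{0323}' | '\u{0328}' | '\u{0331}')
//! }
//!
//! /// Sketch of step 8; expects decomposed, case-folded input.
//! fn post_casefold_fixup(folded_nfd: &str) -> String {
//!     let mut out = String::with_capacity(folded_nfd.len());
//!     // True while the latest base character is soft-dotted and nothing
//!     // blocking has been seen since.
//!     let mut after_soft_dotted = false;
//!     for c in folded_nfd.chars() {
//!         // Dotless i is treated as plain ASCII i.
//!         let c = if c == '\u{0131}' { 'i' } else { c };
//!         if c == '\u{0307}' && after_soft_dotted {
//!             continue; // drop the redundant dot above
//!         }
//!         if is_soft_dotted(c) {
//!             after_soft_dotted = true;
//!         } else if !is_transparent_mark(c) {
//!             // Any other character is treated as blocking in this sketch.
//!             after_soft_dotted = false;
//!         }
//!         out.push(c);
//!     }
//!     out
//! }
//! ```
//!
//! With these definitions, `i` followed by U+0307 collapses to `i`, while a
//! U+0307 that follows `j` plus U+0301 (an Above combiner) is left in place.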
//!
//! # OS compatibility mapping
//!
//! Each `PathElementGeneric` also computes an **OS-compatible** form suitable for
//! use as an actual path element name on the host operating system. It is derived
//! from the case-sensitive normalized form by applying the following additional,
//! platform-specific steps:
//!
//! - **Windows**: the characters and patterns listed in the Windows
//! [naming conventions](https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions)
//! are handled by mapping them to visually similar fullwidth Unicode equivalents:
//! forbidden characters (`< > : " \ | ? *`), the final trailing dot, and the first
//! character of reserved device names (CON, PRN, AUX, NUL, COM0--COM9, LPT0--LPT9,
//! and their superscript-digit variants).
//! - **Apple (macOS/iOS)**: converted using [`CFStringGetFileSystemRepresentation`](https://developer.apple.com/documentation/corefoundation/cfstringgetfilesystemrepresentation(_:_:_:))
//! as recommended by Apple's documentation (produces a representation similar to NFD).
//! - **Other platforms**: the OS-compatible form is identical to the case-sensitive
//! normalized form.
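//!
//! The character-level part of the Windows mapping can be sketched as below. This
//! is an illustration under stated assumptions, not the crate's implementation:
//! the function names are hypothetical, each forbidden character is assumed to map
//! to the fullwidth form at its code point plus 0xFEE0 (the inverse of step 3),
//! and reserved device names are not handled at all.
//!
//! ```rust
//! /// Shift an ASCII character into the fullwidth block (U+FF01..=U+FF5E).
//! fn to_fullwidth(c: char) -> char {
//!     char::from_u32(c as u32 + 0xFEE0).unwrap_or(c)
//! }
//!
//! /// Sketch: replace Windows-forbidden characters and a final trailing dot
//! /// with fullwidth look-alikes. Reserved device names are NOT handled here.
//! fn windows_compatible_sketch(normalized: &str) -> String {
//!     let mut out: String = normalized
//!         .chars()
//!         .map(|c| match c {
//!             '<' | '>' | ':' | '"' | '\\' | '|' | '?' | '*' => to_fullwidth(c),
//!             _ => c,
//!         })
//!         .collect();
//!     // Only the last dot is replaced; earlier dots are left as they are.
//!     if out.ends_with('.') {
//!         out.pop();
//!         out.push('\u{FF0E}');
//!     }
//!     out
//! }
//! ```
//!
//! Since step 3 of the pipeline maps U+FF01--U+FF5E back to ASCII, output produced
//! this way would normalize to the same value as the name it was derived from.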
//!
//! # Types
//!
//! The core type is [`PathElementGeneric<'a, S>`], parameterized by a case-sensitivity
//! marker `S`:
//!
//! - [`PathElementCS`] = `PathElementGeneric<'a, CaseSensitive>` -- compile-time
//! case-sensitive path element.
//! - [`PathElementCI`] = `PathElementGeneric<'a, CaseInsensitive>` -- compile-time
//! case-insensitive path element.
//! - [`PathElement`] = `PathElementGeneric<'a, CaseSensitivity>` -- runtime-selected
//! case sensitivity via the [`CaseSensitivity`] enum.
//!
//! Use the typed aliases ([`PathElementCS`], [`PathElementCI`]) when the case sensitivity
//! is known at compile time. These implement [`Hash`](core::hash::Hash), which the
//! runtime-dynamic [`PathElement`] does not (since hashing elements with different
//! sensitivities into the same map would violate hash/eq consistency).
//!
//! The zero-sized marker structs [`CaseSensitive`] and [`CaseInsensitive`] are used as
//! type parameters, while the [`CaseSensitivity`] enum provides the same choice at runtime.
//! All three types implement `Into<CaseSensitivity>`.
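//!
//! The general pattern of zero-sized markers plus a runtime enum can be sketched
//! as follows. Every name here (`SketchSensitive`, `SketchInsensitive`,
//! `SketchSensitivity`, `SketchElement`) is hypothetical and chosen for
//! illustration; the sketch performs no normalization and does not claim to
//! mirror how this crate is implemented.
//!
//! ```rust
//! use core::hash::{Hash, Hasher};
//! use core::marker::PhantomData;
//!
//! // Hypothetical stand-ins for the marker structs and the runtime enum.
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! struct SketchSensitive;
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! struct SketchInsensitive;
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! enum SketchSensitivity {
//!     Sensitive,
//!     Insensitive,
//! }
//!
//! impl From<SketchSensitive> for SketchSensitivity {
//!     fn from(_: SketchSensitive) -> Self {
//!         SketchSensitivity::Sensitive
//!     }
//! }
//!
//! impl From<SketchInsensitive> for SketchSensitivity {
//!     fn from(_: SketchInsensitive) -> Self {
//!         SketchSensitivity::Insensitive
//!     }
//! }
//!
//! // Hypothetical element type: the mode lives in the type parameter `S`.
//! struct SketchElement<S> {
//!     original: String,
//!     normalized: String,
//!     mode: PhantomData<S>,
//! }
//!
//! impl<S> SketchElement<S> {
//!     // The caller supplies the already-normalized form.
//!     fn new(original: &str, normalized: &str) -> Self {
//!         SketchElement {
//!             original: original.to_owned(),
//!             normalized: normalized.to_owned(),
//!             mode: PhantomData,
//!         }
//!     }
//!
//!     fn original(&self) -> &str {
//!         &self.original
//!     }
//! }
//!
//! // Equality and hashing look only at the normalized form.
//! impl<S> PartialEq for SketchElement<S> {
//!     fn eq(&self, other: &Self) -> bool {
//!         self.normalized == other.normalized
//!     }
//! }
//!
//! impl<S> Eq for SketchElement<S> {}
//!
//! impl<S> Hash for SketchElement<S> {
//!     fn hash<H: Hasher>(&self, state: &mut H) {
//!         self.normalized.hash(state);
//!     }
//! }
//! ```
//!
//! Because the mode is part of the type, every element in a
//! `HashSet<SketchElement<S>>` shares one mode, which keeps `Hash` and `Eq`
//! consistent; a single runtime-selected type cannot offer that guarantee.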
//!
//! # Unicode version
//!
//! All Unicode operations (NFC, NFD, case folding, property lookups) use
//! **Unicode 17.0.0**. Updating to a newer Unicode version is not considered a
//! semver-breaking change as long as the normalization pipeline produces identical
//! results for all strings consisting of characters assigned in the previous version.
//! If a new Unicode version were to change normalization results for previously
//! assigned characters, that update would be a semver-breaking change.
//!
//! Such a change is unlikely -- though not formally ruled out -- given the following
//! [Character Encoding Stability Policies](https://www.unicode.org/policies/stability_policy.html):
//! - If a string contains only characters from a given version of Unicode, and it
//! is put into a normalized form in accordance with that version of Unicode, then
//! the results will be identical to the results of putting that string into a
//! normalized form in accordance with any subsequent version of Unicode.
//! - Once a character is assigned, its canonical combining class will not change.
//! - Once a character is encoded, its properties may still be changed, but not in
//! such a way as to change the fundamental identity of the character.
//! - For each string S containing only assigned characters in a given Unicode
//! version, `toCasefold(toNFKC(S))` under that version is identical to
//! `toCasefold(toNFKC(S))` under any later version of Unicode.
//!
//! # `no_std` support
//!
//! This crate supports `no_std` environments. Disable the default `std` feature:
//!
//! ```toml
//! [dependencies]
//! normalized-path = { version = "...", default-features = false }
//! ```
//!
//! The `std` feature enables `from_os_str` constructors and
//! `os_str`/`into_os_str` accessors. The `alloc` crate is always required.
//!
//! # Examples
//!
//! ```
//! # use normalized_path::{PathElementCS, PathElementCI};
//! // NFD input (e + combining acute) composes to NFC (é), whitespace is trimmed
//! let pe = PathElementCS::new(" cafe\u{0301}.txt ")?;
//! assert_eq!(pe.original(), " cafe\u{0301}.txt ");
//! assert_eq!(pe.normalized(), "caf\u{00E9}.txt");
//!
//! // Case-insensitive: German ß case-folds to "ss"
//! let pe = PathElementCI::new("Stra\u{00DF}e.txt")?;
//! assert_eq!(pe.original(), "Stra\u{00DF}e.txt");
//! assert_eq!(pe.normalized(), "strasse.txt");
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The OS-compatible form adapts names for the host filesystem. On Windows,
//! forbidden characters and reserved device names are mapped to safe alternatives;
//! on Apple, names are converted to a form close to NFD:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // A name with a Windows-forbidden character and an accented letter
//! let pe = PathElementCS::new("caf\u{00E9} 10:30")?;
//! assert_eq!(pe.normalized(), "caf\u{00E9} 10:30");
//!
//! #[cfg(target_os = "windows")]
//! assert_eq!(pe.os_compatible(), "caf\u{00E9} 10\u{FF1A}30"); // : → fullwidth ： (U+FF1A)
//!
//! #[cfg(target_vendor = "apple")]
//! assert_eq!(pe.os_compatible(), "cafe\u{0301} 10:30"); // NFC → NFD
//!
//! #[cfg(not(any(target_os = "windows", target_vendor = "apple")))]
//! assert_eq!(pe.os_compatible(), pe.normalized()); // unchanged
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! Equality is based on the normalized form, so different originals can compare equal:
//!
//! ```
//! # use normalized_path::PathElementCS;
//! // NFD (e + combining acute) and NFC (é) normalize to the same form
//! let a = PathElementCS::new("cafe\u{0301}.txt")?;
//! let b = PathElementCS::new("caf\u{00E9}.txt")?;
//! assert_eq!(a, b);
//! assert_ne!(a.original(), b.original());
//! # Ok::<(), normalized_path::Error>(())
//! ```
//!
//! The typed variants implement [`Hash`](core::hash::Hash) as well as [`Ord`],
//! so they work in both hash-based and ordered collections:
//!
//! ```
//! # use std::collections::{BTreeSet, HashSet};
//! # use normalized_path::PathElementCI;
//! // Turkish İ, dotless ı, ASCII I, and ASCII i all normalize to the same CI form
//! let names = ["\u{0130}.txt", "\u{0131}.txt", "I.txt", "i.txt"];
//! let set: HashSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(set.len(), 1);
//!
//! let tree: BTreeSet<_> = names.iter().map(|n| PathElementCI::new(*n).unwrap()).collect();
//! assert_eq!(tree.len(), 1);
//! ```
//!
//! The runtime-dynamic [`PathElement`] works in ordered collections too, but
//! comparing or ordering elements with **different** case sensitivities will panic:
//!
//! ```
//! # use std::collections::BTreeSet;
//! # use normalized_path::{PathElement, CaseSensitive, CaseInsensitive};
//! // "ss", "SS", "sS", "Ss", sharp s (ß), capital sharp s (ẞ)
//! let names = ["ss", "SS", "sS", "Ss", "\u{00DF}", "\u{1E9E}"];
//!
//! let cs: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseSensitive).unwrap())
//!     .collect();
//! assert_eq!(cs.len(), 6); // case-sensitive: all distinct
//!
//! let ci: BTreeSet<_> = names.iter()
//!     .map(|n| PathElement::new(*n, CaseInsensitive).unwrap())
//!     .collect();
//! assert_eq!(ci.len(), 1); // case-insensitive: all normalize to "ss"
//! ```
extern crate alloc;
pub use ;
pub use ;
pub use ;