//! Converts Unicode strings to ones containing only characters from the [POSIX Portable File Name Character
//! Set (Wikipedia)](https://en.wikipedia.org/wiki/Portable_character_set#Portable_filename_character_set).
//! Other characters are converted to the closest ASCII representation using
//! [deunicode (docs.rs)](https://docs.rs/deunicode/latest/deunicode/) where possible and removed otherwise,
//! and delimiters are automatically inserted where necessary using an algorithm described further down. The
//! converted strings may then be used as user-facing filenames or keys in systems where portability is
//! required.
//!
//! There are primarily two APIs:
//! - `posix_string::convert`: performs a conversion from Unicode to POSIX portable characters.
//! - `posix_string::convert_filename`: performs the same conversion as above and additionally enforces a
//! maximum length of 255 characters while attempting to leave the extension unchanged.
//!
//! Each API also has a non-allocating `*_iter` variant, which takes a `char` iterator as input and produces a
//! `char` iterator as output.
//!
//! ## Examples
//! ```
//! assert_eq!(posix_string::convert("Horsey 🦄🦄"), "horsey_unicorn_unicorn");
//! assert_eq!(posix_string::convert("Næstved"), "naestved");
//! assert_eq!(posix_string::convert("晒后假日"), "shai_hou_jia_ri");
//! assert_eq!(posix_string::convert("Београд - Добановци"), "beograd-dobanovtsi");
//! assert_eq!(posix_string::convert(" 🌵 . 🌵 Prickly/delimiters 🌵!"), "cactus.cactus_prickly_delimiters_cactus");
//!
//! assert_eq!(posix_string::convert_filename("Güneş Sonrası", b"json"), "gunes_sonrasi.json");
//! assert_eq!(posix_string::convert_filename("😃-My filename-😃", b".toml"), "smiley-my_filename-smiley.toml");
//! ```
//!
//! ## Delimiter insertion algorithm
//! The goal is to insert a delimiting `_` before and after a conversion like `😃 → smiley` to ensure
//! that e.g. `晒后假日` gets converted as `shai_hou_jia_ri` and not `shaihoujiari`. However, this can't be
//! done unconditionally, since the input symbol may already be adjacent to a delimiter or a string terminal;
//! e.g., we would otherwise get conversions like `😃.😃 → _smiley_._smiley_`. Instead of inserting a
//! delimiter directly, we therefore insert a special _marker_ character, which indicates "a delimiter is
//! needed here". A marker is then reified as a delimiting `_` if both of the following conditions are met:
//! - The next character is not a delimiter, string terminal, or marker.
//! - The previous non-marker character is not a delimiter or string terminal.
//!
//! The "non-marker" clauses above ensure that multiple sequential markers get reified as at most one
//! delimiter.
//!
//! Markers are inserted around a conversion if one of the following conditions is met:
//! - There was no viable conversion. If so, we assume that the input character was non-alphabetic and is
//! therefore best represented as a delimiter.
//! - The conversion has length > 1 and was from a non-alphabetic input character. This ensures that we're
//! not adding markers around e.g. `ä` in `aäa → aaa` or `æ` in `aæa → aaea`.
//!
//! Additionally, an input character is wholly replaced with a marker if it is an ASCII symbol not among the
//! allowed ones (`._-`). We do this instead of directly replacing it with an allowed delimiter since it
//! ensures that multiple sequential non-allowed symbols are replaced with at most one delimiter. E.g.,
//! `a!"#b` gets converted as `a_b` and not `a___b`.
//!
//! Note that the same delimiter requirements could be met more simply by building an intermediate buffer
//! and filtering out superfluous delimiters, but that would require dynamic allocation.
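The two reification conditions from the delimiter insertion algorithm can be illustrated with a small standalone sketch. It is not the crate's implementation: for brevity it works on a slice and allocates a `String`, whereas the crate avoids intermediate buffers. The names `Token`, `is_delimiter`, `reify`, and `parse` are hypothetical and exist only for this illustration.

```rust
/// An element of the intermediate stream: an output character or a
/// "delimiter needed here" marker. Hypothetical type, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Token {
    Char(char),
    Marker,
}

/// The allowed delimiters named in the documentation: `.`, `_` and `-`.
fn is_delimiter(c: char) -> bool {
    matches!(c, '.' | '_' | '-')
}

/// Turns each marker into `_` only when both documented conditions hold.
/// Assumption: allocating a `String` here is a simplification.
fn reify(tokens: &[Token]) -> String {
    let mut out = String::new();
    // Previous non-marker character; `None` stands for the start of the string.
    let mut prev: Option<char> = None;
    for (i, tok) in tokens.iter().enumerate() {
        match *tok {
            Token::Char(c) => {
                out.push(c);
                prev = Some(c);
            }
            Token::Marker => {
                // Condition 1: the next token is not a delimiter, the end of
                // the string, or another marker.
                let next_ok =
                    matches!(tokens.get(i + 1), Some(Token::Char(c)) if !is_delimiter(*c));
                // Condition 2: the previous non-marker character is not a
                // delimiter or the start of the string.
                let prev_ok = matches!(prev, Some(c) if !is_delimiter(c));
                if next_ok && prev_ok {
                    out.push('_');
                }
            }
        }
    }
    out
}

/// Helper for trying the sketch: `|` in the input denotes a marker.
fn parse(s: &str) -> Vec<Token> {
    s.chars()
        .map(|c| if c == '|' { Token::Marker } else { Token::Char(c) })
        .collect()
}
```

With this sketch, `reify(&parse("|smiley|.|smiley|"))` gives `smiley.smiley`, and `reify(&parse("a|||b"))` gives `a_b`, since only the last marker in a run has a non-marker character following it.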
/// Converts Unicode characters to POSIX portable characters inside an iterator.
///
/// See the [crate-level documentation](self) for more information.
///
/// The output may be longer or shorter than the input. No internal allocations are performed.
///
/// # Examples
/// ```
/// assert!(posix_string::convert_iter("Horsey 🦄🦄".chars()).eq("horsey_unicorn_unicorn".chars()));
/// assert!(posix_string::convert_iter("Београд - Добановци".chars()).eq("beograd-dobanovtsi".chars()));
/// ```
/// Converts Unicode characters to POSIX portable characters inside a string.
///
/// See the [crate-level documentation](self) for more information.
///
/// The output may be longer or shorter than the input. The only internal allocation is the output string.
/// See [`convert_iter`] for a completely non-allocating variant.
///
/// # Examples
/// ```
/// assert_eq!(posix_string::convert("Horsey 🦄🦄"), "horsey_unicorn_unicorn");
/// assert_eq!(posix_string::convert("Næstved"), "naestved");
/// assert_eq!(posix_string::convert("晒后假日"), "shai_hou_jia_ri");
/// assert_eq!(posix_string::convert("Београд - Добановци"), "beograd-dobanovtsi");
/// assert_eq!(posix_string::convert(" 🌵 . 🌵 Prickly/delimiters 🌵!"), "cactus.cactus_prickly_delimiters_cactus");
/// ```
/// Converts Unicode characters to POSIX portable characters inside a filename iterator, and appends an
/// extension while ensuring that the output does not exceed 255 characters.
///
/// See the [crate-level documentation](self) for more information on the conversion.
///
/// If characters must be removed, they are removed from the filename first such that the extension remains
/// unchanged. If the extension is ≥ 255 characters long, characters are removed from the extension instead.
/// The extension is assumed to be composed of POSIX portable characters, and a leading `.` is automatically
/// inserted if missing.
///
/// The output may be longer or shorter than the input. No internal allocations are performed.
///
/// # Examples
/// ```
/// assert_eq!(posix_string::convert_filename("Güneş Sonrası", b"json"), "gunes_sonrasi.json");
/// assert_eq!(posix_string::convert_filename("😃-My filename-😃", b".toml"), "smiley-my_filename-smiley.toml");
/// ```
/// Converts Unicode characters to POSIX portable characters inside a filename, and appends an extension
/// while ensuring that the output does not exceed 255 characters.
///
/// See the [crate-level documentation](self) for more information on the conversion.
///
/// If characters must be removed, they are removed from the filename first such that the extension remains
/// unchanged. If the extension is ≥ 255 characters long, characters are removed from the extension instead.
/// The extension is assumed to be composed of POSIX portable characters, and a leading `.` is automatically
/// inserted if missing.
///
/// The output may be longer or shorter than the input. The only internal allocation is the output string.
/// See [`convert_filename_iter`] for a completely non-allocating variant.
///
/// # Examples
/// ```
/// assert_eq!(posix_string::convert_filename("Güneş Sonrası", b"json"), "gunes_sonrasi.json");
/// assert_eq!(posix_string::convert_filename("😃-My filename-😃", b".toml"), "smiley-my_filename-smiley.toml");
/// ```
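The length rule described for filenames can be sketched as follows. This is a guess at the behaviour under stated assumptions, not the crate's code: the stem is assumed to be already converted, overlong input is cut at the end, an empty extension gets no dot, and the result is allocated as a `String`. The names `MAX_LEN` and `append_extension` are hypothetical.

```rust
/// Maximum output length in characters, as stated in the documentation.
const MAX_LEN: usize = 255;

/// Joins an already-converted stem with an extension, keeping the result
/// within `MAX_LEN` characters. Hypothetical helper, for illustration only.
fn append_extension(stem: &str, extension: &[u8]) -> String {
    // A leading `.` is inserted if missing. Assumption: an empty extension
    // is left empty rather than turned into a lone dot.
    let needs_dot = !extension.is_empty() && extension[0] != b'.';
    let ext_len = extension.len() + usize::from(needs_dot);

    // The extension is kept whole unless it alone fills the limit, in which
    // case it is cut instead. Assumption: it is cut at its end.
    let ext_kept = ext_len.min(MAX_LEN);
    // The stem receives whatever room is left.
    let stem_kept = MAX_LEN - ext_kept;

    let mut out = String::new();
    out.extend(stem.chars().take(stem_kept));
    let ext_chars = needs_dot
        .then_some('.')
        .into_iter()
        .chain(extension.iter().map(|&b| b as char));
    out.extend(ext_chars.take(ext_kept));
    out
}
```

For example, `append_extension("gunes_sonrasi", b"json")` gives `gunes_sonrasi.json`, while a 300-character stem with `b".toml"` is cut to 250 characters so that the result is exactly 255 characters long and still ends in `.toml`.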