Converts Unicode strings to ones containing only characters from the POSIX Portable File Name Character Set (Wikipedia). Other characters are converted to the closest ASCII representation using deunicode (docs.rs) where possible and removed otherwise, and delimiters are automatically inserted where necessary using an algorithm described further down. The converted strings may then be used as user-facing filenames or keys in systems where portability is required.
There are primarily two APIs:
- posix_string::convert: performs a conversion from Unicode to POSIX portable characters.
- posix_string::convert_filename: performs the same conversion as above and additionally enforces a maximum length of 255 while attempting to leave the extension unchanged.
Each API also has a non-allocating *_iter variant, which takes a char iterator as input and produces a char
iterator as output.
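The length limit enforced by convert_filename can be pictured with the following minimal sketch. It is not the crate's implementation: fit_filename is a hypothetical helper, the length is counted in chars, and the sketch assumes the extension itself is shorter than the limit.

```rust
/// Hypothetical helper (not part of posix_string): shortens an already
/// converted stem so that `stem + "." + extension` fits within `max_len`
/// chars, leaving the extension untouched.
/// Assumption: the extension is shorter than `max_len`.
fn fit_filename(stem: &str, extension: &str, max_len: usize) -> String {
    // Accept the extension with or without a leading dot.
    let ext = extension.trim_start_matches('.');
    // Room taken by the dot and the extension, if there is one.
    let suffix_len = if ext.is_empty() { 0 } else { ext.chars().count() + 1 };
    // Whatever remains is the budget for the stem.
    let budget = max_len.saturating_sub(suffix_len);
    let mut out: String = stem.chars().take(budget).collect();
    if !ext.is_empty() {
        out.push('.');
        out.push_str(ext);
    }
    out
}
```

The point of the sketch is that only the stem is shortened, so a long name still ends in its original extension.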
§Examples
assert_eq!(posix_string::convert("Horsey 🦄🦄"), "horsey_unicorn_unicorn");
assert_eq!(posix_string::convert("Næstved"), "naestved");
assert_eq!(posix_string::convert("晒后假日"), "shai_hou_jia_ri");
assert_eq!(posix_string::convert("Београд - Добановци"), "beograd-dobanovtsi");
assert_eq!(posix_string::convert(" 🌵 . 🌵 Prickly/delimiters 🌵!"), "cactus.cactus_prickly_delimiters_cactus");
assert_eq!(posix_string::convert_filename("Güneş Sonrası", b"json"), "gunes_sonrasi.json");
assert_eq!(posix_string::convert_filename("😃-My filename-😃", b".toml"), "smiley-my_filename-smiley.toml");
§Delimiter insertion algorithm
The goal is to insert a delimiting _ before and after a conversion like 😃 → smiley to ensure
that e.g. 晒后假日 gets converted as shai_hou_jia_ri and not shaihoujiari. However, this can’t be
done unconditionally, since the input symbol may already be adjacent to a delimiter or a string terminal
(the start or end of the string); e.g., we would otherwise get conversions like 😃.😃 → _smiley__smiley_.
Instead of inserting a delimiter directly, we therefore insert a special marker character, which indicates
“a delimiter is needed here”. A marker is then reified as a delimiting _ if both of the following conditions are met:
- The next character is not a delimiter, string terminal, or marker.
- The previous non-marker character is not a delimiter or string terminal.
The “non-marker” clauses above ensure that multiple sequential markers get reified as at most one delimiter.
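The two reification conditions can be sketched as below. This is only an illustration under stated assumptions: Token, is_delimiter, and reify are hypothetical names, the delimiters are taken to be the allowed symbols ._-, and the sketch collects into a String for clarity, whereas the crate describes its own approach as non-allocating.

```rust
/// Hypothetical token stream: either an output character or a marker.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Token {
    Char(char),
    Marker,
}

/// Assumption: the delimiters are the allowed symbols `.`, `_` and `-`.
fn is_delimiter(c: char) -> bool {
    matches!(c, '.' | '_' | '-')
}

/// Turns each marker into `_` only when both conditions hold.
fn reify(tokens: &[Token]) -> String {
    let mut out = String::new();
    // Previous non-marker character; `None` means start of string.
    let mut prev: Option<char> = None;
    for (i, tok) in tokens.iter().enumerate() {
        match *tok {
            Token::Char(c) => {
                out.push(c);
                prev = Some(c);
            }
            Token::Marker => {
                // Condition 1: the next token is a character that is not a
                // delimiter (so not a marker and not the end of the string).
                let next_ok = matches!(
                    tokens.get(i + 1),
                    Some(Token::Char(c)) if !is_delimiter(*c)
                );
                // Condition 2: the previous non-marker character exists
                // and is not a delimiter.
                let prev_ok = matches!(prev, Some(c) if !is_delimiter(c));
                if next_ok && prev_ok {
                    out.push('_');
                }
            }
        }
    }
    out
}
```

With this sketch, a run of markers between a and b yields a_b, and markers next to a delimiter or at either end of the string produce nothing.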
Markers are inserted around a conversion if one of the following conditions is met:
- There was no viable conversion. If so, we assume that the input character was non-alphabetic and is therefore best represented as a delimiter.
- The conversion has length > 1 and was from a non-alphabetic input character. This ensures that we’re
not adding markers around e.g. ä in aäa → aaa or æ in aæa → aaea.
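The decision of whether to wrap a conversion in markers can be sketched as a small predicate. The name needs_markers is hypothetical, and using char::is_alphabetic as the test for "alphabetic" is an assumption; the crate may classify characters differently.

```rust
/// Hypothetical predicate (not part of posix_string): should the
/// conversion of `input` be surrounded by markers?
/// `conversion` is `None` when no viable ASCII conversion exists.
fn needs_markers(input: char, conversion: Option<&str>) -> bool {
    match conversion {
        // No viable conversion: treat the character as a delimiter.
        None => true,
        // Multi-character conversion of a non-alphabetic character.
        // Assumption: `char::is_alphabetic` stands in for "alphabetic".
        Some(s) => s.chars().count() > 1 && !input.is_alphabetic(),
    }
}
```

Under these assumptions, 😃 → smiley is wrapped in markers, while ä → a and æ → ae are not.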
Additionally, an input character is wholly replaced with a marker if it is an ASCII symbol not among the
allowed ones (._-). We do this instead of directly replacing it with an allowed delimiter since it
ensures that multiple sequential non-allowed symbols are replaced with at most one delimiter. E.g.,
a!"#b gets converted as a_b and not a___b.
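A single-pass sketch of this replacement is shown below. It is a simplified illustration, not the crate's code: replace_symbols and its helpers are hypothetical, "ASCII symbol" is assumed to mean any ASCII character that is not alphanumeric (so spaces count too), non-ASCII input is passed through unchanged, and the result is collected into a String for readability.

```rust
/// Assumption: the allowed delimiters are `.`, `_` and `-`.
fn is_allowed_delimiter(c: char) -> bool {
    matches!(c, '.' | '_' | '-')
}

/// Assumption: any other non-alphanumeric ASCII character becomes a marker.
fn becomes_marker(c: char) -> bool {
    c.is_ascii() && !c.is_ascii_alphanumeric() && !is_allowed_delimiter(c)
}

/// Hypothetical helper: replaces runs of non-allowed ASCII symbols with
/// at most one `_`, and with nothing next to a delimiter or string end.
fn replace_symbols(input: &str) -> String {
    let mut out = String::new();
    // Previous non-marker character; `None` means start of string.
    let mut prev: Option<char> = None;
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if becomes_marker(c) {
            // Reify only if the next character is a regular character...
            let next_ok = matches!(
                chars.peek(),
                Some(&n) if !becomes_marker(n) && !is_allowed_delimiter(n)
            );
            // ...and the previous non-marker character was one as well.
            let prev_ok = matches!(prev, Some(p) if !is_allowed_delimiter(p));
            if next_ok && prev_ok {
                out.push('_');
            }
        } else {
            out.push(c);
            prev = Some(c);
        }
    }
    out
}
```

Because only the last marker in a run sees a regular character next to it, a run of non-allowed symbols collapses into one delimiter.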
Note that there are simpler ways of ensuring the same delimiter requirements by creating an intermediate buffer and filtering superfluous delimiters, but this would require dynamic allocations.
Functions§
- convert
- Converts Unicode characters to POSIX portable characters inside a string.
- convert_filename
- Converts Unicode characters to POSIX portable characters inside a filename, and appends an extension while ensuring that the output does not exceed 255 characters.
- convert_filename_iter
- Converts Unicode characters to POSIX portable characters inside a filename iterator, and appends an extension while ensuring that the output does not exceed 255 characters.
- convert_iter
- Converts Unicode characters to POSIX portable characters inside an iterator.