Convert between characters and their standard names.
This crate provides two functions: one mapping a char to the name given by the Unicode standard (16.0), and one mapping a name back to its char. There are no runtime requirements, so this is usable with only core (this requires specifying the no_std cargo feature). The tables are heavily compressed but still large (500KB), and they offer efficient look-ups in both directions: effectively O(1), or more precisely O(length of name).
println!("☃ is called {:?}", unicode_names2::name('☃')); // SNOWMAN
println!("{:?} is happy", unicode_names2::character("white smiling face")); // ☺
// (NB. case insensitivity)
§Macros
The associated unicode_names2_macros crate provides two macros for converting at compile time, giving named literals similar to Python’s "\N{...}".
- named_char!(name) takes a single string name and creates a char literal.
- named!(string) takes a string and replaces any \\N{name} sequences with the character with that name. NB. String escape sequences cannot be customised, so the extra backslash is required unless you use a raw string.
#![feature(proc_macro_hygiene)]
#[macro_use]
extern crate unicode_names2_macros;
fn main() {
    let x: char = named_char!("snowman");
    assert_eq!(x, '☃');

    // Ordinary string literal: the backslash must be doubled.
    let y: &str = named!("foo bar \\N{BLACK STAR} baz qux");
    assert_eq!(y, "foo bar ★ baz qux");

    // Raw string literal: a single backslash is enough.
    let y: &str = named!(r"foo bar \N{BLACK STAR} baz qux");
    assert_eq!(y, "foo bar ★ baz qux");
}
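To make the substitution performed by named! more concrete, here is a minimal runtime sketch using only the standard library. It is not the macro's implementation: the real macro works at compile time against the full name tables, whereas expand_named and toy_lookup below are hypothetical helpers with a two-entry table, and leaving unknown or unterminated sequences untouched is an assumption made for this sketch only.

```rust
/// Hypothetical stand-in for a name -> char lookup; the real crate
/// consults its compressed Unicode name tables instead.
fn toy_lookup(name: &str) -> Option<char> {
    match name.to_ascii_uppercase().as_str() {
        "BLACK STAR" => Some('★'),
        "SNOWMAN" => Some('☃'),
        _ => None,
    }
}

/// Replace every `\N{name}` sequence in `input` with the character that
/// `lookup` returns for `name`. Sequences that are unknown or lack a
/// closing brace are copied through unchanged (an assumption of this sketch).
fn expand_named(input: &str, lookup: impl Fn(&str) -> Option<char>) -> String {
    const OPEN: &str = "\\N{"; // the three characters \ N {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(start) = rest.find(OPEN) {
        // Copy everything before the escape sequence verbatim.
        out.push_str(&rest[..start]);
        let after = &rest[start + OPEN.len()..];

        match after.find('}') {
            Some(end) => {
                match lookup(&after[..end]) {
                    Some(c) => out.push(c),
                    // Unknown name: keep the whole `\N{...}` text.
                    None => out.push_str(&rest[start..start + OPEN.len() + end + 1]),
                }
                rest = &after[end + 1..];
            }
            None => {
                // No closing brace: keep the remainder as-is and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }

    out.push_str(rest);
    out
}

fn main() {
    // A raw string keeps the single backslash, as in the macro example.
    let s = expand_named(r"foo bar \N{BLACK STAR} baz qux", toy_lookup);
    assert_eq!(s, "foo bar ★ baz qux");
    println!("{}", s);
}
```

The sketch allocates a new String at run time, which the compile-time macros avoid by producing a literal directly.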
§Loose Matching
For name->char retrieval (the character function and macros) this crate uses loose matching, as defined in Unicode Standard Annex #44.
In general, this means case, whitespace and underscore characters are ignored, as well as medial hyphens, which are hyphens (-) that come between two alphanumeric characters.
Under this scheme, the query "Low_Line" will find U+005F LOW LINE, as well as "l o w L-I-N-E", "lowline", and "low\nL-I-N-E", but not "low- line".
Similarly, "tibetan letter -a" will find U+0F60 TIBETAN LETTER -A, as well as "tibetanletter - a" and "TIBETAN L_ETTE_R- __a__", but not "tibetan letter-a" or "TIBETAN LETTER A".
In the implementation of this crate, ‘whitespace’ is determined by the is_ascii_whitespace
method on u8
and char
. See its documentation for more info.
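The loose-matching rule described in this section can be sketched as a small normalisation function. This is a simplified, standalone illustration and not the crate's implementation: loose_key and loosely_equal are hypothetical names, the hyphen test only looks at the immediately adjacent characters, and special cases in UAX #44 (such as the exception it makes for U+1180 HANGUL JUNGSEONG O-E) are left out.

```rust
/// Build a comparison key for a name: drop ASCII whitespace and
/// underscores, drop medial hyphens, and lowercase everything else.
/// (Hypothetical helper; simplified reading of the loose-matching rule.)
fn loose_key(name: &str) -> String {
    let chars: Vec<char> = name.chars().collect();
    let mut key = String::with_capacity(chars.len());

    for (i, &c) in chars.iter().enumerate() {
        // Whitespace (per is_ascii_whitespace) and underscores are ignored.
        if c.is_ascii_whitespace() || c == '_' {
            continue;
        }
        if c == '-' {
            // A hyphen is medial when it sits directly between two
            // alphanumeric characters; only those hyphens are ignored.
            let prev_alnum = i > 0 && chars[i - 1].is_ascii_alphanumeric();
            let next_alnum = chars
                .get(i + 1)
                .map_or(false, |n| n.is_ascii_alphanumeric());
            if prev_alnum && next_alnum {
                continue;
            }
        }
        key.push(c.to_ascii_lowercase());
    }
    key
}

/// Two names match loosely when their keys are identical.
fn loosely_equal(a: &str, b: &str) -> bool {
    loose_key(a) == loose_key(b)
}

fn main() {
    // Queries from the text that should find LOW LINE.
    for query in ["Low_Line", "l o w L-I-N-E", "lowline", "low\nL-I-N-E"] {
        assert!(loosely_equal(query, "LOW LINE"));
    }
    // The hyphen here is followed by a space, so it is not medial.
    assert!(!loosely_equal("low- line", "LOW LINE"));

    // Queries that should find TIBETAN LETTER -A.
    for query in ["tibetan letter -a", "tibetanletter - a", "TIBETAN L_ETTE_R- __a__"] {
        assert!(loosely_equal(query, "TIBETAN LETTER -A"));
    }
    // A medial hyphen is dropped, and a missing hyphen is not restored.
    assert!(!loosely_equal("tibetan letter-a", "TIBETAN LETTER -A"));
    assert!(!loosely_equal("TIBETAN LETTER A", "TIBETAN LETTER -A"));
}
```

Comparing precomputed keys keeps the check a plain string equality, at the cost of allocating a String per query.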
Structs§
- Name: An iterator over the components of a code point’s name. Notably implements Display.