Crate unicode_names2

Convert between characters and their standard names.

This crate provides two functions for mapping between a char and the name given to it by the Unicode standard (16.0). There are no runtime requirements, so it is usable with only core (this requires enabling the no_std cargo feature). The tables are heavily compressed but still large (500KB), and offer efficient look-ups in both directions: effectively O(1), or more precisely O(length of name).

    println!("☃ is called {:?}", unicode_names2::name('☃')); // SNOWMAN
    println!("{:?} is happy", unicode_names2::character("white smiling face")); // ☺
    // (NB. case insensitivity)

§Macros

The associated unicode_names2_macros crate provides two macros for converting at compile-time, giving named literals similar to Python’s "\N{...}".

  • named_char!(name) takes a single string name and creates a char literal.
  • named!(string) takes a string and replaces any \\N{name} sequences with the character with that name. NB. String escape sequences cannot be customised, so either the extra backslash or a raw string is required.

#![feature(proc_macro_hygiene)]

#[macro_use]
extern crate unicode_names2_macros;

fn main() {
    let x: char = named_char!("snowman");
    assert_eq!(x, '☃');

    let y: &str = named!("foo bar \\N{BLACK STAR} baz qux");
    assert_eq!(y, "foo bar ★ baz qux");

    let y: &str = named!(r"foo bar \N{BLACK STAR} baz qux");
    assert_eq!(y, "foo bar ★ baz qux");
}

§Loose Matching

For name->char retrieval (the character function and macros) this crate uses loose matching, as defined in Unicode Standard Annex #44 [1]. In general, this means case, whitespace and underscore characters are ignored, as well as medial hyphens, which are hyphens (-) that come between two alphanumeric characters [1].

Under this scheme, the query Low_Line will find U+005F LOW LINE, as well as l o w L-I-N-E, lowline, and low\nL-I-N-E, but not low- line. Similarly, tibetan letter -a will find U+0F60 TIBETAN LETTER -A, as well as tibetanletter - a and TIBETAN L_ETTE_R- __a__, but not tibetan letter-a or TIBETAN LETTER A.
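The rule can be illustrated with a small, self-contained sketch. This is not the crate's implementation: loose_key is a hypothetical helper, and it assumes that a hyphen counts as medial only when the characters immediately before and after it in the raw query are ASCII alphanumerics. It also omits the special case that UAX44-LM2 makes for U+1180 HANGUL JUNGSEONG O-E. Two names match loosely when their keys are equal.

```rust
/// Hypothetical helper (not part of unicode_names2): reduce a name to a
/// comparison key by dropping ASCII whitespace, underscores and medial
/// hyphens, and by folding ASCII letters to upper case.
fn loose_key(name: &str) -> String {
    let chars: Vec<char> = name.chars().collect();
    let mut key = String::with_capacity(chars.len());
    for (i, &c) in chars.iter().enumerate() {
        // Whitespace and underscores never contribute to the key.
        if c.is_ascii_whitespace() || c == '_' {
            continue;
        }
        if c == '-' {
            // Assumption: "medial" is judged on the raw input, looking only
            // at the directly adjacent characters.
            let prev_alnum = i > 0 && chars[i - 1].is_ascii_alphanumeric();
            let next_alnum = chars
                .get(i + 1)
                .map_or(false, |n| n.is_ascii_alphanumeric());
            if prev_alnum && next_alnum {
                continue;
            }
        }
        key.push(c.to_ascii_uppercase());
    }
    key
}

fn main() {
    // All of these reduce to the same key as the standard name LOW LINE.
    for query in ["Low_Line", "l o w L-I-N-E", "lowline", "low\nL-I-N-E"] {
        assert_eq!(loose_key(query), loose_key("LOW LINE"));
    }
    // The hyphen here is followed by a space, so it is not medial and is kept.
    assert_ne!(loose_key("low- line"), loose_key("LOW LINE"));

    // The non-medial hyphen in TIBETAN LETTER -A must survive in the query.
    let tibetan = loose_key("TIBETAN LETTER -A");
    assert_eq!(loose_key("tibetanletter - a"), tibetan);
    assert_eq!(loose_key("TIBETAN L_ETTE_R- __a__"), tibetan);
    assert_ne!(loose_key("tibetan letter-a"), tibetan);
    assert_ne!(loose_key("TIBETAN LETTER A"), tibetan);
}
```

Judging hyphens before whitespace is stripped is what separates low- line (hyphen kept) from L-I-N-E (hyphens dropped) in this sketch.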

In the implementation of this crate, ‘whitespace’ is determined by the is_ascii_whitespace method on u8 and char. See its documentation for more info.
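As a quick reference, the standard-library check below shows which characters is_ascii_whitespace accepts. The wrapper name counts_as_whitespace is hypothetical and exists only for illustration; the calls themselves are the standard char and u8 methods.

```rust
/// Hypothetical wrapper around the standard-library predicate, used only to
/// make the checks below read clearly.
fn counts_as_whitespace(c: char) -> bool {
    c.is_ascii_whitespace()
}

fn main() {
    // Space, tab, line feed, form feed and carriage return are accepted.
    for &c in [' ', '\t', '\n', '\u{000C}', '\r'].iter() {
        assert!(counts_as_whitespace(c));
    }

    // Vertical tab, no-break space and em space are whitespace according to
    // char::is_whitespace, but not according to is_ascii_whitespace.
    for &c in ['\u{000B}', '\u{00A0}', '\u{2003}'].iter() {
        assert!(c.is_whitespace());
        assert!(!counts_as_whitespace(c));
    }

    // The u8 method agrees with the char method on ASCII input.
    assert!(b'\t'.is_ascii_whitespace());
    assert!(!0x0Bu8.is_ascii_whitespace());
}
```

One likely consequence, stated here as an assumption rather than tested behaviour of the crate: a query that separates words with a no-break space instead of an ordinary space would not have that character ignored.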


  1. See UAX44-LM2 for precise details. 

Structs§

Name
An iterator over the components of a code point’s name. Notably implements Display.

Functions§

character
Find the character called name, or None if no such character exists.
name
Find the name of c, or None if c has no name.