Unicode properties

This API provides definitions of Unicode Properties and functions for retrieving property data in an appropriate data structure.

APIs that return a UnicodeSet exist for binary properties and certain enumerated properties. See the sets module for more details.
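The `inv_list` field used in the examples below suggests that set data is held as an inversion list. As a rough illustration of that idea only, the following std-only sketch stores a sorted list of boundaries at which membership flips. The `InversionList` type and its constructor are hypothetical and are not the ICU4X types.

```rust
/// A toy inversion list (hypothetical, not the ICU4X type).
/// `boundaries` is sorted; membership flips at every entry, so the set is
/// the union of the half-open ranges [b[0], b[1]), [b[2], b[3]), ...
pub struct InversionList {
    boundaries: Vec<u32>,
}

impl InversionList {
    /// Builds the list from sorted, non-overlapping, inclusive ranges.
    pub fn from_inclusive_ranges(ranges: &[(u32, u32)]) -> Self {
        let mut boundaries = Vec::with_capacity(ranges.len() * 2);
        for &(start, end) in ranges {
            boundaries.push(start);
            boundaries.push(end + 1); // store the exclusive end
        }
        InversionList { boundaries }
    }

    /// Binary search: count the boundaries that are <= `cp`.
    /// An odd count means `cp` lies inside one of the ranges.
    pub fn contains_u32(&self, cp: u32) -> bool {
        let passed = self.boundaries.partition_point(|&b| b <= cp);
        passed % 2 == 1
    }

    /// Convenience wrapper for `char` input.
    pub fn contains(&self, c: char) -> bool {
        self.contains_u32(c as u32)
    }
}
```

For instance, a list built from `[(0x30, 0x39), (0x41, 0x5A)]` (ASCII digits and uppercase letters, chosen only for illustration) contains `'5'` but not `':'`.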

APIs that return a CodePointTrie exist for certain enumerated properties. See the maps module for more details.
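To give a feel for why a trie-like structure suits code point lookups, here is a simplified two-stage table in std-only Rust. It is a sketch of the general technique, not the actual `CodePointTrie` layout; `TwoStageMap`, `DemoScript`, and the block size are assumptions made for illustration.

```rust
use std::collections::HashMap;

const BLOCK_SHIFT: u32 = 6; // 64 code points per block (arbitrary choice)
const BLOCK_SIZE: usize = 1 << BLOCK_SHIFT;
const BLOCK_MASK: u32 = (1 << BLOCK_SHIFT) - 1;
const MAX_CODE_POINT: u32 = 0x10FFFF;

/// Hypothetical value type standing in for a real property enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum DemoScript {
    Common,
    Han,
}

/// Stage 1 (`index`) maps a block number to an offset into stage 2 (`data`).
/// Identical blocks are stored once, which keeps the table small.
pub struct TwoStageMap {
    index: Vec<usize>,
    data: Vec<DemoScript>,
}

impl TwoStageMap {
    /// Builds the table from inclusive `(start, end, value)` ranges.
    pub fn from_ranges(default: DemoScript, ranges: &[(u32, u32, DemoScript)]) -> Self {
        let block_count = (MAX_CODE_POINT >> BLOCK_SHIFT) + 1;
        let mut index: Vec<usize> = Vec::with_capacity(block_count as usize);
        let mut data: Vec<DemoScript> = Vec::new();
        let mut seen: HashMap<Vec<DemoScript>, usize> = HashMap::new();

        for block in 0..block_count {
            let base = block << BLOCK_SHIFT;
            let mut values = vec![default; BLOCK_SIZE];
            for &(start, end, value) in ranges {
                // Fill the part of this block that the range overlaps, if any.
                let lo = start.max(base);
                let hi = end.min(base + BLOCK_MASK);
                for cp in lo..=hi {
                    values[(cp - base) as usize] = value;
                }
            }
            // Reuse an identical block if one was already stored.
            let offset = match seen.get(&values) {
                Some(&offset) => offset,
                None => {
                    let offset = data.len();
                    data.extend_from_slice(&values);
                    seen.insert(values, offset);
                    offset
                }
            };
            index.push(offset);
        }
        TwoStageMap { index, data }
    }

    /// Two array reads per lookup; `None` for values above U+10FFFF.
    pub fn get(&self, cp: u32) -> Option<DemoScript> {
        let offset = *self.index.get((cp >> BLOCK_SHIFT) as usize)?;
        Some(self.data[offset + (cp & BLOCK_MASK) as usize])
    }

    /// Number of values actually stored in stage 2.
    pub fn stored_values(&self) -> usize {
        self.data.len()
    }
}
```

With a single illustrative range `(0x4E00, 0x9FFF, DemoScript::Han)` and a default of `DemoScript::Common`, only two distinct blocks are stored, yet every code point up to U+10FFFF can be looked up.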

Examples

Property data as UnicodeSets

use icu::properties::{maps, sets, GeneralCategory};

let provider = icu_testdata::get_provider();

// A binary property as a `UnicodeSet`

let payload =
    sets::get_emoji(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let emoji = &data_struct.inv_list;

assert!(emoji.contains('🎃'));  // U+1F383 JACK-O-LANTERN
assert!(!emoji.contains('木'));  // U+6728

// An individual enumerated property value as a `UnicodeSet`

let payload = maps::get_general_category(&provider).expect("The data should be valid");
let data_struct = payload.get();
let gc = &data_struct.code_point_trie;
let line_sep = gc.get_set_for_value(GeneralCategory::LineSeparator);

assert!(line_sep.contains_u32(0x2028));
assert!(!line_sep.contains_u32(0x2029));
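Conceptually, deriving a set for one value of an enumerated property amounts to collecting every code point that the map assigns that value. The std-only sketch below shows one naive way to do this by scanning the whole code point range and coalescing runs. `ranges_for_value` is a hypothetical helper, and this is not necessarily how `get_set_for_value` works internally.

```rust
/// Scans U+0000..=U+10FFFF and returns the inclusive ranges of code points
/// for which `lookup` yields `target`. Hypothetical helper for illustration.
pub fn ranges_for_value<V: PartialEq>(
    lookup: impl Fn(u32) -> V,
    target: V,
) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut run_start: Option<u32> = None;

    for cp in 0..=0x10FFFFu32 {
        let hit = lookup(cp) == target;
        match (hit, run_start) {
            // A new run of matching code points begins here.
            (true, None) => run_start = Some(cp),
            // The current run ended at the previous code point.
            (false, Some(start)) => {
                ranges.push((start, cp - 1));
                run_start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the last code point.
    if let Some(start) = run_start {
        ranges.push((start, 0x10FFFF));
    }
    ranges
}
```

The resulting ranges are already sorted and non-overlapping, so they can be fed directly into a range-based set representation.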

Property data as CodePointTries

use icu::properties::{maps, Script};

let provider = icu_testdata::get_provider();

let payload =
    maps::get_script(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let script = &data_struct.code_point_trie;

assert_eq!(script.get('🎃' as u32), Script::Common);  // U+1F383 JACK-O-LANTERN
assert_eq!(script.get('木' as u32), Script::Han);  // U+6728

Property data for Script and Script_Extensions

use icu::properties::{script, Script};

let provider = icu_testdata::get_provider();

let payload =
    script::get_script_with_extensions(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let swe = &data_struct.data;

// get the `Script` property value
assert_eq!(swe.get_script_val(0x0650), Script::Inherited); // U+0650 ARABIC KASRA
assert_eq!(swe.get_script_val(0x0660), Script::Arabic); // U+0660 ARABIC-INDIC DIGIT ZERO

// get the `Script_Extensions` property value
assert_eq!(
    swe.get_script_extensions_val(0x0640) // U+0640 ARABIC TATWEEL
        .iter().collect::<Vec<Script>>(),
    vec![Script::Arabic, Script::Syriac, Script::Mandaic, Script::Manichaean,
         Script::PsalterPahlavi, Script::Adlam, Script::HanifiRohingya, Script::Sogdian,
         Script::OldUyghur]
);
assert_eq!(
    swe.get_script_extensions_val('௫' as u32) // U+0BEB TAMIL DIGIT FIVE
        .iter().collect::<Vec<Script>>(),
    vec![Script::Tamil, Script::Grantha]
);

// check containment of a `Script` value in the `Script_Extensions` value
// U+0650 ARABIC KASRA
assert!(swe.has_script(0x0650, Script::Arabic));
assert!(swe.has_script(0x0650, Script::Syriac));

// get a `UnicodeSet` of all code points whose `Script_Extensions` value contains the given `Script`
let syriac = swe.get_script_extensions_set(Script::Syriac);
assert!(syriac.contains_u32(0x0650)); // ARABIC KASRA
assert!(!syriac.contains_u32(0x0660)); // ARABIC-INDIC DIGIT ZERO
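The relationship between `Script` and `Script_Extensions` shown above can be modelled with a small std-only sketch. Everything here is hypothetical: `ToyScript` and `ToyScriptData` are not ICU4X types, the method names merely mirror those in the example, and the sample data holds only the facts asserted above. The fallback to the `Script` value when no explicit extensions are recorded is a modelling assumption based on the default described in UAX #24.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a few `Script` values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ToyScript {
    Unknown,
    Inherited,
    Arabic,
    Syriac,
}

/// Toy storage: one map per property (not an efficient structure).
#[derive(Default)]
pub struct ToyScriptData {
    script: HashMap<u32, ToyScript>,
    extensions: HashMap<u32, Vec<ToyScript>>,
}

impl ToyScriptData {
    pub fn set_script(&mut self, cp: u32, script: ToyScript) {
        self.script.insert(cp, script);
    }

    pub fn set_extensions(&mut self, cp: u32, scripts: &[ToyScript]) {
        self.extensions.insert(cp, scripts.to_vec());
    }

    /// The `Script` value, or `Unknown` if nothing is recorded.
    pub fn get_script_val(&self, cp: u32) -> ToyScript {
        self.script.get(&cp).copied().unwrap_or(ToyScript::Unknown)
    }

    /// The explicit extensions if present; otherwise just the `Script`
    /// value (assumed default, see the text above this block).
    pub fn get_script_extensions_val(&self, cp: u32) -> Vec<ToyScript> {
        match self.extensions.get(&cp) {
            Some(list) => list.clone(),
            None => vec![self.get_script_val(cp)],
        }
    }

    /// True if `script` appears in the extensions value for `cp`.
    pub fn has_script(&self, cp: u32, script: ToyScript) -> bool {
        self.get_script_extensions_val(cp).contains(&script)
    }
}

/// Sample data limited to what the example above asserts.
pub fn sample_data() -> ToyScriptData {
    let mut data = ToyScriptData::default();
    data.set_script(0x0650, ToyScript::Inherited); // ARABIC KASRA
    data.set_extensions(0x0650, &[ToyScript::Arabic, ToyScript::Syriac]);
    data.set_script(0x0660, ToyScript::Arabic); // ARABIC-INDIC DIGIT ZERO
    data
}
```

In this model, U+0650 has the `Script` value `Inherited`, while `has_script` answers from the extensions value, so it reports `Arabic` and `Syriac` but not `Inherited`.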

Modules

The functions in the maps module return a CodePointTrie that maps every code point to its value for a particular Unicode property.

Data provider struct definitions for this ICU4X component.

Data and APIs for supporting both Script and Script_Extensions property values in an efficient structure.

The functions in the sets module return a UnicodeSet containing all characters that have a particular Unicode property.

Structs

Enumerated property Bidi_Class.

Property Canonical_Combining_Class. See UAX #15: https://www.unicode.org/reports/tr15/.

Enumerated property East_Asian_Width.

Groupings of multiple General_Category property values.

Enumerated property Grapheme_Cluster_Break.

Enumerated property Line_Break.

Enumerated property Script.

Enumerated property Sentence_Break. See “Default Sentence Boundary Specification” in UAX #29 for a summary of each property value: https://www.unicode.org/reports/tr29/#Default_Sentence_Boundaries.

Enumerated property Word_Break.

Enums

Selection constants for Unicode properties. Each constant identifies one Unicode property. See UProperty in ICU4C.

Enumerated property General_Category.