Module icu::properties
Unicode properties
This API provides definitions of Unicode Properties and functions for retrieving property data in an appropriate data structure.
APIs that return a UnicodeSet exist for binary properties and certain enumerated properties. See the sets module for more details.
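As a rough mental model of set-style property data, the sketch below stores a set of code points as an inversion list, that is, a sorted list of range boundaries. The InversionList type and its from_inclusive_ranges constructor are hypothetical names invented for this illustration and are not part of icu::properties; only the contains and contains_u32 method names are borrowed from the examples on this page.

```rust
/// Hypothetical, simplified stand-in for a set of code points.
/// The set is stored as sorted boundaries [start0, end0, start1, end1, ...],
/// where each start is inclusive and each end is exclusive.
pub struct InversionList {
    boundaries: Vec<u32>,
}

impl InversionList {
    /// Builds the list from inclusive ranges.
    /// Assumption: `ranges` is sorted by start and the ranges do not overlap.
    pub fn from_inclusive_ranges(ranges: &[(u32, u32)]) -> Self {
        let mut boundaries = Vec::with_capacity(ranges.len() * 2);
        for &(start, end) in ranges {
            boundaries.push(start);
            boundaries.push(end + 1); // exclusive end
        }
        InversionList { boundaries }
    }

    /// A code point is in the set when an odd number of boundaries
    /// are less than or equal to it.
    pub fn contains_u32(&self, cp: u32) -> bool {
        let count = self.boundaries.partition_point(|&b| b <= cp);
        count % 2 == 1
    }

    /// Convenience wrapper for `char` input.
    pub fn contains(&self, c: char) -> bool {
        self.contains_u32(c as u32)
    }
}
```

With the illustrative ranges (0x41, 0x5A) and (0x61, 0x7A), contains('A') is true and contains('[') is false. Membership is a single binary search over the boundaries, which is one reason this layout suits large, mostly contiguous sets of code points.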
APIs that return a CodePointTrie exist for certain enumerated properties. See the maps module for more details.
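To illustrate what a map from code points to property values offers, here is a minimal sketch that uses a sorted range table with binary search in place of a real trie. RangeMap, DemoScript, and ranges_for_value are hypothetical names made up for this example, and the ranges in the usage note are illustrative rather than real property data; the actual CodePointTrie is a different, more compact structure.

```rust
/// Hypothetical property values used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DemoScript {
    Common,
    Latin,
    Han,
}

/// Hypothetical stand-in for a code point -> value map.
pub struct RangeMap<T> {
    /// Non-overlapping (start, end_inclusive, value) entries, sorted by start.
    ranges: Vec<(u32, u32, T)>,
    /// Value returned for code points not covered by any range.
    default: T,
}

impl<T: Copy + PartialEq> RangeMap<T> {
    /// Sorts the entries by start. Assumption: the ranges do not overlap.
    pub fn new(mut ranges: Vec<(u32, u32, T)>, default: T) -> Self {
        ranges.sort_by_key(|&(start, _, _)| start);
        RangeMap { ranges, default }
    }

    /// Looks up the value for one code point with a binary search.
    pub fn get(&self, cp: u32) -> T {
        // Index of the first entry whose start is greater than `cp`.
        let idx = self.ranges.partition_point(|&(start, _, _)| start <= cp);
        match idx.checked_sub(1).map(|i| self.ranges[i]) {
            Some((_, end, value)) if cp <= end => value,
            _ => self.default,
        }
    }

    /// Collects the inclusive ranges that map to `value`, which is the
    /// idea behind deriving a set from a map for one property value.
    pub fn ranges_for_value(&self, value: T) -> Vec<(u32, u32)> {
        self.ranges
            .iter()
            .filter(|&&(_, _, v)| v == value)
            .map(|&(start, end, _)| (start, end))
            .collect()
    }
}
```

Given a map built from (0x41, 0x5A, DemoScript::Latin) and (0x4E00, 0x9FFF, DemoScript::Han) with DemoScript::Common as the default, get(0x6728) returns DemoScript::Han and get(0x1F383) falls back to DemoScript::Common.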
Examples
Property data as UnicodeSets
use icu::properties::{maps, sets, GeneralCategory};
let provider = icu_testdata::get_provider();
// A binary property as a `UnicodeSet`
let payload =
    sets::get_emoji(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let emoji = &data_struct.inv_list;
assert!(emoji.contains('🎃')); // U+1F383 JACK-O-LANTERN
assert!(!emoji.contains('木')); // U+6728
// An individual enumerated property value as a `UnicodeSet`
let payload = maps::get_general_category(&provider).expect("The data should be valid");
let data_struct = payload.get();
let gc = &data_struct.code_point_trie;
let line_sep = gc.get_set_for_value(GeneralCategory::LineSeparator);
assert!(line_sep.contains_u32(0x2028));
assert!(!line_sep.contains_u32(0x2029));
Property data as CodePointTries
use icu::properties::{maps, Script};
let provider = icu_testdata::get_provider();
let payload =
    maps::get_script(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let script = &data_struct.code_point_trie;
assert_eq!(script.get('🎃' as u32), Script::Common); // U+1F383 JACK-O-LANTERN
assert_eq!(script.get('木' as u32), Script::Han); // U+6728
Property data for Script and Script_Extensions
use icu::properties::{script, Script};
let provider = icu_testdata::get_provider();
let payload =
    script::get_script_with_extensions(&provider)
        .expect("The data should be valid");
let data_struct = payload.get();
let swe = &data_struct.data;
// get the `Script` property value
assert_eq!(swe.get_script_val(0x0650), Script::Inherited); // U+0650 ARABIC KASRA
assert_eq!(swe.get_script_val(0x0660), Script::Arabic); // U+0660 ARABIC-INDIC DIGIT ZERO
// get the `Script_Extensions` property value
assert_eq!(
    swe.get_script_extensions_val(0x0640) // U+0640 ARABIC TATWEEL
        .iter().collect::<Vec<Script>>(),
    vec![Script::Arabic, Script::Syriac, Script::Mandaic, Script::Manichaean,
         Script::PsalterPahlavi, Script::Adlam, Script::HanifiRohingya, Script::Sogdian,
         Script::OldUyghur]
);
assert_eq!(
    swe.get_script_extensions_val('௫' as u32) // U+0BEB TAMIL DIGIT FIVE
        .iter().collect::<Vec<Script>>(),
    vec![Script::Tamil, Script::Grantha]
);
// check containment of a `Script` value in the `Script_Extensions` value
// U+0650 ARABIC KASRA
assert!(swe.has_script(0x0650, Script::Arabic));
assert!(swe.has_script(0x0650, Script::Syriac));
// get a `UnicodeSet` for when `Script` value is contained in `Script_Extensions` value
let syriac = swe.get_script_extensions_set(Script::Syriac);
assert!(syriac.contains_u32(0x0650)); // ARABIC KASRA
assert!(!syriac.contains_u32(0x0660)); // ARABIC-INDIC DIGIT ZERO
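The relationship between the two properties can be sketched with a small self-contained model: a code point has one Script value, and its Script_Extensions value is an explicit list when one is assigned and otherwise defaults to the single Script value. ScriptWithExt, DemoScript, and demo() below are hypothetical names for illustration only, the two-entry table is demo data, and the hash maps are not how the crate stores this information.

```rust
use std::collections::HashMap;

/// Hypothetical script values used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DemoScript {
    Arabic,
    Syriac,
    Latin,
    Inherited,
    Unknown,
}

/// Hypothetical model holding both properties side by side.
pub struct ScriptWithExt {
    script: HashMap<u32, DemoScript>,
    extensions: HashMap<u32, Vec<DemoScript>>,
}

impl ScriptWithExt {
    /// Tiny demo table: U+0650 ARABIC KASRA and U+0041 LATIN CAPITAL LETTER A.
    pub fn demo() -> Self {
        let mut script = HashMap::new();
        script.insert(0x0650, DemoScript::Inherited);
        script.insert(0x0041, DemoScript::Latin);
        let mut extensions = HashMap::new();
        // Explicit Script_Extensions entry, truncated to two scripts for the demo.
        extensions.insert(0x0650, vec![DemoScript::Arabic, DemoScript::Syriac]);
        ScriptWithExt { script, extensions }
    }

    /// The `Script` value, or `Unknown` if the demo table has no entry.
    pub fn script_val(&self, cp: u32) -> DemoScript {
        self.script.get(&cp).copied().unwrap_or(DemoScript::Unknown)
    }

    /// The `Script_Extensions` value: the explicit list if present,
    /// otherwise a one-element list holding the `Script` value.
    pub fn script_extensions_val(&self, cp: u32) -> Vec<DemoScript> {
        match self.extensions.get(&cp) {
            Some(list) => list.clone(),
            None => vec![self.script_val(cp)],
        }
    }

    /// True when `script` is contained in the `Script_Extensions` value.
    pub fn has_script(&self, cp: u32, script: DemoScript) -> bool {
        self.script_extensions_val(cp).contains(&script)
    }
}
```

In this model, has_script(0x0650, DemoScript::Syriac) is true even though script_val(0x0650) is DemoScript::Inherited, which is the kind of distinction the combined structure is meant to capture.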
Modules
The functions in this module return a CodePointTrie representing, for each code point in the entire range of code points, the property values for a particular Unicode property.
Data provider struct definitions for this ICU4X component.
Data and APIs for supporting both Script and Script_Extensions property values in an efficient structure.
The functions in this module return a UnicodeSet containing the set of characters with a particular Unicode property.
Structs
Enumerated property Bidi_Class.
Property Canonical_Combining_Class. See UAX #15: https://www.unicode.org/reports/tr15/.
Enumerated property East_Asian_Width.
Groupings of multiple General_Category property values.
Enumerated property Grapheme_Cluster_Break.
Enumerated property Line_Break.
Enumerated property Script.
Enumerated property Sentence_Break. See “Default Sentence Boundary Specification” in UAX #29 for the summary of each property value: https://www.unicode.org/reports/tr29/#Default_Sentence_Boundaries.
Enumerated property Word_Break.
Enums
Selection constants for Unicode properties. These constants are used to select one of the Unicode properties. See UProperty in ICU4C.
Enumerated property General_Category.