Module icu::properties::sets


The functions in this module return a CodePointSetData containing the set of characters with a particular Unicode property.

The descriptions of most properties are taken from TR44, the documentation for the Unicode Character Database. Some properties are instead defined in TR18, the documentation for Unicode regular expressions. In particular, Annex C of TR18 defines properties for POSIX compatibility.

Structs

CodePointSetData

A wrapper around code point set data. It is returned by APIs that return Unicode property data in a set-like form, for example a set of code points sharing the same value for a Unicode property. Access its data via the borrowed version, CodePointSetDataBorrowed.

CodePointSetDataBorrowed

A borrowed wrapper around code point set data, returned by CodePointSetData::as_borrowed(). More efficient to query.
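The owned/borrowed split described above can be illustrated with a stdlib-only sketch. `RangeSet` and `RangeSetBorrowed` are hypothetical names that are not part of this crate, and the sorted-range layout is an assumption chosen for illustration, not a description of how CodePointSetData stores its data.

```rust
use std::cmp::Ordering;

/// Hypothetical owned set: sorted, non-overlapping, inclusive code point ranges.
struct RangeSet {
    ranges: Vec<(u32, u32)>,
}

/// Hypothetical borrowed view: a cheap, copyable handle used for queries.
#[derive(Clone, Copy)]
struct RangeSetBorrowed<'a> {
    ranges: &'a [(u32, u32)],
}

impl RangeSet {
    /// Builds a set from inclusive ranges, sorting them by start.
    /// Assumes the caller passes ranges that do not overlap.
    fn new(mut ranges: Vec<(u32, u32)>) -> Self {
        ranges.sort_unstable();
        RangeSet { ranges }
    }

    /// Returns the borrowed view, mirroring the `as_borrowed()` pattern.
    fn as_borrowed(&self) -> RangeSetBorrowed<'_> {
        RangeSetBorrowed { ranges: &self.ranges }
    }
}

impl<'a> RangeSetBorrowed<'a> {
    /// Binary-searches the ranges for the one containing `c`, if any.
    fn contains(self, c: char) -> bool {
        let cp = c as u32;
        self.ranges
            .binary_search_by(|&(start, end)| {
                if cp < start {
                    Ordering::Greater // this range lies after the code point
                } else if cp > end {
                    Ordering::Less // this range lies before the code point
                } else {
                    Ordering::Equal // the code point is inside this range
                }
            })
            .is_ok()
    }
}
```

In this sketch, a caller would build the owned set once, call `as_borrowed()`, and then run every `contains` query against the borrowed view.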

UnicodeSetData

A wrapper around UnicodeSet data (characters and strings)

UnicodeSetDataBorrowed

A borrowed wrapper around UnicodeSet data (characters and strings), returned by UnicodeSetData::as_borrowed(). More efficient to query.

Functions

load_alnum

Characters with the Alphabetic or Decimal_Number property. This is defined for POSIX compatibility.

load_alphabetic

Alphabetic characters

load_ascii_hex_digit

ASCII characters commonly used for the representation of hexadecimal numbers

load_basic_emoji

Characters and character sequences intended for general-purpose, independent, direct input. See Unicode Technical Standard #51 for more details.

load_bidi_control

Format control characters which have specific functions in the Unicode Bidirectional Algorithm

load_bidi_mirrored

Characters that are mirrored in bidirectional text

load_blank

Horizontal whitespace characters

load_case_ignorable

Characters which are ignored for casing purposes

load_case_sensitive

Characters that are either the source of a case mapping or in the target of a case mapping

load_cased

Uppercase, lowercase, and titlecase characters

load_changes_when_casefolded

Characters whose normalized forms are not stable under case folding

load_changes_when_casemapped

Characters which may change when they undergo case mapping

load_changes_when_lowercased

Characters whose normalized forms are not stable under a toLowercase mapping

load_changes_when_nfkc_casefolded

Characters which are not identical to their NFKC_Casefold mapping

load_changes_when_titlecased

Characters whose normalized forms are not stable under a toTitlecase mapping

load_changes_when_uppercased

Characters whose normalized forms are not stable under a toUppercase mapping

load_dash

Punctuation characters explicitly called out as dashes in the Unicode Standard, plus their compatibility equivalents

load_default_ignorable_code_point

For programmatic determination of default ignorable code points. New characters that should be ignored in rendering (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default rendering of such characters when not otherwise supported.

load_deprecated

Deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged.

load_diacritic

Characters that linguistically modify the meaning of another character to which they apply

load_emoji

Characters that are emoji

load_emoji_component

Characters used in emoji sequences that normally do not appear on emoji keyboards as separate choices, such as base characters for emoji keycaps

load_emoji_modifier

Characters that are emoji modifiers

load_emoji_modifier_base

Characters that can serve as a base for emoji modifiers

load_emoji_presentation

Characters that have emoji presentation by default

load_extended_pictographic

Pictographic symbols, as well as reserved ranges in blocks largely associated with emoji characters

load_extender

Characters whose principal function is to extend the value of a preceding alphabetic character or to extend the shape of adjacent characters.

load_for_general_category_group

Return a CodePointSetData for a value or a grouping of values of the General_Category property. See GeneralCategoryGroup.

load_full_composition_exclusion

Characters that are excluded from composition. See https://unicode.org/Public/UNIDATA/CompositionExclusions.txt

load_graph

Visible characters. This is defined for POSIX compatibility.

load_grapheme_base

Property used together with the definition of Standard Korean Syllable Block to define “Grapheme base”. See D58 in Chapter 3, Conformance in the Unicode Standard.

load_grapheme_extend

Property used to define “Grapheme extender”. See D59 in Chapter 3, Conformance in the Unicode Standard.

load_grapheme_link

Deprecated property. Formerly proposed for programmatic determination of grapheme cluster boundaries.

load_hex_digit

Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents
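The difference between ASCII_Hex_Digit and Hex_Digit can be sketched with the standard library alone. The function names below are hypothetical, and the fullwidth ranges are written out by hand for illustration; real code should query the property data, not hard-code ranges.

```rust
/// ASCII_Hex_Digit: the characters 0-9, A-F, and a-f.
fn is_ascii_hex_digit(c: char) -> bool {
    c.is_ascii_hexdigit()
}

/// Hex_Digit: the ASCII set plus its fullwidth compatibility equivalents
/// (fullwidth 0-9, A-F, and a-f).
fn is_hex_digit(c: char) -> bool {
    is_ascii_hex_digit(c)
        || matches!(
            c,
            '\u{FF10}'..='\u{FF19}' | '\u{FF21}'..='\u{FF26}' | '\u{FF41}'..='\u{FF46}'
        )
}
```

For example, U+FF21 (fullwidth A) satisfies `is_hex_digit` but not `is_ascii_hex_digit`.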

load_hyphen

Deprecated property. Dashes which are used to mark connections between pieces of words, plus the Katakana middle dot.

load_id_continue

Characters that can come after the first character in an identifier. If using NFKC to fold differences between characters, use load_xid_continue instead. See Unicode Standard Annex #31 for more details.

load_id_start

Characters that can begin an identifier. If using NFKC to fold differences between characters, use load_xid_start instead. See Unicode Standard Annex #31 for more details.

load_ideographic

Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs, or related siniform ideographs

load_ids_binary_operator

Binary operator characters used in Ideographic Description Sequences

load_ids_trinary_operator

Trinary operator characters used in Ideographic Description Sequences

load_join_control

Format control characters which have specific functions for control of cursive joining and ligation

load_logical_order_exception

A small number of spacing vowel letters occurring in certain Southeast Asian scripts such as Thai and Lao

load_lowercase

Lowercase characters

load_math

Characters used in mathematical notation

load_nfc_inert

Characters that are inert under NFC, i.e., they do not interact with adjacent characters

load_nfd_inert

Characters that are inert under NFD, i.e., they do not interact with adjacent characters

load_nfkc_inert

Characters that are inert under NFKC, i.e., they do not interact with adjacent characters

load_nfkd_inert

Characters that are inert under NFKD, i.e., they do not interact with adjacent characters

load_noncharacter_code_point

Code points permanently reserved for internal use
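As an illustration of what this set contains, the noncharacters are the 32 code points U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The helper below is a hypothetical stdlib-only sketch of that rule, not the crate's implementation.

```rust
/// Returns true for the 66 noncharacter code points:
/// U+FDD0..=U+FDEF, and every code point whose low 16 bits are FFFE or FFFF.
fn is_noncharacter(c: char) -> bool {
    let cp = c as u32;
    (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE
}
```

The mask test works because clearing the lowest bit maps both xxFFFE and xxFFFF to a value ending in FFFE.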

load_pattern_syntax

Characters used as syntax in patterns (such as regular expressions). See Unicode Standard Annex #31 for more details.

load_pattern_white_space

Characters used as whitespace in patterns (such as regular expressions). See Unicode Standard Annex #31 for more details.

load_prepended_concatenation_mark

A small class of visible format controls, which precede and then span a sequence of other characters, usually digits.

load_print

Printable characters (visible characters and whitespace). This is defined for POSIX compatibility.

load_quotation_mark

Punctuation characters that function as quotation marks.

load_radical

Characters used in the definition of Ideographic Description Sequences

load_regional_indicator

Regional indicator characters, U+1F1E6..U+1F1FF
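Because this property is a single contiguous range, membership can be sketched without any property data. Both function names are hypothetical and stand in for a query against the loaded set.

```rust
/// Returns true if `c` lies in U+1F1E6..=U+1F1FF.
fn is_regional_indicator(c: char) -> bool {
    matches!(c, '\u{1F1E6}'..='\u{1F1FF}')
}

/// Maps an ASCII uppercase letter to the regional indicator symbol at the
/// same offset in the range (A maps to U+1F1E6, Z maps to U+1F1FF).
fn regional_indicator_for(letter: char) -> Option<char> {
    if letter.is_ascii_uppercase() {
        char::from_u32(0x1F1E6 + (letter as u32 - 'A' as u32))
    } else {
        None
    }
}
```

The range holds exactly 26 code points, one per letter A through Z, which is why the offset mapping covers it completely.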

load_segment_starter

Characters that are starters in terms of Unicode normalization and combining character sequences

load_sentence_terminal

Punctuation characters that generally mark the end of sentences

load_soft_dotted

Characters with a “soft dot”, like i or j. An accent placed on these characters causes the dot to disappear.

load_terminal_punctuation

Punctuation characters that generally mark the end of textual units

load_unified_ideograph

A property which specifies the exact set of Unified CJK Ideographs in the standard

load_uppercase

Uppercase characters

load_variation_selector

Characters that are Variation Selectors.

load_white_space

Spaces, separator characters and other control characters which should be treated by programming languages as “white space” for the purpose of parsing elements

load_xdigit

Hexadecimal digits. This is defined for POSIX compatibility.

load_xid_continue

Characters that can come after the first character in an identifier. See Unicode Standard Annex #31 for more details.

load_xid_start

Characters that can begin an identifier. See Unicode Standard Annex #31 for more details.