Module icu::properties::sets
The functions in this module return a CodePointSetData containing the set of characters with a particular Unicode property.
The descriptions of most properties are taken from TR44, the documentation for the Unicode Character Database. Some properties are instead defined in TR18, the documentation for Unicode regular expressions. In particular, Annex C of this document defines properties for POSIX compatibility.
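To make "the set of characters with a particular Unicode property" concrete, here is a minimal sketch using only the standard library. It stores a set as sorted inclusive ranges and answers membership queries with a binary search. `RangeSet`, `ascii_hex_digit_sketch`, and `regional_indicator_sketch` are hypothetical names chosen for illustration; they are not this crate's types, and this is not a description of how the crate stores its data. The two example sets use ranges fixed by the Unicode Standard: the ASCII hex digits and the regional indicators U+1F1E6..U+1F1FF.

```rust
use std::cmp::Ordering;

/// Hypothetical illustration type: a set of code points stored as
/// sorted, non-overlapping, inclusive ranges.
pub struct RangeSet {
    ranges: Vec<(u32, u32)>,
}

impl RangeSet {
    /// Builds a set from inclusive `(start, end)` ranges (assumes
    /// `start <= end`), sorting and merging them so that lookups can
    /// rely on binary search.
    pub fn from_ranges(mut ranges: Vec<(u32, u32)>) -> Self {
        ranges.sort_unstable();
        let mut merged: Vec<(u32, u32)> = Vec::with_capacity(ranges.len());
        for (start, end) in ranges {
            match merged.last_mut() {
                // Overlapping or directly adjacent: extend the previous range.
                Some(last) if start <= last.1.saturating_add(1) => {
                    last.1 = last.1.max(end);
                }
                _ => merged.push((start, end)),
            }
        }
        RangeSet { ranges: merged }
    }

    /// Returns true if `c` falls inside one of the stored ranges.
    pub fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        self.ranges
            .binary_search_by(|&(start, end)| {
                if cp < start {
                    Ordering::Greater
                } else if cp > end {
                    Ordering::Less
                } else {
                    Ordering::Equal
                }
            })
            .is_ok()
    }
}

/// ASCII hex digits: 0-9, A-F, a-f.
pub fn ascii_hex_digit_sketch() -> RangeSet {
    RangeSet::from_ranges(vec![(0x30, 0x39), (0x41, 0x46), (0x61, 0x66)])
}

/// Regional indicator characters, U+1F1E6..U+1F1FF.
pub fn regional_indicator_sketch() -> RangeSet {
    RangeSet::from_ranges(vec![(0x1F1E6, 0x1F1FF)])
}
```

A caller would build a set once and then query it repeatedly, for example `ascii_hex_digit_sketch().contains('f')`. The real sets in this module cover far more ranges, so they are loaded from data rather than written by hand.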
Structs
A wrapper around code point set data. It is returned by APIs that return Unicode property data in a set-like form, e.g. a set of code points sharing the same value for a Unicode property. Access its data via the borrowed version, CodePointSetDataBorrowed.
A borrowed wrapper around code point set data, returned by CodePointSetData::as_borrowed(). More efficient to query.
A wrapper around UnicodeSet data (characters and strings)
A borrowed wrapper around code point set data, returned by UnicodeSetData::as_borrowed(). More efficient to query.
Functions
Characters with the Alphabetic or Decimal_Number property
This is defined for POSIX compatibility.
Alphabetic characters
ASCII characters commonly used for the representation of hexadecimal numbers
Characters and character sequences intended for general-purpose, independent, direct input. See Unicode Technical Standard #51 for more details.
Format control characters which have specific functions in the Unicode Bidirectional Algorithm
Characters that are mirrored in bidirectional text
Horizontal whitespace characters
Characters which are ignored for casing purposes
Characters that are either the source of a case mapping or in the target of a case
mapping
Uppercase, lowercase, and titlecase characters
Characters whose normalized forms are not stable under case folding
Characters which may change when they undergo case mapping
Characters whose normalized forms are not stable under a toLowercase mapping
Characters which are not identical to their NFKC_Casefold mapping
Characters whose normalized forms are not stable under a toTitlecase mapping
Characters whose normalized forms are not stable under a toUppercase mapping
Punctuation characters explicitly called out as dashes in the Unicode Standard, plus
their compatibility equivalents
For programmatic determination of default ignorable code points. New characters that
should be ignored in rendering (unless explicitly supported) will be assigned in these
ranges, permitting programs to correctly handle the default rendering of such
characters when not otherwise supported.
Deprecated characters. No characters will ever be removed from the standard, but the
usage of deprecated characters is strongly discouraged.
Characters that linguistically modify the meaning of another character to which they apply
Characters that are emoji
Characters used in emoji sequences that normally do not appear on emoji keyboards as
separate choices, such as base characters for emoji keycaps
Characters that are emoji modifiers
Characters that can serve as a base for emoji modifiers
Characters that have emoji presentation by default
Pictographic symbols, as well as reserved ranges in blocks largely associated with
emoji characters
Characters whose principal function is to extend the value of a preceding alphabetic
character or to extend the shape of adjacent characters.
Return a CodePointSetData for a value or a grouping of values of the General_Category property. See GeneralCategoryGroup.
Characters that are excluded from composition
See https://unicode.org/Public/UNIDATA/CompositionExclusions.txt
Visible characters.
This is defined for POSIX compatibility.
Property used together with the definition of Standard Korean Syllable Block to define
“Grapheme base”. See D58 in Chapter 3, Conformance in the Unicode Standard.
Property used to define “Grapheme extender”. See D59 in Chapter 3, Conformance in the
Unicode Standard.
Deprecated property. Formerly proposed for programmatic determination of grapheme
cluster boundaries.
Characters commonly used for the representation of hexadecimal numbers, plus their
compatibility equivalents
Deprecated property. Dashes which are used to mark connections between pieces of
words, plus the Katakana middle dot.
Characters that can come after the first character in an identifier. If using NFKC to fold differences between characters, use load_xid_continue instead. See Unicode Standard Annex #31 for more details.
Characters that can begin an identifier. If using NFKC to fold differences between characters, use load_xid_start instead. See Unicode Standard Annex #31 for more details.
Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs, or related siniform ideographs
Characters used in Ideographic Description Sequences
Characters used in Ideographic Description Sequences
Format control characters which have specific functions for control of cursive joining
and ligation
A small number of spacing vowel letters occurring in certain Southeast Asian scripts such as Thai and Lao
Lowercase characters
Characters used in mathematical notation
Characters that are inert under NFC, i.e., they do not interact with adjacent characters
Characters that are inert under NFD, i.e., they do not interact with adjacent characters
Characters that are inert under NFKC, i.e., they do not interact with adjacent characters
Characters that are inert under NFKD, i.e., they do not interact with adjacent characters
Code points permanently reserved for internal use
Characters used as syntax in patterns (such as regular expressions). See Unicode Standard Annex #31 for more details.
Characters used as whitespace in patterns (such as regular expressions). See Unicode Standard Annex #31 for more details.
A small class of visible format controls, which precede and then span a sequence of other characters, usually digits.
Printable characters (visible characters and whitespace).
This is defined for POSIX compatibility.
Punctuation characters that function as quotation marks.
Characters used in the definition of Ideographic Description Sequences
Regional indicator characters, U+1F1E6..U+1F1FF
Characters that are starters in terms of Unicode normalization and combining character
sequences
Punctuation characters that generally mark the end of sentences
Characters with a “soft dot”, like i or j. An accent placed on these characters causes
the dot to disappear.
Punctuation characters that generally mark the end of textual units
A property which specifies the exact set of Unified CJK Ideographs in the standard
Uppercase characters
Characters that are Variation Selectors.
Spaces, separator characters and other control characters which should be treated by
programming languages as “white space” for the purpose of parsing elements
Hexadecimal digits
This is defined for POSIX compatibility.
Characters that can come after the first character in an identifier. See Unicode Standard Annex #31 for more details.
Characters that can begin an identifier. See Unicode Standard Annex #31 for more details.
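The "can begin an identifier" and "can come after the first character" properties are meant to be used together: an identifier is one start character followed by zero or more continue characters. The sketch below shows that check with the two properties passed in as predicates. `is_identifier`, `ascii_start`, and `ascii_continue` are hypothetical names for illustration, and the two ASCII-only predicates are stand-ins, assumed here so the example needs nothing beyond the standard library.

```rust
/// Returns true if `s` is non-empty, its first character satisfies
/// `is_start`, and every later character satisfies `is_continue`.
pub fn is_identifier<S, C>(s: &str, is_start: S, is_continue: C) -> bool
where
    S: Fn(char) -> bool,
    C: Fn(char) -> bool,
{
    let mut chars = s.chars();
    match chars.next() {
        // The first character must be a start character; the rest are
        // checked against the continue predicate.
        Some(first) if is_start(first) => chars.all(is_continue),
        // Empty string, or the first character cannot begin an identifier.
        _ => false,
    }
}

/// ASCII-only stand-in for a "can begin an identifier" predicate.
pub fn ascii_start(c: char) -> bool {
    c.is_ascii_alphabetic()
}

/// ASCII-only stand-in for a "can continue an identifier" predicate.
pub fn ascii_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}
```

For full Unicode support, the predicates would instead be membership queries on the corresponding start and continue sets, using the XID variants when text is normalized with NFKC.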