An implementation of the Grapheme Cluster portion of UAX #29: Unicode Text Segmentation that prioritizes streaming-friendliness and simplicity.
This library implements the segmentation algorithm as of Unicode 16.0.0, using the character database tables from that release.
GraphemeMachine is the main type in this library. Construct an object of that type and then feed it characters from a stream one at a time, and in return it will tell you for each new character whether it should be treated as an extension of the current grapheme cluster or the beginning of a new one. That’s all there is to it!
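The feed-one-character, get-one-decision loop described above can be sketched as follows. This is only an illustration of the calling pattern, not this crate's API: `ToyMachine`, `ToyAction`, `next_char`, and `split_clusters` are hypothetical names, and the toy machine knows just a handful of UAX #29 rules (GB3, GB4, GB5, GB9, and the GB999 default) over a tiny set of characters.

```rust
/// Hypothetical stand-in for the decision a machine returns per character.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyAction {
    Extend,
    Begin,
}

/// The few character categories this toy machine distinguishes.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Kind {
    Cr,
    Lf,
    Extend,
    Other,
}

fn classify(c: char) -> Kind {
    match c {
        '\r' => Kind::Cr,
        '\n' => Kind::Lf,
        // Combining Diacritical Marks block and ZERO WIDTH JOINER only.
        '\u{0300}'..='\u{036F}' | '\u{200D}' => Kind::Extend,
        _ => Kind::Other,
    }
}

/// Toy state machine: the only state is the category of the previous character.
#[derive(Default)]
struct ToyMachine {
    prev: Option<Kind>,
}

impl ToyMachine {
    fn next_char(&mut self, c: char) -> ToyAction {
        let kind = classify(c);
        let action = match (self.prev, kind) {
            (None, _) => ToyAction::Begin,                      // start of text
            (Some(Kind::Cr), Kind::Lf) => ToyAction::Extend,    // GB3: CR x LF
            (Some(Kind::Cr | Kind::Lf), _) => ToyAction::Begin, // GB4: break after controls
            (Some(_), Kind::Extend) => ToyAction::Extend,       // GB9: no break before Extend/ZWJ
            _ => ToyAction::Begin,                              // GB5 and GB999
        };
        self.prev = Some(kind);
        action
    }
}

/// Caller-side buffering: the machine itself never stores any text.
fn split_clusters(text: &str) -> Vec<String> {
    let mut machine = ToyMachine::default();
    let mut clusters = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if machine.next_char(c) == ToyAction::Begin && !current.is_empty() {
            clusters.push(std::mem::take(&mut current));
        }
        current.push(c);
    }
    if !current.is_empty() {
        clusters.push(current);
    }
    clusters
}
```

With this sketch, `split_clusters("e\u{0301}a\r\nb")` yields four clusters: the `e` plus its combining accent, `a`, the CR LF pair, and `b`. The real GraphemeMachine implements the full rule set; consult its own documentation for the actual method names.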
The canonical Rust library for UAX #29 is unicode_segmentation, and so that’s actually probably what you should use in most cases. This library has the following main distinctions (as of unicode_segmentation v1.12.0):
- The primary entry point for grapheme clusters in unicode_segmentation is Graphemes, which expects the entire text to be in memory as a single buffer.

  The library also offers GraphemeCursor for working with non-contiguous buffers, but it has a rather challenging API and is difficult to use in a completely streaming manner, with the caller required to sometimes provide earlier context to help it make a decision.

  By contrast, GraphemeMachine in this library is a finite state machine that is advanced one character at a time, with no requirement for the caller to do any buffering at all. Of course in practice it’s likely that a normal caller will need to at least buffer the current grapheme cluster so it can be used once finally split, but how to manage that is left entirely up to the caller.

  For example, a caller could decide that it only cares about grapheme clusters up to some reasonable maximum length, after which it will just assume malicious or corrupt input and use the Unicode replacement character instead. The GraphemeMachine can still allow that caller to find the end of that overlong grapheme cluster and begin consuming the next one even though the caller is no longer including any new characters into its buffer.
- unicode_segmentation finds the relevant Unicode character properties for incoming characters using binary search over its internal tables, after converting the character into a Rust char value.

  GraphemeMachine instead prefers to work with UTF-8 encoded characters as represented by u8char, which can be more cheaply extracted from and appended to Rust strings. The character property lookup is done using a trie based on the UTF-8 byte sequence, and so is potentially faster when you’re chomping UTF-8 sequences from a str buffer one at a time.

  (That’s not necessarily true, though. Measure it yourself with the text you want to segment if performance is important to you!)
- Although GraphemeMachine can work with char and u8char values representing specific characters, the segmentation algorithm is actually defined in terms of groups of characters that share similar properties.

  This library exposes those categories as part of its public API using CharProperties, GCBProperty, and InCBProperty, and so it could be useful purely as a character property lookup library even if you don’t use GraphemeMachine, or you could even choose to use your own tailored character property tables and pass CharProperties values directly to a GraphemeMachine object.
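The capped-buffer idea from the first point above might look something like the sketch below. To stay independent of this crate's actual API, it takes a stream of `(char, starts_new_cluster)` pairs, where the boolean is assumed to come from whatever boundary detector the caller is using; `collect_capped` and `finish` are hypothetical helper names.

```rust
/// Collects clusters from `(char, starts_new_cluster)` decisions, replacing any
/// cluster longer than `max_chars` characters with U+FFFD.
fn collect_capped<I>(decisions: I, max_chars: usize) -> Vec<String>
where
    I: IntoIterator<Item = (char, bool)>,
{
    let mut out = Vec::new();
    let mut current = String::new();
    let mut count = 0usize; // characters seen so far in the current cluster

    for (c, starts_new) in decisions {
        if starts_new && count > 0 {
            out.push(finish(&mut current, count, max_chars));
            count = 0;
        }
        count += 1;
        if count <= max_chars {
            current.push(c);
        } else {
            // Overlong: stop buffering, but keep consuming so the end of the
            // cluster can still be found.
            current.clear();
        }
    }
    if count > 0 {
        out.push(finish(&mut current, count, max_chars));
    }
    out
}

/// Emits either the buffered cluster or the replacement character.
fn finish(current: &mut String, count: usize, max_chars: usize) -> String {
    if count > max_chars {
        current.clear();
        char::REPLACEMENT_CHARACTER.to_string()
    } else {
        std::mem::take(current)
    }
}
```

Memory use is bounded by `max_chars` no matter how long a hostile cluster runs, because the boundary decision never depends on the caller's buffer.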
Unless you have a good reason to prefer this library though, it’s probably better to use unicode_segmentation because it’s widely used in the Rust community, well-maintained by an established team (whereas this library has only a single, easily-distracted author), and probably not subject to the important caveat described in the following section.
§An important caveat
The author originally wrote the code and lookup tables in this library internally within another project, and then proceeded to copy them into several other projects that needed grapheme cluster segmentation. This library is the result of finally getting around to splitting them out into a standalone unit for release.
Unfortunately the code that generated the trie used for character property lookup seems to be missing, and so this library will probably be tethered to Unicode 16.0.0 indefinitely unless the author is somehow inspired to recreate that generation program. 😖 If staying up-to-date with new Unicode versions is important to you then you should probably use unicode_segmentation instead.
It would in principle be possible to use a property lookup table maintained outside of this crate and then produce CharProperties values to pass into a GraphemeMachine without using this library’s lookup tables at all, though I expect few would be motivated to do that.
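As a rough idea of what an externally maintained lookup could look like, here is a minimal sketch that binary-searches a sorted range table. Everything in it is hypothetical: `ToyGcb` is a stand-in rather than this crate's GCBProperty, the table is a four-entry excerpt rather than real generated data, and converting the result into a CharProperties value is left out because that constructor is not described here.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a Grapheme_Cluster_Break value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyGcb {
    Cr,
    Lf,
    Extend,
    Zwj,
    Other,
}

/// Sorted, non-overlapping inclusive code point ranges.
/// Illustrative excerpt only; a real table would be generated from the
/// Unicode Character Database.
const TABLE: &[(u32, u32, ToyGcb)] = &[
    (0x000A, 0x000A, ToyGcb::Lf),
    (0x000D, 0x000D, ToyGcb::Cr),
    (0x0300, 0x036F, ToyGcb::Extend),
    (0x200D, 0x200D, ToyGcb::Zwj),
];

/// Looks up the property for `c`, defaulting to `Other` when no range matches.
fn lookup(c: char) -> ToyGcb {
    let cp = c as u32;
    TABLE
        .binary_search_by(|&(lo, hi, _)| {
            if cp < lo {
                Ordering::Greater // this range lies above the target
            } else if cp > hi {
                Ordering::Less // this range lies below the target
            } else {
                Ordering::Equal
            }
        })
        .map(|i| TABLE[i].2)
        .unwrap_or(ToyGcb::Other)
}
```

A table like this can be regenerated for each new Unicode release, which is the main attraction of keeping it outside the crate.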
Structs§
- CharProperties - Represents selections from the two derived Unicode character properties used for grapheme cluster segmentation.
- GraphemeMachine - A finite state machine for detecting grapheme cluster boundaries.
Enums§
- ClusterAction - What to do with a new character after presenting it to a GraphemeMachine.
- GCBProperty - Enumeration of Grapheme_Cluster_Break property values, from UAX #29 Section 3.1.
- InCBProperty - Enumeration of Indic_Conjunct_Break property values, as defined in DerivedCoreProperties.txt based on the rules in UAX #44.