Crate grapheme_machine

An implementation of the Grapheme Cluster portion of UAX #29: Unicode Text Segmentation that prioritizes streaming-friendliness and simplicity.

This library implements the segmentation algorithm as of Unicode 16.0.0, using the character database tables from that release.

GraphemeMachine is the main type in this library. Construct an object of that type and then feed it characters from a stream one at a time, and in return it will tell you for each new character whether it should be treated as an extension of the current grapheme cluster or the beginning of a new one. That’s all there is to it!
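
The feeding loop described above might look roughly like the following. This is a self-contained sketch and not this crate's API: `ToyMachine`, `Action`, `Class`, and `classify` are hypothetical names invented for illustration, and the rules cover only a small subset of UAX #29 (CR LF stays together, breaks around controls, no break before combining diacritical marks or ZWJ).

```rust
/// Hypothetical stand-in for the per-character decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    /// The character begins a new grapheme cluster.
    Split,
    /// The character extends the current grapheme cluster.
    Extend,
}

/// Coarse toy classes; real segmentation needs the full
/// Grapheme_Cluster_Break property from the Unicode character database.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Class {
    Cr,
    Lf,
    Control,
    Extend,
    Other,
}

fn classify(c: char) -> Class {
    match c {
        '\r' => Class::Cr,
        '\n' => Class::Lf,
        // ZWJ and the Combining Diacritical Marks block only (toy subset).
        '\u{200D}' | '\u{0300}'..='\u{036F}' => Class::Extend,
        c if c.is_control() => Class::Control,
        _ => Class::Other,
    }
}

/// Toy state machine: its only state is the class of the previous character.
#[derive(Default)]
struct ToyMachine {
    prev: Option<Class>,
}

impl ToyMachine {
    fn step(&mut self, c: char) -> Action {
        let cur = classify(c);
        let action = match (self.prev, cur) {
            (None, _) => Action::Split,                     // start of text
            (Some(Class::Cr), Class::Lf) => Action::Extend, // keep CR LF together
            (Some(Class::Cr | Class::Lf | Class::Control), _) => Action::Split,
            (_, Class::Cr | Class::Lf | Class::Control) => Action::Split,
            (_, Class::Extend) => Action::Extend,
            _ => Action::Split,
        };
        self.prev = Some(cur);
        action
    }
}

/// The caller owns all buffering; the machine only answers split-or-extend.
fn clusters(text: &str) -> Vec<String> {
    let mut machine = ToyMachine::default();
    let mut out: Vec<String> = Vec::new();
    for c in text.chars() {
        match machine.step(c) {
            Action::Split => out.push(c.to_string()),
            Action::Extend => match out.last_mut() {
                Some(last) => last.push(c),
                None => out.push(c.to_string()),
            },
        }
    }
    out
}

fn main() {
    for cluster in clusters("e\u{301}a\r\nb") {
        println!("{:?}", cluster);
    }
}
```

Note that the machine never sees the text as a whole: each call receives one character, so the same loop works whether the characters come from a string, a socket, or a decoder.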

The canonical Rust library for UAX #29 is unicode_segmentation, so that is probably what you should use in most cases. This library differs from it in the following main ways (as of unicode_segmentation v1.12.0):

The primary entry point for grapheme clusters in unicode_segmentation is Graphemes, which expects the entire text to be in memory as a single buffer.

The library also offers GraphemeCursor for working with non-contiguous buffers, but it has a rather challenging API and is difficult to use in a completely streaming manner, with the caller required to sometimes provide earlier context to help it make a decision.

By contrast, GraphemeMachine in this library is a finite state machine that is advanced one character at a time, with no requirement for the caller to do any buffering at all. Of course in practice it’s likely that a normal caller will need to at least buffer the current grapheme cluster so it can be used once finally split, but how to manage that is left entirely up to the caller.

For example, a caller could decide that it only cares about grapheme clusters up to some reasonable maximum length, after which it will just assume malicious or corrupt input and use the Unicode replacement character instead. The GraphemeMachine can still allow that caller to find the end of that overlong grapheme cluster and begin consuming the next one even though the caller is no longer including any new characters into its buffer.
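
The capped-buffer strategy from the example above could be sketched as follows. Everything here is an assumption made for illustration: `MAX_CLUSTER_BYTES` is an arbitrary limit, and `starts_new_cluster` is a toy boundary test (it only treats combining diacritical marks as extending) standing in for whatever a real segmenter reports.

```rust
/// Assumed cap (in bytes) on how much of one cluster the caller keeps.
const MAX_CLUSTER_BYTES: usize = 8;

/// Toy boundary decision (hypothetical): every character starts a new
/// cluster unless it is in the Combining Diacritical Marks block.
fn starts_new_cluster(c: char) -> bool {
    !('\u{0300}'..='\u{036F}').contains(&c)
}

/// Take the buffered cluster, or U+FFFD if it overflowed the cap.
fn finish(buf: &mut String, overlong: bool) -> String {
    let cluster = std::mem::take(buf);
    if overlong {
        '\u{FFFD}'.to_string()
    } else {
        cluster
    }
}

fn capped_clusters(text: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut buf = String::new();
    let mut overlong = false;

    for c in text.chars() {
        // The boundary decision is still made for every character, even
        // while the caller has stopped buffering an overlong cluster.
        if starts_new_cluster(c) && !buf.is_empty() {
            out.push(finish(&mut buf, overlong));
            overlong = false;
        }
        if overlong {
            continue; // discard the rest of the overlong cluster
        }
        if buf.len() + c.len_utf8() > MAX_CLUSTER_BYTES {
            overlong = true;
        } else {
            buf.push(c);
        }
    }
    if !buf.is_empty() {
        out.push(finish(&mut buf, overlong));
    }
    out
}

fn main() {
    // One base letter followed by far too many combining marks, then "b".
    let hostile = format!("a{}b", "\u{301}".repeat(10));
    println!("{:?}", capped_clusters(&hostile));
}
```

The memory used per cluster stays bounded by the cap no matter how long the input cluster is, because the boundary decision does not depend on the caller's buffer.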

unicode_segmentation finds the relevant Unicode character properties for incoming characters using binary search over its internal tables, after converting the character into a Rust char value.

GraphemeMachine instead prefers to work with UTF-8 encoded characters as represented by u8char, which can be more cheaply extracted from and appended to Rust strings. The character property lookup is done using a trie based on the UTF-8 byte sequence, and so is potentially faster when you’re chomping UTF-8 sequences from a str buffer one at a time.

(That’s not necessarily true, though. Measure it yourself with the text you want to segment if performance is important to you!)
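
To make the difference between the two lookup styles concrete, here is a minimal sketch that is not taken from either library. `Prop`, `RANGES`, `lookup_by_char`, and `lookup_by_utf8` are hypothetical names, the table is a tiny hand-picked excerpt, and the hand-written `match` only stands in for what a generated, table-driven trie would do.

```rust
use std::cmp::Ordering;

/// Toy subset of property values (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Prop {
    Cr,
    Lf,
    Extend,
    Zwj,
    Other,
}

/// Sorted, non-overlapping (start, end, property) code point ranges.
const RANGES: &[(u32, u32, Prop)] = &[
    (0x000A, 0x000A, Prop::Lf),
    (0x000D, 0x000D, Prop::Cr),
    (0x0300, 0x036F, Prop::Extend),
    (0x200D, 0x200D, Prop::Zwj),
];

/// Style 1: decode to `char` first, then binary-search the range table.
fn lookup_by_char(c: char) -> Prop {
    let cp = c as u32;
    RANGES
        .binary_search_by(|&(start, end, _)| {
            if end < cp {
                Ordering::Less
            } else if start > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| RANGES[i].2)
        .unwrap_or(Prop::Other)
}

/// Style 2: branch on the UTF-8 bytes directly, one byte per level,
/// without ever assembling the code point.
fn lookup_by_utf8(bytes: &[u8]) -> Prop {
    match bytes {
        [0x0A] => Prop::Lf,
        [0x0D] => Prop::Cr,
        [0xCC, 0x80..=0xBF] => Prop::Extend, // U+0300..=U+033F
        [0xCD, 0x80..=0xAF] => Prop::Extend, // U+0340..=U+036F
        [0xE2, 0x80, 0x8D] => Prop::Zwj,     // U+200D
        _ => Prop::Other,
    }
}

fn main() {
    let text = "e\u{301}\r\n\u{200D}x";
    for (i, c) in text.char_indices() {
        // Slice the encoded bytes straight out of the str buffer.
        let bytes = &text.as_bytes()[i..i + c.len_utf8()];
        println!("{:?}: {:?} / {:?}", c, lookup_by_char(c), lookup_by_utf8(bytes));
    }
}
```

Both functions return the same answer for every character in this toy table; which style is faster on real data depends on the text and the table layout.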

Although GraphemeMachine can work with char and u8char values representing specific characters, the segmentation algorithm is actually defined in terms of groups of characters that share similar properties.

This library exposes those categories as part of its public API using CharProperties, GCBProperty, and InCBProperty, and so it could be useful purely as a character property lookup library even if you don’t use GraphemeMachine, or you could even choose to use your own tailored character property tables and pass CharProperties values directly to a GraphemeMachine object.
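
As a rough illustration of why a property-level interface makes tailoring possible, the sketch below drives a boundary decision from property values alone and swaps in a different lookup function. It does not use this crate's types: `Gcb`, `default_lookup`, `tailored_lookup`, and `count_clusters` are hypothetical, and the only rule modelled is "do not break before Extend".

```rust
/// Toy property with two values (hypothetical; the real
/// Grapheme_Cluster_Break property has many more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Gcb {
    Extend,
    Other,
}

/// Toy default table: only the Combining Diacritical Marks block extends.
fn default_lookup(c: char) -> Gcb {
    if ('\u{0300}'..='\u{036F}').contains(&c) {
        Gcb::Extend
    } else {
        Gcb::Other
    }
}

/// Assumed tailoring: an application that uses the private-use character
/// U+E000 as a combining mark classifies it as Extend.
fn tailored_lookup(c: char) -> Gcb {
    if c == '\u{E000}' {
        Gcb::Extend
    } else {
        default_lookup(c)
    }
}

/// Counts clusters from property values only; it never sees characters,
/// so any lookup table can feed it.
fn count_clusters(props: impl IntoIterator<Item = Gcb>) -> usize {
    props
        .into_iter()
        .enumerate()
        .filter(|&(i, p)| i == 0 || p != Gcb::Extend)
        .count()
}

fn main() {
    let text = "a\u{E000}b";
    let by_default = count_clusters(text.chars().map(default_lookup));
    let tailored = count_clusters(text.chars().map(tailored_lookup));
    println!("default: {by_default}, tailored: {tailored}");
}
```

Because the segmentation step only consumes property values, changing the table changes the result without touching the segmentation logic.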

Unless you have a good reason to prefer this library, though, it’s probably better to use unicode_segmentation: it is widely used in the Rust community, well maintained by an established team (whereas this library has only a single, easily distracted author), and probably not subject to the important caveat described in the following section.

§An important caveat

The author originally wrote the code and lookup tables in this library as part of another project, and then copied them into several other projects that needed grapheme cluster segmentation. This library is the result of finally getting around to extracting them into a standalone unit for release.

Unfortunately, the code that generated the trie used for character property lookup seems to be missing, so this library will probably be tethered to Unicode 16.0.0 indefinitely unless the author is somehow inspired to recreate that generation program. 😖 If staying up to date with new Unicode versions is important to you, then you should probably use unicode_segmentation instead.

It would in principle be possible to use a property lookup table maintained outside of this crate and then produce CharProperties values to pass into a GraphemeMachine without using this library’s lookup tables at all, though I expect few would be motivated to do that.

Structs§

CharProperties: Represents selections from the two derived Unicode character properties used for grapheme cluster segmentation.
GraphemeMachine: A finite state machine for detecting grapheme cluster boundaries.

Enums§

ClusterAction: What to do with a new character after presenting it to a GraphemeMachine.
GCBProperty: Enumeration of Grapheme_Cluster_Break property values, from UAX#29 Section 3.1.
InCBProperty: Enumeration of Indic_Conjunct_Break property values, as defined in DerivedCoreProperties.txt based on the rules in UAX#44.