
🚧 [Experimental] Segment strings by lines, graphemes, words, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

This module contains segmenter implementations for the following break rules: line, grapheme cluster, word, and sentence.

🚧 This code is experimental; it may change at any time, in breaking or non-breaking ways, including in SemVer minor releases. It can be enabled with the "experimental" feature of the icu meta-crate. Use with caution. #2259

Examples

Line Break

Segment a string with default options:

 use icu::segmenter::LineBreakSegmenter;

 let segmenter =
     LineBreakSegmenter::try_new_unstable(&icu_testdata::unstable())
         .expect("Data exists");

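 // Break opportunities are byte offsets into the string:
 // 6 is just after "Hello " and 11 is the end of the string.
 let breakpoints: Vec<usize> =
     segmenter.segment_str("Hello World").collect();
 assert_eq!(&breakpoints, &[6, 11]);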

See LineBreakSegmenter for more examples.
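
As a rough illustration of how line break opportunities might be consumed, the sketch below greedily wraps text using a list of byte offsets such as the `[6, 11]` produced above. It uses only the standard library. The function name `wrap_at_opportunities` and the use of byte length as a stand-in for display width are assumptions made for this example and are not part of this crate.

```rust
/// Hypothetical helper (not part of icu_segmenter): greedily wraps `text`
/// into lines of at most `max_bytes` bytes, breaking only at the given
/// break opportunities. `breakpoints` must be ascending byte offsets on
/// character boundaries, ending with `text.len()`.
fn wrap_at_opportunities<'a>(
    text: &'a str,
    breakpoints: &[usize],
    max_bytes: usize,
) -> Vec<&'a str> {
    let mut lines = Vec::new();
    let mut line_start = 0;
    // Furthest break opportunity seen so far on the current line.
    let mut last_fit = 0;
    for &bp in breakpoints {
        // If extending to `bp` would overflow the line, close the line at
        // the previous opportunity (when there is one past the line start).
        if bp - line_start > max_bytes && last_fit > line_start {
            lines.push(&text[line_start..last_fit]);
            line_start = last_fit;
        }
        last_fit = bp;
    }
    // Emit whatever remains as the final line.
    if line_start < text.len() {
        lines.push(&text[line_start..]);
    }
    lines
}
```

Calling `wrap_at_opportunities("Hello World", &[6, 11], 8)` splits the text at offset 6, while a limit of 20 keeps it on one line. A segment longer than the limit is emitted unbroken, and trailing spaces count toward the width; a real layout engine would measure rendered width instead of bytes.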

Grapheme Cluster Break

See GraphemeClusterBreakSegmenter for examples.

Word Break

Segment a string:

 use icu::segmenter::WordBreakSegmenter;

 let segmenter =
     WordBreakSegmenter::try_new_unstable(&icu_testdata::unstable())
         .expect("Data exists");

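 // Word boundaries are byte offsets: the start (0), both sides of
 // the space (5 and 6), and the end of the string (11).
 let breakpoints: Vec<usize> =
     segmenter.segment_str("Hello World").collect();
 assert_eq!(&breakpoints, &[0, 5, 6, 11]);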

See WordBreakSegmenter for more examples.
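
Boundary offsets like `[0, 5, 6, 11]` are often more convenient as string slices. The following is a minimal standard-library sketch of that conversion; the helper name `segments` is hypothetical and is not provided by this crate. It assumes the boundaries are ascending byte offsets that fall on character boundaries, as in the example above.

```rust
/// Hypothetical helper (not part of icu_segmenter): turns a list of
/// ascending boundary offsets into the slices between adjacent boundaries.
fn segments<'a>(text: &'a str, boundaries: &[usize]) -> Vec<&'a str> {
    boundaries
        .windows(2) // each adjacent pair [start, end]
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

For `"Hello World"` with boundaries `[0, 5, 6, 11]`, this yields `"Hello"`, `" "`, and `"World"`. Note that the space is its own segment; callers that only want words would need to filter out whitespace-only slices.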

Sentence Break

See SentenceBreakSegmenter for examples.

Modules

Data provider struct definitions for this ICU4X component.

Structs

Segments a string into grapheme clusters.
Implements the Iterator trait over the line break opportunities of the given string. See the examples in LineBreakSegmenter for usage.
Options to tailor line breaking behavior, such as for CSS.
Supports loading line break data, and creating line break iterators for different string encodings.
Implements the Iterator trait over the segmenter break opportunities of the given string.
Supports loading sentence break data, and creating sentence break iterators for different string encodings.
Supports loading word break data, and creating word break iterators for different string encodings.

Enums

A list of error outcomes for various operations in the icu_segmenter crate.
An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line breaker.
A list of error outcomes for various operations in the icu_segmenter crate.
An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line breaker.

Type Definitions

Grapheme cluster break iterator for a Latin-1 (8-bit) string.
Grapheme cluster break iterator for a potentially invalid UTF-8 string.
Grapheme cluster break iterator for an str (a UTF-8 string).
Grapheme cluster break iterator for a UTF-16 string.
Line break iterator for a Latin-1 (8-bit) string.
Line break iterator for a potentially invalid UTF-8 string.
Line break iterator for an str (a UTF-8 string).
Line break iterator for a UTF-16 string.
Sentence break iterator for a Latin-1 (8-bit) string.
Sentence break iterator for a potentially invalid UTF-8 string.
Sentence break iterator for an str (a UTF-8 string).
Sentence break iterator for a UTF-16 string.
Word break iterator for a Latin-1 (8-bit) string.
Word break iterator for a potentially invalid UTF-8 string.
Word break iterator for an str (a UTF-8 string).
Word break iterator for a UTF-16 string.
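
Because the iterators above operate on different encodings, the same break position can have a different numeric offset in each one. The sketch below assumes that offsets are counted in code units of the input encoding (bytes for UTF-8, 16-bit units for UTF-16); that assumption is not stated on this page. The helper `utf8_offset_to_utf16` is hypothetical and uses only the standard library.

```rust
/// Hypothetical helper (not part of icu_segmenter): converts a byte offset
/// into a UTF-8 string to the matching offset in UTF-16 code units.
/// Returns `None` if `byte_offset` is out of range or does not fall on a
/// character boundary.
fn utf8_offset_to_utf16(text: &str, byte_offset: usize) -> Option<usize> {
    text.get(..byte_offset)
        .map(|prefix| prefix.encode_utf16().count())
}
```

For example, in `"café au lait"` the byte offset 6 (just after the first space) corresponds to UTF-16 offset 5, because `é` takes two bytes in UTF-8 but one code unit in UTF-16.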