Crate icu_segmenter

Segment strings by lines, graphemes, words, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

This module contains segmenter implementations for the following rules:

  • Line segmenter that is compatible with Unicode Standard Annex #14 (Version 15.1.0), Unicode Line Breaking Algorithm, with options to tailor line-breaking behavior for CSS line-break and word-break properties.
  • Grapheme cluster segmenter, word segmenter, and sentence segmenter that are compatible with Unicode Standard Annex #29 (Version 17.0.0), Unicode Text Segmentation.

§Examples

§Line Break

Find line break opportunities:

 use icu::segmenter::LineSegmenter;

 let segmenter = LineSegmenter::new_auto(Default::default());

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 6, 13, 17, 23, 29, 36]);

See LineSegmenter for more examples.
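
The breakpoints are byte offsets into the UTF-8 string, so consecutive pairs delimit the text between two break opportunities. The following is a minimal std-only sketch of turning such a list into string slices. The helper name `segments` is hypothetical and not part of icu_segmenter, and it assumes the offsets are sorted and fall on character boundaries, as they do in the example above.

```rust
/// Splits `text` at the given byte offsets and returns the pieces in order.
///
/// Hypothetical helper, not part of icu_segmenter. `breakpoints` is assumed
/// to be sorted, to include 0 and `text.len()`, and to fall on UTF-8
/// character boundaries; slicing panics otherwise.
fn segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        // Each adjacent pair of offsets bounds one segment.
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

Applied to the string and breakpoints above, this yields "Hello ", "World. ", "Xin ", "chào ", "thế " and "giới!". Trailing spaces stay attached to the preceding segment because the break opportunity comes after them.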

§Grapheme Cluster Break

Find all grapheme cluster boundaries:

 use icu::segmenter::GraphemeClusterSegmenter;

 let segmenter = GraphemeClusterSegmenter::new();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
         19, 21, 22, 23, 24, 25, 28, 29, 30, 31, 34, 35, 36
     ]
 );

See GraphemeClusterSegmenter for more examples.
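
The gaps in the list above (19 to 21, 25 to 28, 31 to 34) arise because the offsets count UTF-8 bytes: "à" occupies two bytes, "ế" and "ớ" three each. If character positions are needed instead, a byte offset can be converted with the standard library alone. This is a sketch, and `byte_to_char_offset` is a hypothetical helper rather than an icu_segmenter API.

```rust
/// Converts a UTF-8 byte offset in `text` into a count of Unicode scalar
/// values (`char`s) preceding that offset.
///
/// Hypothetical helper, not part of icu_segmenter. Panics if `byte_offset`
/// is not on a character boundary or is past the end of `text`.
fn byte_to_char_offset(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].chars().count()
}
```

For the example string, byte offset 21 corresponds to character offset 20, and the final byte offset 36 corresponds to 31 characters, which matches the 31 grapheme clusters delimited by the 32 breakpoints.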

§Word Break

Find all word boundaries:

 use icu::segmenter::{options::WordBreakInvariantOptions, WordSegmenter};

 let segmenter =
     WordSegmenter::new_auto(WordBreakInvariantOptions::default());

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36]
 );

See WordSegmenter for more examples.
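
Word boundaries also surround spaces and punctuation, so the list above delimits segments such as " " and "." in addition to the words. One rough way to keep only the word-like pieces, using just the standard library, is to drop segments that contain no alphanumeric character. This is an approximation for illustration, and `wordlike_segments` is a hypothetical helper, not an icu_segmenter API.

```rust
/// Slices `text` at `breakpoints` and keeps only the segments that contain
/// at least one alphanumeric character.
///
/// Hypothetical helper, not part of icu_segmenter. `breakpoints` is assumed
/// to be sorted and to fall on UTF-8 character boundaries.
fn wordlike_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        // Discard whitespace-only and punctuation-only segments.
        .filter(|segment| segment.chars().any(char::is_alphanumeric))
        .collect()
}
```

With the string and breakpoints above, the result is "Hello", "World", "Xin", "chào", "thế" and "giới".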

§Sentence Break

Segment the string into sentences:

 use icu::segmenter::{
     options::SentenceBreakInvariantOptions, SentenceSegmenter,
 };

 let segmenter =
     SentenceSegmenter::new(SentenceBreakInvariantOptions::default());

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 13, 36]);

See SentenceSegmenter for more examples.

Modules§

iterators
Types supporting iteration over segments. Obtained from the segmenter types.
options
Options structs and enums
provider
🚧 [Unstable] Data provider struct definitions for this ICU4X component.
scaffold
Largely internal scaffolding types. You should very rarely need to reference these directly.

Structs§

GraphemeClusterSegmenter
Segments a string into grapheme clusters.
GraphemeClusterSegmenterBorrowed
Segments a string into grapheme clusters (borrowed version).
LineSegmenter
Supports loading line break data, and creating line break iterators for different string encodings.
LineSegmenterBorrowed
Segments a string into lines (borrowed version).
SentenceSegmenter
Supports loading sentence break data, and creating sentence break iterators for different string encodings.
SentenceSegmenterBorrowed
Segments a string into sentences (borrowed version).
WordSegmenter
Supports loading word break data, and creating word break iterators for different string encodings.
WordSegmenterBorrowed
Segments a string into words (borrowed version).