Crate icu::segmenter

Segment strings by lines, graphemes, words, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

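This module contains segmenter implementations for the following rules:

  • Line breaking, compatible with Unicode Standard Annex #14.
  • Grapheme cluster, word, and sentence breaking, compatible with Unicode Standard Annex #29.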

Examples

Line Break

Find line break opportunities:

 use icu::segmenter::LineSegmenter;

 let segmenter = LineSegmenter::new_auto();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 6, 13, 17, 23, 29, 36]);

See LineSegmenter for more examples.
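
The breakpoints are byte offsets into the UTF-8 string, so each adjacent pair of offsets delimits one segment. The following is a minimal sketch of that step using only the standard library. The helper name `segments` is hypothetical (it is not part of icu_segmenter), and the breakpoints are copied from the example above rather than computed:

```rust
/// Slice `text` at each adjacent pair of byte offsets.
/// Hypothetical helper, not part of icu_segmenter. Panics if an offset
/// is out of range or does not fall on a UTF-8 character boundary.
fn segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

fn main() {
    let text = "Hello World. Xin chào thế giới!";
    // Breakpoints copied from the LineSegmenter example above.
    let breakpoints = [0, 6, 13, 17, 23, 29, 36];

    let lines = segments(text, &breakpoints);
    assert_eq!(lines, ["Hello ", "World. ", "Xin ", "chào ", "thế ", "giới!"]);
}
```

Each segment keeps its trailing space, because a line break opportunity falls after the whitespace rather than before it.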

Grapheme Cluster Break

Find all grapheme cluster boundaries:

 use icu::segmenter::GraphemeClusterSegmenter;

 let segmenter = GraphemeClusterSegmenter::new();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
         19, 21, 22, 23, 24, 25, 28, 29, 30, 31, 34, 35, 36
     ]
 );

See GraphemeClusterSegmenter for more examples.
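
Some grapheme clusters in this string occupy more than one byte, which is why the offsets skip from 19 to 21 and from 25 to 28. As a rough illustration, the sketch below compares the number of clusters with the byte length. It uses only the standard library, hard-codes the breakpoints from the example above, and the helper `segment_count` is a hypothetical name rather than an icu_segmenter API:

```rust
/// Number of segments delimited by a list of breakpoints.
/// Hypothetical helper, not part of icu_segmenter.
fn segment_count(breakpoints: &[usize]) -> usize {
    breakpoints.len().saturating_sub(1)
}

fn main() {
    let text = "Hello World. Xin chào thế giới!";
    // Breakpoints copied from the GraphemeClusterSegmenter example above.
    let breakpoints: &[usize] = &[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 21, 22, 23, 24, 25, 28, 29, 30, 31, 34, 35, 36,
    ];

    // Every offset lands on a UTF-8 character boundary, so slicing is safe.
    assert!(breakpoints.iter().all(|&i| text.is_char_boundary(i)));

    // 36 bytes, but only 31 grapheme clusters.
    assert_eq!(text.len(), 36);
    assert_eq!(segment_count(breakpoints), 31);
}
```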

Word Break

Find all word boundaries:

 use icu::segmenter::WordSegmenter;

 let segmenter = WordSegmenter::new_auto();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36]
 );

See WordSegmenter for more examples.
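
Word boundaries also delimit the spaces and punctuation between words, so the segments include entries such as " " and ".". One way to keep only word-like segments is sketched below. Note the assumptions: this is a standard-library heuristic (keep segments that contain at least one alphanumeric character), not the word type classification that the segmenter itself reports; the helper `wordlike_segments` is a hypothetical name, and the breakpoints are copied from the example above.

```rust
/// Keep only the segments that contain at least one alphanumeric character.
/// Hypothetical helper built on std only; it does not use icu_segmenter's
/// own word type information.
fn wordlike_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .filter(|segment| segment.chars().any(char::is_alphanumeric))
        .collect()
}

fn main() {
    let text = "Hello World. Xin chào thế giới!";
    // Breakpoints copied from the WordSegmenter example above.
    let breakpoints = [0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36];

    let words = wordlike_segments(text, &breakpoints);
    assert_eq!(words, ["Hello", "World", "Xin", "chào", "thế", "giới"]);
}
```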

Sentence Break

Segment the string into sentences:

 use icu::segmenter::SentenceSegmenter;

 let segmenter = SentenceSegmenter::new();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 13, 36]);

See SentenceSegmenter for more examples.

Modules

  • 🚧 [Unstable] Data provider struct definitions for this ICU4X component.

Structs

Enums

  • A list of error outcomes for various operations in this module.
  • An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line segmenter.
  • An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line segmenter.
  • A list of error outcomes for various operations in this module.
  • The word type tag that is returned by WordBreakIterator::word_type().

Type Aliases