icu_segmenter 1.0.0-alpha1

Unicode line breaking and text segmentation algorithms for text boundaries analysis
Documentation

icu_segmenter crates.io

[Experimental] Segment strings by lines, graphemes, word, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

This module contains segmenter implementation for the following rules.

Examples

Line Break

Segment a string with default options:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);

Segment a string with CSS option overrides:

use icu_segmenter::{LineBreakOptions, LineBreakRule, LineBreakSegmenter, WordBreakRule};

let mut options = LineBreakOptions::default();
options.line_break_rule = LineBreakRule::Strict;
options.word_break_rule = WordBreakRule::BreakAll;
options.ja_zh = false;
let provider = icu_testdata::get_provider();
let segmenter =
    LineBreakSegmenter::try_new_with_options(&provider, options).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);

Segment a Latin1 byte string:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);

Grapheme Cluster Break

Segment a string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);

Segment a Latin1 byte string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);

Word Break

Segment a string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Segment a Latin1 byte string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Sentence Break

Segment a string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

Segment a Latin1 byte string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

More Information

For more information on development, authorship, contributing etc. please visit ICU4X home page.