Segmenter implementations for the following Unicode text segmentation rules: line break, grapheme cluster break, word break, and sentence break.

Examples

Line Break

Segment a string with default options:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
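
The offsets returned by `segment_str` are byte positions in the input string, so they can be used directly for slicing. The following is a minimal sketch using only the standard library. `segments_from_breakpoints` is a hypothetical helper, not part of `icu_segmenter`, and it assumes the offsets are sorted and fall on UTF-8 character boundaries.

```rust
/// Hypothetical helper (not part of `icu_segmenter`): slices `text` at the
/// given byte offsets. Assumes the offsets are sorted and lie on UTF-8
/// character boundaries; slicing panics otherwise.
fn segments_from_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut start = 0;
    for &end in breakpoints {
        // Skip a leading 0 or a repeated offset, which would give an empty slice.
        if end > start {
            segments.push(&text[start..end]);
            start = end;
        }
    }
    segments
}
```

With the breakpoints `[6, 11]` from the example above, this would yield `"Hello "` and `"World"`.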

Segment a string with CSS option overrides:

use icu_segmenter::{LineBreakSegmenter, LineBreakOptions, LineBreakRule, WordBreakRule};

let mut options = LineBreakOptions::default();
options.line_break_rule = LineBreakRule::Strict;
options.word_break_rule = WordBreakRule::BreakAll;
options.ja_zh = false;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new_with_options(&provider, options)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);

Segment a Latin1 byte string:

use icu_segmenter::LineBreakSegmenter;

let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
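
`segment_latin1` takes raw bytes rather than a `str`. In Latin-1 every byte is one character, so offsets into a Latin-1 buffer can differ from UTF-8 byte offsets once the text contains non-ASCII characters. The sketch below illustrates that difference with the standard library only. Both functions are hypothetical helpers introduced for illustration and are not part of `icu_segmenter`.

```rust
/// Hypothetical helper: decodes Latin-1 bytes into a `String`.
/// Each Latin-1 byte maps to the Unicode scalar value with the same number
/// (U+0000..=U+00FF), so `u8 as char` is a lossless conversion.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Returns (length in Latin-1 bytes, length in UTF-8 bytes) for the same text.
fn latin1_and_utf8_len(bytes: &[u8]) -> (usize, usize) {
    (bytes.len(), latin1_to_string(bytes).len())
}
```

For pure ASCII input such as `b"Hello World"` the two lengths agree, which is why the Latin-1 and `str` examples above report the same offsets.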

Grapheme Cluster Break

Segment a string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);
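
The last offset is 10 rather than 7 because the offsets count UTF-8 bytes, not characters. The standard-library sketch below shows where those numbers come from. `char_end_offsets` is a hypothetical helper that splits on Unicode scalar values, not on grapheme clusters. It reproduces the offsets above only because every character in this input happens to form its own cluster, and it is not a substitute for the segmenter.

```rust
/// Hypothetical helper: returns 0 followed by the byte offset just past each
/// `char` of `text`. This splits on Unicode scalar values only; it does NOT
/// implement grapheme cluster rules.
fn char_end_offsets(text: &str) -> Vec<usize> {
    let mut offsets = vec![0];
    for (start, ch) in text.char_indices() {
        // `len_utf8` is the number of bytes this char occupies in UTF-8.
        offsets.push(start + ch.len_utf8());
    }
    offsets
}
```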

Segment a Latin1 byte string:

use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);

Word Break

Segment a string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Segment a Latin1 byte string:

use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Sentence Break

Segment a string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

Segment a Latin1 byte string:

use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider)
    .expect("Data exists");

let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

Modules

Structs

Marker type for RuleBreakDataV1: “segmenter/grapheme@1”

Loads grapheme cluster break data and creates grapheme cluster break iterators for different string encodings. See the module-level documentation for usage.

Marker type for RuleBreakDataV1: “segmenter/line@1”

Implements the Iterator trait over the line break opportunities of the given string. See the module-level documentation for usage.

Options to tailor line breaking behavior, such as for CSS.

Loads line break data and creates line break iterators for different string encodings. See the module-level documentation for usage.

Pre-processed Unicode data in the form of tables to be used for rule-based breaking.

Property table for rule-based breaking.

Break state table for rule-based breaking.

Marker type for RuleBreakDataV1: “segmenter/sentence@1”

Loads sentence break data and creates sentence break iterators for different string encodings. See the module-level documentation for usage.

Marker type for UCharDictionaryBreakDataV1: “segmenter/char16trie@1”

Marker type for RuleBreakDataV1: “segmenter/word@1”

Loads word break data and creates word break iterators for different string encodings. See the module-level documentation for usage.

Enums

An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line breaker.

An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line breaker.

Constants

Type Definitions

Grapheme cluster break iterator for an str (a UTF-8 string).

Grapheme cluster break iterator for a Latin-1 (8-bit) string.

Grapheme cluster break iterator for a UTF-16 string.

Sentence break iterator for an str (a UTF-8 string).

Sentence break iterator for a Latin-1 (8-bit) string.

Sentence break iterator for a UTF-16 string.

Word break iterator for an str (a UTF-8 string).

Word break iterator for a Latin-1 (8-bit) string.

Word break iterator for a UTF-16 string.
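
The UTF-16 iterators above operate on 16-bit code units rather than UTF-8 bytes. Below is a minimal standard-library sketch of preparing such input. `to_utf16` is a hypothetical helper, how the buffer is handed to a segmenter is not shown, and the idea that reported offsets would then count code units is an assumption, not something stated on this page.

```rust
/// Hypothetical helper: encodes a `str` as UTF-16 code units.
/// Characters above U+FFFF become a surrogate pair (two code units).
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}
```

For example, "Hello 🗺" is 10 bytes long in UTF-8 but 8 code units long in UTF-16, because World Map (U+1F5FA) takes four bytes in the former and a surrogate pair in the latter.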