Crate icu_segmenter
This crate implements the following text segmenters:
- Line breaker that is compatible with Unicode Standard Annex #14 and CSS properties.
- Grapheme cluster breaker, word breaker, and sentence breaker that are compatible with Unicode Standard Annex #29.
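The segmenters in this crate report their results as a list of offsets into the input, as the examples in this documentation show. The following is a minimal sketch, using only the standard library, of how such a list might be turned into substrings. The helper `split_at_breakpoints` is hypothetical and not part of icu_segmenter, and the offsets are copied from the assertions in this documentation rather than computed.

```rust
/// Hypothetical helper (not part of icu_segmenter): split `text` at the given
/// byte offsets. The offsets must be ascending and must fall on UTF-8
/// character boundaries, otherwise the slicing below panics.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut start = 0;
    for &end in breakpoints {
        // Some segmenters report a leading 0 and some do not; skipping
        // empty ranges lets one helper handle both shapes.
        if end > start {
            segments.push(&text[start..end]);
            start = end;
        }
    }
    segments
}

fn main() {
    // Offsets taken from the line break example: no leading 0.
    let lines = split_at_breakpoints("Hello World", &[6, 11]);
    assert_eq!(lines, ["Hello ", "World"]);

    // Offsets taken from the word break example: leading 0 included.
    let words = split_at_breakpoints("Hello World", &[0, 5, 6, 11]);
    assert_eq!(words, ["Hello", " ", "World"]);

    // Offsets taken from the grapheme cluster example; the last cluster
    // spans four bytes.
    let clusters = split_at_breakpoints("Hello 🗺", &[0, 1, 2, 3, 4, 5, 6, 10]);
    assert_eq!(clusters.last(), Some(&"🗺"));
}
```

In real use the offsets would come from collecting a segmenter's iterator instead of a literal slice.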
Examples
Line Break
Segment a string with default options:
use icu_segmenter::LineBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
Segment a string with CSS option overrides:
use icu_segmenter::{LineBreakSegmenter, LineBreakOptions, LineBreakRule, WordBreakRule};
let mut options = LineBreakOptions::default();
options.line_break_rule = LineBreakRule::Strict;
options.word_break_rule = WordBreakRule::BreakAll;
options.ja_zh = false;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new_with_options(&provider, options)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);
Segment a Latin1 byte string:
use icu_segmenter::LineBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[6, 11]);
Grapheme Cluster Break
Segment a string:
use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);
Segment a Latin1 byte string:
use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);
Word Break
Segment a string:
use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
Segment a Latin1 byte string:
use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
Sentence Break
Segment a string:
use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);
Segment a Latin1 byte string:
use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider)
    .expect("Data exists");
let breakpoints: Vec<usize> = segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);
Modules
Structs
Marker type for RuleBreakDataV1: “segmenter/grapheme@1”
Supports loading grapheme cluster break data, and creating grapheme cluster break iterators for different string encodings. Please see the module-level documentation for its usages.
Marker type for RuleBreakDataV1: “segmenter/line@1”
Implements the Iterator trait over the line break opportunities of the given string. Please see the module-level documentation for its usages.
Options to tailor line breaking behavior, such as for CSS.
Supports loading line break data, and creating line break iterators for different string encodings. Please see the module-level documentation for its usages.
Pre-processed Unicode data in the form of tables to be used for rule-based breaking.
Property table for rule-based breaking.
Break state table for rule-based breaking.
Marker type for RuleBreakDataV1: “segmenter/sentence@1”
Supports loading sentence break data, and creating sentence break iterators for different string encodings. Please see the module-level documentation for its usages.
Marker type for UCharDictionaryBreakDataV1: “segmenter/char16trie@1”
Marker type for RuleBreakDataV1: “segmenter/word@1”
Supports loading word break data, and creating word break iterators for different string encodings. Please see the module-level documentation for its usages.
Enums
An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line breaker.
An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line breaker.
Constants
Type Definitions
Grapheme cluster break iterator for an str (a UTF-8 string).
Grapheme cluster break iterator for a Latin-1 (8-bit) string.
Grapheme cluster break iterator for a UTF-16 string.
Sentence break iterator for an str (a UTF-8 string).
Sentence break iterator for a Latin-1 (8-bit) string.
Sentence break iterator for a UTF-16 string.
Word break iterator for an str (a UTF-8 string).
Word break iterator for a Latin-1 (8-bit) string.
Word break iterator for a UTF-16 string.
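The iterator types above cover three encodings, and the same text has a different length in each. This matters when interpreting breakpoint offsets: the grapheme cluster example reports 10 as the final offset for "Hello 🗺" because the input is measured in UTF-8 bytes. The sketch below uses only the standard library to compare the lengths. The helpers `utf16_len` and `to_latin1` are hypothetical and not part of icu_segmenter, and the assumption that the UTF-16 iterators report offsets in 16-bit code units is not confirmed by this page.

```rust
/// Hypothetical helper: length of `text` in UTF-16 code units.
fn utf16_len(text: &str) -> usize {
    text.encode_utf16().count()
}

/// Hypothetical helper: encode `text` as Latin-1, one byte per character.
/// Returns `None` if any character lies above U+00FF.
fn to_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let text = "Hello 🗺";

    // UTF-8: six one-byte characters plus one four-byte character.
    assert_eq!(text.len(), 10);

    // UTF-16: six single code units plus one surrogate pair.
    assert_eq!(utf16_len(text), 8);

    // Latin-1 cannot represent U+1F5FA at all.
    assert_eq!(to_latin1(text), None);

    // Pure ASCII text has the same length in all three encodings.
    let ascii = "Hello World";
    assert_eq!(ascii.len(), 11);
    assert_eq!(utf16_len(ascii), 11);
    assert_eq!(to_latin1(ascii), Some(b"Hello World".to_vec()));
}
```

Offsets produced for one encoding should therefore not be used to index a buffer held in another.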