icu_segmenter
[Experimental] Segment strings by lines, graphemes, words, and sentences.
This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.
This module contains segmenter implementations for the following rules:
- Line breaker that is compatible with Unicode Standard Annex #14 and CSS properties.
- Grapheme cluster breaker, word breaker, and sentence breaker that are compatible with Unicode Standard Annex #29.
Examples
Line Break
Segment a string with default options:
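use icu_segmenter::LineBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_str(…).collect();
assert_eq!(&breakpoints, &[…]);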
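The segmenter yields break positions rather than substrings. A minimal sketch, using only the standard library, of turning such positions into string slices. It assumes the positions are ascending UTF-8 byte offsets that fall on character boundaries. `split_at_breakpoints` is a hypothetical helper, not part of icu_segmenter.

```rust
/// Hypothetical helper (not part of icu_segmenter): slices `text` at the
/// given byte offsets. The offsets are assumed to be ascending and to fall
/// on UTF-8 character boundaries; slicing panics otherwise.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut start = 0;
    for &end in breakpoints {
        // Skip a leading 0 or a repeated offset so no empty slice is produced.
        if end > start {
            segments.push(&text[start..end]);
            start = end;
        }
    }
    segments
}
```

With the hand-written offsets 6 and 11 (chosen for illustration, not captured from a segmenter run), `split_at_breakpoints("Hello World", &[6, 11])` returns `["Hello ", "World"]`. A leading 0 in the offsets is skipped rather than producing an empty slice, so the helper works whether or not the start of the text is reported as a break.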
Segment a string with CSS option overrides:
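use icu_segmenter::{LineBreakOptions, LineBreakRule, LineBreakSegmenter, WordBreakRule};
let mut options = LineBreakOptions::default();
options.line_break_rule = LineBreakRule::Strict;
options.word_break_rule = WordBreakRule::BreakAll;
options.ja_zh = false;
let provider = icu_testdata::get_provider();
let segmenter =
    LineBreakSegmenter::try_new_with_options(&provider, options).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_str(…).collect();
assert_eq!(&breakpoints, &[…]);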
Segment a Latin1 byte string:
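use icu_segmenter::LineBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = LineBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_latin1(…).collect();
assert_eq!(&breakpoints, &[…]);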
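In Latin1 every byte is exactly one code point, so offsets into a Latin1 byte string do not generally match offsets into the same text stored as a UTF-8 `str`. A rough sketch of the difference, using only the standard library. Both function names are hypothetical and are not part of icu_segmenter.

```rust
/// Hypothetical helper: decodes Latin1 bytes into a Rust `String`.
/// Each Latin1 byte has the same value as its Unicode code point
/// (U+0000..=U+00FF), so a plain `u8 as char` cast is sufficient.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: converts an offset into the Latin1 bytes into the
/// matching byte offset in the UTF-8 encoding of the same text.
/// Bytes below 0x80 stay one byte in UTF-8; 0x80..=0xFF become two bytes.
fn latin1_to_utf8_offset(bytes: &[u8], latin1_offset: usize) -> usize {
    bytes[..latin1_offset]
        .iter()
        .map(|&b| if b < 0x80 { 1usize } else { 2 })
        .sum()
}
```

For the illustrative input `b"caf\xE9 au lait"` (12 Latin1 bytes), the decoded string is "café au lait", which occupies 13 bytes in UTF-8. The Latin1 offset 5, just after the space, corresponds to UTF-8 offset 6.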
Grapheme Cluster Break
Segment a string:
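use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_str(…).collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[…]);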
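The comment about World Map matters because positions in a `str` are counted in UTF-8 bytes, so a four-byte character moves every later offset by four. A small sketch with the standard library only. `char_boundaries` is a hypothetical helper that reports code point boundaries; that is simpler than grapheme cluster segmentation and only coincides with it for text without combining sequences. The sample string is an assumption made for illustration.

```rust
/// Hypothetical helper: byte offsets at which each `char` of `text` starts,
/// followed by the total byte length. These are code point boundaries,
/// not grapheme cluster boundaries as defined by UAX #29.
fn char_boundaries(text: &str) -> Vec<usize> {
    let mut offsets: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    offsets.push(text.len());
    offsets
}

/// Number of bytes World Map (U+1F5FA) occupies in UTF-8.
fn world_map_utf8_len() -> usize {
    '\u{1F5FA}'.len_utf8()
}
```

Here `world_map_utf8_len()` returns 4, and `char_boundaries("Hello \u{1F5FA}")` returns `[0, 1, 2, 3, 4, 5, 6, 10]`: the six ASCII characters advance one byte each, while the final character advances from offset 6 to offset 10.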
Segment a Latin1 byte string:
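use icu_segmenter::GraphemeClusterBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = GraphemeClusterBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_latin1(…).collect();
assert_eq!(&breakpoints, &[…]);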
Word Break
Segment a string:
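use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_str(…).collect();
assert_eq!(&breakpoints, &[…]);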
Segment a Latin1 byte string:
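use icu_segmenter::WordBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = WordBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_latin1(…).collect();
assert_eq!(&breakpoints, &[…]);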
Sentence Break
Segment a string:
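use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_str(…).collect();
assert_eq!(&breakpoints, &[…]);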
Segment a Latin1 byte string:
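use icu_segmenter::SentenceBreakSegmenter;
let provider = icu_testdata::get_provider();
let segmenter = SentenceBreakSegmenter::try_new(&provider).expect(…);
let breakpoints: Vec<usize> = segmenter.segment_latin1(…).collect();
assert_eq!(&breakpoints, &[…]);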
More Information
For more information on development, authorship, contributing, etc., please visit the ICU4X home page.