WordSegmenter

Struct WordSegmenter 

pub struct WordSegmenter { /* private fields */ }

Supports loading word break data, and creating word break iterators for different string encodings.

Most segmentation methods live on WordSegmenterBorrowed, which is returned directly by WordSegmenter::new_auto() and the other new_* constructors, and can be obtained from an owned WordSegmenter via WordSegmenter::as_borrowed().

§Content Locale

You can optionally provide a content locale to the WordSegmenter constructor. If you have information on the language of the text being segmented, providing this hint can produce higher-quality results.

If you have a content locale, use WordBreakOptions and a constructor beginning with try_new. If you do not have a content locale, use WordBreakInvariantOptions and a constructor beginning with new.

§Examples

Segment a string:

use icu::segmenter::WordSegmenter;

let segmenter = WordSegmenter::new_auto(Default::default());

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Segment a Latin1 byte string with a content locale:

use icu::locale::langid;
use icu::segmenter::options::WordBreakOptions;
use icu::segmenter::WordSegmenter;

let mut options = WordBreakOptions::default();
let langid = &langid!("en");
options.content_locale = Some(langid);
let segmenter = WordSegmenter::try_new_auto(options).unwrap();

let breakpoints: Vec<usize> = segmenter
    .as_borrowed()
    .segment_latin1(b"Hello World")
    .collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

Successive boundaries can be used to retrieve the segments. In particular, the first boundary is always 0, and the last one is the length of the segmented text in code units.

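use icu::segmenter::WordSegmenter;
use itertools::Itertools;

// Same setup as in the first example.
let segmenter = WordSegmenter::new_auto(Default::default());
let text = "Mark’d ye his words?";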
let segments: Vec<&str> = segmenter
    .segment_str(text)
    .tuple_windows()
    .map(|(i, j)| &text[i..j])
    .collect();
assert_eq!(
    &segments,
    &["Mark’d", " ", "ye", " ", "his", " ", "words", "?"]
);
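
If pulling in itertools is not desirable, the same pairing of successive boundaries can be sketched with the standard library alone. The helper below is hypothetical (it is not part of icu::segmenter) and assumes the breakpoints were already collected into a slice, as in the first example:

```rust
/// Splits `text` into the segments delimited by consecutive breakpoints.
///
/// Hypothetical helper, not part of `icu::segmenter`. It assumes that
/// `breakpoints` is sorted and that every entry is a valid UTF-8 boundary
/// of `text`; slicing panics otherwise.
fn segments_from_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        // Each window holds two neighbouring boundaries: [start, end].
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

Because `windows(2)` yields nothing for fewer than two boundaries, a slice with a single boundary produces no segments.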

Not all segments delimited by word boundaries are words; some are interword segments such as spaces and punctuation. The WordBreakIterator::word_type() of a boundary can be used to classify the preceding segment; WordBreakIterator::iter_with_word_type() associates each boundary with its status.

let words: Vec<&str> = segmenter
    .segment_str(text)
    .iter_with_word_type()
    .tuple_windows()
    .filter(|(_, (_, segment_type))| segment_type.is_word_like())
    .map(|((i, _), (j, _))| &text[i..j])
    .collect();
assert_eq!(&words, &["Mark’d", "ye", "his", "words"]);

Implementations§

impl WordSegmenter

pub fn new_auto( _options: WordBreakInvariantOptions, ) -> WordSegmenterBorrowed<'static>

Constructs a WordSegmenter with an invariant locale and the best available compiled data for complex scripts (Chinese, Japanese, Khmer, Lao, Myanmar, and Thai).

The current behavior, which is subject to change, is to use the LSTM model when available and the dictionary model for Chinese and Japanese.

Enabled with the compiled_data and auto Cargo features.

📚 Help choosing a constructor

§Examples

Behavior with complex scripts:

use icu::segmenter::{options::WordBreakInvariantOptions, WordSegmenter};

let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";

let segmenter =
    WordSegmenter::new_auto(WordBreakInvariantOptions::default());

let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();

assert_eq!(th_bps, [0, 9, 18, 39]);
assert_eq!(ja_bps, [0, 15, 21]);
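
The breakpoints above are UTF-8 byte offsets, which is why the Thai boundaries fall at 9 and 18 rather than 3 and 6: every character in these two strings occupies three bytes. When an interface needs code point offsets instead, a conversion along these lines may help. byte_to_char_offsets is a hypothetical helper, not part of the crate, and it assumes every offset lies on a char boundary, as word boundaries reported for a str do:

```rust
/// Converts UTF-8 byte offsets into `char` (code point) offsets.
///
/// Hypothetical helper, not part of `icu::segmenter`. Panics if an offset
/// does not fall on a `char` boundary of `text`.
fn byte_to_char_offsets(text: &str, byte_offsets: &[usize]) -> Vec<usize> {
    byte_offsets
        .iter()
        // Count the code points that precede each byte offset.
        .map(|&offset| text[..offset].chars().count())
        .collect()
}
```

This sketch rescans the prefix for every offset, which is quadratic in the worst case; for long texts a single pass over `char_indices()` would likely be a better fit.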

pub fn try_new_auto(options: WordBreakOptions<'_>) -> Result<Self, DataError>

Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_auto_with_buffer_provider( provider: &(impl BufferProvider + ?Sized), options: WordBreakOptions<'_>, ) -> Result<Self, DataError>

A version of Self::try_new_auto that uses custom data provided by a BufferProvider.

Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_auto_unstable<D>( provider: &D, options: WordBreakOptions<'_>, ) -> Result<Self, DataError>

A version of Self::new_auto that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn new_lstm( _options: WordBreakInvariantOptions, ) -> WordSegmenterBorrowed<'static>

Constructs a WordSegmenter with an invariant locale and compiled LSTM data for complex scripts (Burmese, Khmer, Lao, and Thai).

The LSTM, or Long Short-Term Memory, is a machine learning model. It is smaller than the full dictionary but more expensive during segmentation (inference).

Warning: there is not currently an LSTM model for Chinese or Japanese, so the WordSegmenter created by this function will have unexpected behavior in spans of those scripts.

Enabled with the compiled_data and lstm Cargo features.

📚 Help choosing a constructor

§Examples

Behavior with complex scripts:

use icu::segmenter::{options::WordBreakInvariantOptions, WordSegmenter};

let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";

let segmenter =
    WordSegmenter::new_lstm(WordBreakInvariantOptions::default());

let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();

assert_eq!(th_bps, [0, 9, 18, 39]);

// Note: We aren't able to find a suitable breakpoint in Chinese/Japanese.
assert_eq!(ja_bps, [0, 21]);

pub fn try_new_lstm(options: WordBreakOptions<'_>) -> Result<Self, DataError>

Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_lstm_with_buffer_provider( provider: &(impl BufferProvider + ?Sized), options: WordBreakOptions<'_>, ) -> Result<Self, DataError>

A version of Self::try_new_lstm that uses custom data provided by a BufferProvider.

Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_lstm_unstable<D>( provider: &D, options: WordBreakOptions<'_>, ) -> Result<Self, DataError>

A version of Self::new_lstm that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn new_dictionary( _options: WordBreakInvariantOptions, ) -> WordSegmenterBorrowed<'static>

Constructs a WordSegmenter with an invariant locale and compiled dictionary data for complex scripts (Chinese, Japanese, Khmer, Lao, Myanmar, and Thai).

The dictionary model uses a list of words to determine appropriate breakpoints. It is faster than the LSTM model but requires more data.

Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

§Examples

Behavior with complex scripts:

use icu::segmenter::{options::WordBreakInvariantOptions, WordSegmenter};

let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";

let segmenter =
    WordSegmenter::new_dictionary(WordBreakInvariantOptions::default());

let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();

assert_eq!(th_bps, [0, 9, 18, 39]);
assert_eq!(ja_bps, [0, 15, 21]);

pub fn try_new_dictionary( options: WordBreakOptions<'_>, ) -> Result<Self, DataError>

Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_dictionary_with_buffer_provider( provider: &(impl BufferProvider + ?Sized), options: WordBreakOptions<'_>, ) -> Result<Self, DataError>

A version of Self::try_new_dictionary that uses custom data provided by a BufferProvider.

Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_dictionary_unstable<D>( provider: &D, options: WordBreakOptions<'_>, ) -> Result<Self, DataError>

A version of Self::new_dictionary that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn as_borrowed(&self) -> WordSegmenterBorrowed<'_>

Constructs a borrowed version of this type for more efficient querying.

Most useful methods for segmentation are on this type.

Trait Implementations§

impl Debug for WordSegmenter

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> ErasedDestructor for T
where T: 'static,