Struct icu::segmenter::WordSegmenter

impl WordSegmenter

pub fn try_new_auto_unstable<D>( provider: &D ) -> Result<WordSegmenter, SegmenterError> where D: DataProvider<WordBreakDataV1Marker> + DataProvider<DictionaryForWordOnlyAutoV1Marker> + DataProvider<LstmForWordLineAutoV1Marker> + DataProvider<GraphemeClusterBreakDataV1Marker> + ?Sized,

Constructs a WordSegmenter with an invariant locale and the best available data for complex scripts (Chinese, Japanese, Khmer, Lao, Myanmar, and Thai).

The current behavior, which is subject to change, is to use the LSTM model when available and the dictionary model for Chinese and Japanese.

Examples

Behavior with complex scripts:

use icu::segmenter::WordSegmenter;

let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";

let segmenter =
    WordSegmenter::try_new_auto_unstable(&icu_testdata::unstable())
        .unwrap();

let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();

assert_eq!(th_bps, [0, 9, 18, 39]);
assert_eq!(ja_bps, [0, 15, 21]);

pub fn try_new_auto_with_any_provider( provider: &impl AnyProvider ) -> Result<WordSegmenter, SegmenterError>

Creates a new instance using an AnyProvider.

For details on the behavior of this function, see: Self::try_new_auto_unstable

pub fn try_new_auto_with_buffer_provider( provider: &impl BufferProvider ) -> Result<WordSegmenter, SegmenterError>

✨ Enabled with the "serde" feature.

Creates a new instance using a BufferProvider.

For details on the behavior of this function, see: Self::try_new_auto_unstable

pub fn try_new_lstm_unstable<D>( provider: &D ) -> Result<WordSegmenter, SegmenterError> where D: DataProvider<WordBreakDataV1Marker> + DataProvider<LstmForWordLineAutoV1Marker> + DataProvider<GraphemeClusterBreakDataV1Marker> + ?Sized,

Constructs a WordSegmenter with an invariant locale and LSTM data for complex scripts (Burmese, Khmer, Lao, and Thai).

The LSTM, or Long Short-Term Memory, is a machine learning model. It requires less data than the full dictionary but is more expensive at segmentation (inference) time.

Warning: there is not currently an LSTM model for Chinese or Japanese, so the WordSegmenter created by this function will have unexpected behavior in spans of those scripts.

Examples

Behavior with complex scripts:

use icu::segmenter::WordSegmenter;

let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";

let segmenter =
    WordSegmenter::try_new_lstm_unstable(&icu_testdata::unstable())
        .unwrap();

let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();

assert_eq!(th_bps, [0, 9, 18, 39]);

// Note: We aren't able to find a suitable breakpoint in Chinese/Japanese.
assert_eq!(ja_bps, [0, 21]);
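
Because of the warning above, a caller may want to check whether the input contains Chinese or Japanese text before choosing the LSTM-only constructor. The following is a minimal sketch using only the standard library. The helper name contains_han_or_kana is hypothetical and not part of ICU4X, and the hard-coded code point ranges are an assumption covering only the most common blocks (they omit, for example, the supplementary CJK extensions and halfwidth Katakana).

```rust
/// Hypothetical helper (not part of ICU4X): reports whether `text` contains
/// a character from a few common Han, Hiragana, or Katakana blocks.
/// The block list is illustrative and deliberately incomplete.
fn contains_han_or_kana(text: &str) -> bool {
    text.chars().any(|c| {
        matches!(
            c,
            '\u{3040}'..='\u{309F}'       // Hiragana
                | '\u{30A0}'..='\u{30FF}' // Katakana
                | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
                | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
        )
    })
}
```

If the check returns true, a constructor that loads dictionary data (for example try_new_auto_unstable or try_new_dictionary_unstable) is likely the better fit for that input.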

pub fn try_new_lstm_with_any_provider( provider: &impl AnyProvider ) -> Result<WordSegmenter, SegmenterError>

Creates a new instance using an AnyProvider.

For details on the behavior of this function, see: Self::try_new_lstm_unstable

pub fn try_new_lstm_with_buffer_provider( provider: &impl BufferProvider ) -> Result<WordSegmenter, SegmenterError>

✨ Enabled with the "serde" feature.

Creates a new instance using a BufferProvider.

For details on the behavior of this function, see: Self::try_new_lstm_unstable

pub fn try_new_dictionary_unstable<D>( provider: &D ) -> Result<WordSegmenter, SegmenterError> where D: DataProvider<WordBreakDataV1Marker> + DataProvider<DictionaryForWordOnlyAutoV1Marker> + DataProvider<DictionaryForWordLineExtendedV1Marker> + DataProvider<GraphemeClusterBreakDataV1Marker> + ?Sized,

Constructs a WordSegmenter with an invariant locale and dictionary data for complex scripts (Chinese, Japanese, Khmer, Lao, Myanmar, and Thai).

The dictionary model uses a list of words to determine appropriate breakpoints. It is faster than the LSTM model but requires more data.

Examples

Behavior with complex scripts:

use icu::segmenter::WordSegmenter;

let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";

let segmenter =
    WordSegmenter::try_new_dictionary_unstable(&icu_testdata::unstable())
        .unwrap();

let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();

assert_eq!(th_bps, [0, 9, 18, 39]);
assert_eq!(ja_bps, [0, 15, 21]);

pub fn try_new_dictionary_with_any_provider( provider: &impl AnyProvider ) -> Result<WordSegmenter, SegmenterError>

Creates a new instance using an AnyProvider.

For details on the behavior of this function, see: Self::try_new_dictionary_unstable

pub fn try_new_dictionary_with_buffer_provider( provider: &impl BufferProvider ) -> Result<WordSegmenter, SegmenterError>

✨ Enabled with the "serde" feature.

Creates a new instance using a BufferProvider.

For details on the behavior of this function, see: Self::try_new_dictionary_unstable

pub fn segment_str<'l, 's>( &'l self, input: &'s str ) -> WordBreakIterator<'l, 's, WordBreakTypeUtf8>

Creates a word break iterator for an str (a UTF-8 string).

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
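
The breakpoints in the examples above are byte offsets into the UTF-8 input, so consecutive pairs of breakpoints delimit the segments. As a minimal sketch (the helper split_at_breakpoints is hypothetical, not part of ICU4X), the collected breakpoints can be turned into string slices like this, assuming they are sorted and fall on character boundaries:

```rust
/// Hypothetical helper (not part of ICU4X): slices `text` at the given
/// byte-offset breakpoints. Assumes `breakpoints` is sorted, starts at 0,
/// ends at `text.len()`, and that every offset is a char boundary;
/// slicing panics otherwise.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // each adjacent pair [start, end] bounds one segment
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

With the breakpoints [0, 15, 21] documented for "こんにちは世界", this yields the two slices "こんにちは" and "世界". For the empty string the only breakpoint is 0, so there are no pairs and the result is empty.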

pub fn segment_utf8<'l, 's>( &'l self, input: &'s [u8] ) -> WordBreakIterator<'l, 's, WordBreakTypePotentiallyIllFormedUtf8>

Creates a word break iterator for a potentially ill-formed UTF-8 string.

Invalid characters are treated as REPLACEMENT CHARACTER (U+FFFD).

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
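
To visualize what treating invalid input as REPLACEMENT CHARACTER means, the sketch below uses the standard library's lossy decoder, which performs the same kind of U+FFFD substitution. This is only an illustration: it is not how segment_utf8 is implemented, the helper preview_ill_formed is hypothetical, and the number of replacement characters produced per invalid sequence may differ between the two.

```rust
/// Hypothetical helper (not part of ICU4X): reports whether `bytes` is
/// well-formed UTF-8 and returns a lossily decoded preview in which each
/// invalid sequence is shown as U+FFFD.
fn preview_ill_formed(bytes: &[u8]) -> (bool, String) {
    let well_formed = std::str::from_utf8(bytes).is_ok();
    let preview = String::from_utf8_lossy(bytes).into_owned();
    (well_formed, preview)
}
```

The preview string can be longer than the input, because U+FFFD takes three bytes in UTF-8, so offsets into the preview should not be mixed with offsets into the original byte slice.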

pub fn segment_latin1<'l, 's>( &'l self, input: &'s [u8] ) -> WordBreakIterator<'l, 's, RuleBreakTypeLatin1>

Creates a word break iterator for a Latin-1 (8-bit) string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
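
In Latin-1, each byte is one character in the range U+0000 to U+00FF, so a Rust str has to be re-encoded before it can be passed to this function. A minimal sketch of that conversion follows; the helper to_latin1 is hypothetical and not part of ICU4X.

```rust
/// Hypothetical helper (not part of ICU4X): encodes `text` as Latin-1,
/// one byte per character. Returns `None` if any character lies above
/// U+00FF and therefore has no Latin-1 representation.
fn to_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect() // collecting into Option stops at the first None
}
```

Text containing characters outside that range, such as Thai or Japanese, cannot be represented this way and should go through segment_str or segment_utf16 instead.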

pub fn segment_utf16<'l, 's>( &'l self, input: &'s [u16] ) -> WordBreakIterator<'l, 's, WordBreakTypeUtf16>

Creates a word break iterator for a UTF-16 string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
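
A Rust str can be converted to the u16 slice this function expects with the standard library's encode_utf16. The sketch below also decodes segments back into strings, under the assumption that the breakpoints index u16 code units in the same way that segment_str breakpoints index bytes. Both helper names are hypothetical and not part of ICU4X.

```rust
/// Hypothetical helper (not part of ICU4X): encodes `text` as UTF-16
/// code units suitable for passing as a `&[u16]`.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

/// Hypothetical helper (not part of ICU4X): decodes the segments bounded
/// by consecutive breakpoints. Assumes the breakpoints are sorted
/// code-unit offsets within `units`; unpaired surrogates decode to U+FFFD.
fn utf16_segments(units: &[u16], breakpoints: &[usize]) -> Vec<String> {
    breakpoints
        .windows(2)
        .map(|pair| String::from_utf16_lossy(&units[pair[0]..pair[1]]))
        .collect()
}
```

For example, "こんにちは世界" encodes to 7 code units, so the byte breakpoints [0, 15, 21] shown for segment_str would correspond to [0, 5, 7] here. Those UTF-16 values are derived by hand from the UTF-8 ones, not taken from actual segment_utf16 output.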

Trait Implementations

impl Debug for WordSegmenter

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter.

Auto Trait Implementations

impl UnwindSafe for WordSegmenter

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> ErasedDestructor for T where T: 'static,