Struct Segmenter 

pub struct Segmenter {
    pub mode: Mode,
    pub dictionary: Dictionary,
    pub user_dictionary: Option<UserDictionary>,
    pub keep_whitespace: bool,
    /* private fields */
}
A segmenter for tokenizing text. The Segmenter splits input text into tokens using a dictionary, an optional user dictionary, and a segmentation mode.

Fields§

§mode: Mode

The segmentation mode to be used by the segmenter. This determines how the text will be split into segments.

§dictionary: Dictionary

The dictionary used for segmenting text. This dictionary contains the necessary data structures and algorithms to perform morphological analysis and tokenization.

§user_dictionary: Option<UserDictionary>

An optional user-defined dictionary that can be used to customize the segmentation process. If provided, this dictionary will be used in addition to the default dictionary to improve the accuracy of segmentation for specific words or phrases.

§keep_whitespace: bool

Keep whitespace tokens in output.

When false (default), whitespace is ignored for MeCab compatibility. When true, whitespace tokens are included in the output.

Implementations§

impl Segmenter

pub fn new(mode: Mode, dictionary: Dictionary, user_dictionary: Option<UserDictionary>) -> Self

Creates a new Segmenter with the specified mode, dictionary, and optional user dictionary.

§Arguments
  • mode - The Mode in which the segmenter operates. This typically defines how aggressively the text is segmented or processed.
  • dictionary - The main Dictionary providing the core data and rules for tokenization.
  • user_dictionary - An optional UserDictionary whose user-defined tokens and rules extend or override those of the main dictionary.
§Returns

A new Segmenter configured with the provided mode, dictionary, and user dictionary (if any).

pub fn keep_whitespace(self, keep_whitespace: bool) -> Self

Builder method to set whether to keep whitespace tokens in output.

When keep_whitespace is false (default), whitespace is ignored for MeCab compatibility. When true, whitespace tokens are included in the output.

§Arguments
  • keep_whitespace - If true, whitespace tokens will be included in the output.
§Example
use lindera::mode::Mode;
use lindera::dictionary::load_dictionary;
use lindera::segmenter::Segmenter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None)
        .keep_whitespace(true);
    // `segmenter` is now ready to segment text.
    Ok(())
}

pub fn from_config(config: &SegmenterConfig) -> LinderaResult<Self>

Creates a Segmenter from the given SegmenterConfig.

§Errors

Returns an error if the dictionary or user dictionary specified in the configuration fails to load.


pub fn segment<'a>(&'a self, text: Cow<'a, str>) -> LinderaResult<Vec<Token<'a>>>

Segments the input text into tokens based on the dictionary and user-defined rules.

§Arguments
  • text - A Cow<'a, str> representing the input text. This can either be borrowed or owned, allowing for efficient text handling depending on the use case.
§Returns

Returns a LinderaResult<Vec<Token<'a>>> which contains a vector of tokens segmented from the input text. Each token represents a portion of the original text, along with metadata such as byte offsets and dictionary information.

§Process
  1. Sentence Splitting:

    • The input text is split into sentences using Japanese punctuation (。, 、), as well as \n and \t. Each sentence is processed individually.
  2. Lattice Processing:

    • For each sentence, a lattice structure is set up using the main dictionary and, if available, the user dictionary. The lattice helps identify possible token boundaries within the sentence.
    • The cost matrix is used to calculate the best path (i.e., the optimal sequence of tokens) through the lattice based on the mode.
  3. Token Generation:

    • For each segment (determined by the lattice), a token is generated using the byte offsets. The tokens contain the original text (in Cow::Owned form to ensure safe return), byte start/end positions, token positions, and dictionary references.
§Notes
  • The function ensures that each token is safely returned by converting substrings into Cow::Owned strings.
  • Byte offsets are carefully calculated to ensure that token boundaries are correct even across multiple sentences.
§Example Flow
  • Text is split into sentences based on punctuation.
  • A lattice is created and processed for each sentence.
  • Tokens are extracted from the lattice and returned in a vector.
§Errors
  • If the lattice fails to be processed or if there is an issue with the segmentation process, the function returns an error.
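The lattice search described above is a classic Viterbi-style shortest-path problem. The following self-contained sketch illustrates the idea with a toy dictionary of per-word costs; it is not lindera's implementation (which also applies connection costs from the cost matrix between adjacent tokens):

```rust
use std::collections::HashMap;

/// Minimal Viterbi search over a word lattice: from each byte position,
/// try every dictionary word that matches, and keep the cheapest path.
/// `word_cost` stands in for dictionary entries with their costs.
fn best_path(text: &str, word_cost: &HashMap<&str, i64>) -> Option<(Vec<String>, i64)> {
    let n = text.len();
    // best[i] = (cost of cheapest segmentation of text[..i], predecessor, word)
    let mut best: Vec<Option<(i64, usize, String)>> = vec![None; n + 1];
    best[0] = Some((0, 0, String::new()));
    for i in 0..n {
        let Some((cost_i, _, _)) = best[i].clone() else { continue };
        for (word, &c) in word_cost {
            if text[i..].starts_with(word) {
                let j = i + word.len();
                let cand = cost_i + c;
                if best[j].as_ref().map_or(true, |(c2, _, _)| cand < *c2) {
                    best[j] = Some((cand, i, word.to_string()));
                }
            }
        }
    }
    // Walk the predecessor chain back from the end of the text.
    let total = best[n].as_ref()?.0;
    let mut tokens = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let (_, prev, word) = best[pos].clone().unwrap();
        tokens.push(word);
        pos = prev;
    }
    tokens.reverse();
    Some((tokens, total))
}

fn main() {
    let dict = HashMap::from([("ab", 10), ("a", 7), ("b", 7), ("c", 5), ("abc", 20)]);
    let (tokens, cost) = best_path("abc", &dict).unwrap();
    println!("{tokens:?} cost={cost}"); // ["ab", "c"] cost=15
}
```

The real segmenter builds the lattice per sentence and scores transitions between adjacent tokens as well, but the dynamic program over token boundaries is the same shape.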

pub fn segment_with_lattice<'a>(&'a self, text: Cow<'a, str>, lattice: &mut Lattice) -> LinderaResult<Vec<Token<'a>>>

Segments the input text into tokens based on the dictionary and user-defined rules.

§Arguments
  • text - A Cow<'a, str> representing the input text. This can either be borrowed or owned, allowing for efficient text handling depending on the use case.
  • lattice - A mutable reference to a Lattice structure. This allows reusing the lattice across multiple calls to avoid memory allocation.
§Returns

Returns a LinderaResult<Vec<Token<'a>>> which contains a vector of tokens segmented from the input text. Each token represents a portion of the original text, along with metadata such as byte offsets and dictionary information.

§Process
  1. Sentence Splitting:

    • The input text is split into sentences using Japanese punctuation (。, 、), as well as \n and \t. Each sentence is processed individually.
  2. Lattice Processing:

    • For each sentence, a lattice structure is set up using the main dictionary and, if available, the user dictionary. The lattice helps identify possible token boundaries within the sentence.
    • The cost matrix is used to calculate the best path (i.e., the optimal sequence of tokens) through the lattice based on the mode.
  3. Token Generation:

    • For each segment (determined by the lattice), a token is generated using the byte offsets. The tokens contain the original text (in Cow::Owned form to ensure safe return), byte start/end positions, token positions, and dictionary references.
§Notes
  • The function ensures that each token is safely returned by converting substrings into Cow::Owned strings.
  • Byte offsets are carefully calculated to ensure that token boundaries are correct even across multiple sentences.
§Example Flow
  • Text is split into sentences based on punctuation.
  • A lattice is created and processed for each sentence.
  • Tokens are extracted from the lattice and returned in a vector.
§Errors
  • If the lattice fails to be processed or if there is an issue with the segmentation process, the function returns an error.
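The lattice parameter exists so that a caller segmenting many texts can reuse one allocation instead of building a fresh lattice per call. The sketch below shows the general buffer-reuse pattern with a hypothetical Scratch type standing in for Lattice (not lindera's actual type):

```rust
/// Illustration of the reuse pattern behind `segment_with_lattice`:
/// the caller owns a scratch structure and clears it between calls,
/// so repeated processing does not reallocate.
#[derive(Default)]
struct Scratch {
    nodes: Vec<usize>,
}

fn process_with(text: &str, scratch: &mut Scratch) -> usize {
    scratch.nodes.clear(); // reuse the existing allocation
    scratch.nodes.extend(text.char_indices().map(|(i, _)| i));
    scratch.nodes.len()
}

fn main() {
    let mut scratch = Scratch::default();
    for text in ["東京", "大阪市"] {
        let n = process_with(text, &mut scratch);
        println!("{text}: {n} chars");
    }
}
```

With `segment_with_lattice`, the analogous pattern is to construct one Lattice, then pass `&mut lattice` on every call in a hot loop.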

pub fn segment_nbest<'a>(&'a self, text: Cow<'a, str>, n: usize, unique: bool, cost_threshold: Option<i64>) -> LinderaResult<Vec<(Vec<Token<'a>>, i64)>>

Segments the input text and returns the top-N segmentation results.

Each result is a Vec<Token> representing one possible segmentation. Results are ordered by cost (best first). If unique is true, results with the same word boundaries but different POS tags are deduplicated (only the lowest-cost variant is kept).


pub fn segment_nbest_with_lattice<'a>(&'a self, text: Cow<'a, str>, lattice: &mut Lattice, n: usize, unique: bool, cost_threshold: Option<i64>) -> LinderaResult<Vec<(Vec<Token<'a>>, i64)>>

Segments the input text and returns the top-N segmentation results with costs. Each result is a (tokens, cost) pair. If unique is true, results with the same word boundaries but different POS tags are deduplicated (only the lowest-cost variant is kept). If cost_threshold is Some(t), paths whose cost exceeds best_cost + t are discarded.
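The unique and cost_threshold filters can be pictured as a post-processing pass over cost-sorted candidates. A self-contained sketch of that logic, with byte spans standing in for real Token values:

```rust
use std::collections::HashSet;

/// Sketch of the filtering described for `segment_nbest`:
/// given candidate segmentations and their costs, sort best-first,
/// drop paths costlier than `best_cost + threshold`, and (if `unique`)
/// keep only the lowest-cost result per distinct set of word boundaries.
fn filter_nbest(
    mut results: Vec<(Vec<(usize, usize)>, i64)>, // (byte spans, cost)
    unique: bool,
    cost_threshold: Option<i64>,
) -> Vec<(Vec<(usize, usize)>, i64)> {
    results.sort_by_key(|(_, cost)| *cost);
    let best = results.first().map(|(_, c)| *c);
    let mut seen = HashSet::new();
    results
        .into_iter()
        .filter(|(spans, cost)| {
            if let (Some(b), Some(t)) = (best, cost_threshold) {
                if *cost > b + t {
                    return false; // over best_cost + threshold: discard
                }
            }
            // With `unique`, only the first (lowest-cost) result for each
            // distinct boundary set survives.
            !unique || seen.insert(spans.clone())
        })
        .collect()
}

fn main() {
    let candidates = vec![
        (vec![(0, 3), (3, 6)], 100),
        (vec![(0, 3), (3, 6)], 120), // same boundaries, higher cost: dropped
        (vec![(0, 6)], 150),
        (vec![(0, 2), (2, 6)], 400), // exceeds best + 200: dropped
    ];
    let kept = filter_nbest(candidates, true, Some(200));
    println!("{} results kept", kept.len()); // 2
}
```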

Trait Implementations§

impl Clone for Segmenter

fn clone(&self) -> Segmenter

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> ArchivePointee for T

type ArchivedMetadata = ()

The archived version of the pointer metadata for this type.

fn pointer_metadata(_: &<T as ArchivePointee>::ArchivedMetadata) -> <T as Pointee>::Metadata

Converts some archived metadata to the pointer metadata for itself.

impl<T> Borrow<T> for T
where T: ?Sized

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone

unsafe fn clone_to_uninit(&self, dest: *mut u8)

Performs copy-assignment from self to dest. (Nightly-only experimental API: clone_to_uninit.)

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>

fn into(self) -> U

Calls U::from(self). That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> LayoutRaw for T

fn layout_raw(_: <T as Pointee>::Metadata) -> Result<Layout, LayoutError>

Returns the layout of the type.

impl<T, N1, N2> Niching<NichedOption<T, N1>> for N2
where T: SharedNiching<N1, N2>, N1: Niching<T>, N2: Niching<T>

unsafe fn is_niched(niched: *const NichedOption<T, N1>) -> bool

Returns whether the given value has been niched.

fn resolve_niched(out: Place<NichedOption<T, N1>>)

Writes data to out indicating that a T is niched.

impl<T> Pointee for T

type Metadata = ()

The metadata type for pointers and references to this type.

impl<T> ToOwned for T
where T: Clone

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.