pub struct Segmenter {
pub mode: Mode,
pub dictionary: Dictionary,
pub user_dictionary: Option<UserDictionary>,
pub keep_whitespace: bool,
/* private fields */
}
A segmenter for tokenizing text.
Fields§
§mode: Mode
The segmentation mode to be used by the segmenter. This determines how the text is split into segments.
§dictionary: Dictionary
The dictionary used for segmenting text. It contains the data structures needed to perform morphological analysis and tokenization.
§user_dictionary: Option<UserDictionary>
An optional user-defined dictionary that can be used to customize the segmentation process. If provided, it is used in addition to the default dictionary to improve the accuracy of segmentation for specific words or phrases.
§keep_whitespace: bool
Whether to keep whitespace tokens in the output. When false (the default), whitespace is ignored for MeCab compatibility. When true, whitespace tokens are included in the output.
Implementations§
impl Segmenter
pub fn new(
    mode: Mode,
    dictionary: Dictionary,
    user_dictionary: Option<UserDictionary>,
) -> Self
Creates a new instance with the specified mode, dictionary, and optional user dictionary.
§Arguments
- mode - The Mode in which the instance will operate. This typically defines how aggressively the text is segmented or processed.
- dictionary - A Dictionary object that provides the core data and rules for processing text.
- user_dictionary - An optional UserDictionary that allows additional, user-defined tokens or rules to be used in conjunction with the main dictionary.
§Returns
Returns a new instance of the struct with the provided mode, dictionary, and user dictionary (if any).
§Details
- mode: Defines the behavior of the instance, such as whether to process text in normal or aggressive mode.
- dictionary: The main dictionary containing tokenization and processing rules.
- user_dictionary: Optional. If provided, it allows the user to extend or override the rules of the main dictionary with custom tokens.
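Construction can be sketched as follows. This is a minimal example, not a definitive recipe: it assumes lindera is built with its embedded IPADIC dictionary feature (so that load_dictionary accepts the "embedded://ipadic" URI shown elsewhere on this page) and that LinderaResult is re-exported at the crate root.

```rust
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::LinderaResult; // re-export path assumed

fn main() -> LinderaResult<()> {
    // Load the bundled IPADIC dictionary (requires the embedded
    // dictionary feature of the lindera crate).
    let dictionary = load_dictionary("embedded://ipadic")?;

    // Mode::Normal performs standard segmentation; no user dictionary.
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let _ = segmenter;
    Ok(())
}
```

A user dictionary, when needed, is passed as the third argument in place of None.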
pub fn keep_whitespace(self, keep_whitespace: bool) -> Self
Builder method to set whether to keep whitespace tokens in output.
When keep_whitespace is false (default), whitespace is ignored for MeCab compatibility.
When true, whitespace tokens are included in the output.
§Arguments
- keep_whitespace - If true, whitespace tokens will be included in the output.
§Example
use lindera::mode::Mode;
use lindera::dictionary::load_dictionary;
use lindera::segmenter::Segmenter;
let dictionary = load_dictionary("embedded://ipadic")?;
let segmenter = Segmenter::new(Mode::Normal, dictionary, None)
    .keep_whitespace(true);
pub fn from_config(config: &SegmenterConfig) -> LinderaResult<Self>
Creates a Segmenter from a given SegmenterConfig. This is a convenience alternative to new: the configuration specifies the mode, the dictionary to load, and an optional user dictionary.
§Methods
- from_config: Creates a Segmenter from a given configuration.
- new: Creates a new Segmenter with the specified mode, dictionary, and optional user dictionary.
- segment: Segments the given text into tokens.
§Errors
Methods that return LinderaResult may produce errors related to dictionary loading,
user dictionary loading, or tokenization process.
pub fn segment<'a>(
    &'a self,
    text: Cow<'a, str>,
) -> LinderaResult<Vec<Token<'a>>>
Segments the input text into tokens based on the dictionary and user-defined rules.
§Arguments
- text - A Cow<'a, str> representing the input text. This can be either borrowed or owned, allowing for efficient text handling depending on the use case.
§Returns
Returns a LinderaResult<Vec<Token<'a>>> which contains a vector of tokens segmented from the input text. Each token represents a portion of the original text, along with metadata such as byte offsets and dictionary information.
§Process
1. Sentence Splitting: The input text is split into sentences using Japanese punctuation (。, 、, \n, \t). Each sentence is processed individually.
2. Lattice Processing: For each sentence, a lattice structure is set up using the main dictionary and, if available, the user dictionary. The lattice helps identify possible token boundaries within the sentence. The cost matrix is then used to calculate the best path (i.e., the optimal sequence of tokens) through the lattice based on the mode.
3. Token Generation: For each segment determined by the lattice, a token is generated using byte offsets. The tokens contain the original text (in Cow::Owned form to ensure a safe return), byte start/end positions, token positions, and dictionary references.
§Notes
- The function ensures that each token is safely returned by converting substrings into Cow::Owned strings.
- Byte offsets are carefully calculated so that token boundaries remain correct even across multiple sentences.
§Example Flow
- Text is split into sentences based on punctuation.
- A lattice is created and processed for each sentence.
- Tokens are extracted from the lattice and returned in a vector.
§Errors
- If the lattice fails to be processed or if there is an issue with the segmentation process, the function returns an error.
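A typical call can be sketched as below. This is a hedged example: it assumes the embedded IPADIC dictionary and that Token exposes public text, byte_start, and byte_end fields (the metadata described above); field names should be checked against the Token documentation.

```rust
use std::borrow::Cow;

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::LinderaResult; // re-export path assumed

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    // Borrowed input avoids an up-front allocation; Cow::Owned also works.
    let tokens = segmenter.segment(Cow::Borrowed("日本語の形態素解析"))?;
    for token in &tokens {
        // Each token carries its surface text and byte offsets into the
        // original input (field names assumed).
        println!("{}\t{}..{}", token.text, token.byte_start, token.byte_end);
    }
    Ok(())
}
```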
pub fn segment_with_lattice<'a>(
    &'a self,
    text: Cow<'a, str>,
    lattice: &mut Lattice,
) -> LinderaResult<Vec<Token<'a>>>
Segments the input text into tokens based on the dictionary and user-defined rules.
§Arguments
- text - A Cow<'a, str> representing the input text. This can be either borrowed or owned, allowing for efficient text handling depending on the use case.
- lattice - A mutable reference to a Lattice structure. This allows reusing the lattice across multiple calls to avoid memory allocation.
§Returns
Returns a LinderaResult<Vec<Token<'a>>> which contains a vector of tokens segmented from the input text. Each token represents a portion of the original text, along with metadata such as byte offsets and dictionary information.
§Process
1. Sentence Splitting: The input text is split into sentences using Japanese punctuation (。, 、, \n, \t). Each sentence is processed individually.
2. Lattice Processing: For each sentence, a lattice structure is set up using the main dictionary and, if available, the user dictionary. The lattice helps identify possible token boundaries within the sentence. The cost matrix is then used to calculate the best path (i.e., the optimal sequence of tokens) through the lattice based on the mode.
3. Token Generation: For each segment determined by the lattice, a token is generated using byte offsets. The tokens contain the original text (in Cow::Owned form to ensure a safe return), byte start/end positions, token positions, and dictionary references.
§Notes
- The function ensures that each token is safely returned by converting substrings into Cow::Owned strings.
- Byte offsets are carefully calculated so that token boundaries remain correct even across multiple sentences.
§Example Flow
- Text is split into sentences based on punctuation.
- A lattice is created and processed for each sentence.
- Tokens are extracted from the lattice and returned in a vector.
§Errors
- If the lattice fails to be processed or if there is an issue with the segmentation process, the function returns an error.
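Reusing one lattice across many calls can be sketched as follows. Assumptions in this sketch: the embedded IPADIC dictionary, that Lattice implements Default, and the lindera::viterbi import path for Lattice; verify both against the crate before relying on them.

```rust
use std::borrow::Cow;

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::viterbi::Lattice; // import path assumed
use lindera::LinderaResult; // re-export path assumed

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    // One lattice, reused for every input: its internal buffers are
    // recycled instead of being reallocated on each call.
    let mut lattice = Lattice::default(); // constructor assumed
    for line in ["すもももももももものうち", "東京都に住む"] {
        let tokens = segmenter.segment_with_lattice(Cow::Borrowed(line), &mut lattice)?;
        println!("{} tokens", tokens.len());
    }
    Ok(())
}
```

This variant is worth reaching for in tight loops (e.g. tokenizing a large corpus line by line), where per-call allocation in segment would otherwise dominate.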
pub fn segment_nbest<'a>(
    &'a self,
    text: Cow<'a, str>,
    n: usize,
    unique: bool,
    cost_threshold: Option<i64>,
) -> LinderaResult<Vec<(Vec<Token<'a>>, i64)>>
Segments the input text and returns the top-N segmentation results.
Each result is a (tokens, cost) pair representing one possible segmentation; results are ordered by cost (best first).
If unique is true, results with the same word boundaries but different POS tags are deduplicated (only the lowest-cost variant is kept).
If cost_threshold is Some(t), paths whose cost exceeds best_cost + t are discarded.
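The n-best API can be sketched like this (same hedges as the earlier examples: embedded IPADIC dictionary, assumed LinderaResult re-export, and an assumed public text field on Token; the input string is merely illustrative):

```rust
use std::borrow::Cow;

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::LinderaResult; // re-export path assumed

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);

    // Top 3 segmentations, deduplicated by word boundaries, no cost cutoff.
    let results = segmenter.segment_nbest(Cow::Borrowed("外国人参政権"), 3, true, None)?;
    for (rank, (tokens, cost)) in results.iter().enumerate() {
        let surfaces: Vec<&str> = tokens.iter().map(|t| t.text.as_ref()).collect();
        println!("#{} (cost {}): {}", rank + 1, cost, surfaces.join(" / "));
    }
    Ok(())
}
```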
pub fn segment_nbest_with_lattice<'a>(
    &'a self,
    text: Cow<'a, str>,
    lattice: &mut Lattice,
    n: usize,
    unique: bool,
    cost_threshold: Option<i64>,
) -> LinderaResult<Vec<(Vec<Token<'a>>, i64)>>
Segments the input text and returns the top-N segmentation results with costs.
Each result is a (tokens, cost) pair.
If unique is true, results with the same word boundaries but different
POS tags are deduplicated (only the lowest-cost variant is kept).
If cost_threshold is Some(t), paths whose cost exceeds best_cost + t
are discarded.
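Combining lattice reuse with a cost threshold might look like this (a sketch under the same assumptions as above: embedded IPADIC, a Default-constructible Lattice at an assumed import path, and an illustrative threshold value of 2000):

```rust
use std::borrow::Cow;

use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::viterbi::Lattice; // import path assumed
use lindera::LinderaResult; // re-export path assumed

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let mut lattice = Lattice::default(); // constructor assumed

    // Keep at most 5 unique candidates, and drop any path whose cost
    // exceeds best_cost + 2000.
    let results = segmenter.segment_nbest_with_lattice(
        Cow::Borrowed("形態素解析"),
        &mut lattice,
        5,
        true,
        Some(2000),
    )?;
    println!("{} candidate segmentations", results.len());
    Ok(())
}
```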
Trait Implementations§
Auto Trait Implementations§
impl Freeze for Segmenter
impl RefUnwindSafe for Segmenter
impl Send for Segmenter
impl Sync for Segmenter
impl Unpin for Segmenter
impl UnwindSafe for Segmenter
Blanket Implementations§
impl<T> ArchivePointee for T
    type ArchivedMetadata = ()
    fn pointer_metadata(_: &<T as ArchivePointee>::ArchivedMetadata) -> <T as Pointee>::Metadata
impl<T> BorrowMut<T> for T where T: ?Sized
    fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> LayoutRaw for T
    fn layout_raw(_: <T as Pointee>::Metadata) -> Result<Layout, LayoutError>
impl<T, N1, N2> Niching<NichedOption<T, N1>> for N2
    unsafe fn is_niched(niched: *const NichedOption<T, N1>) -> bool
    fn resolve_niched(out: Place<NichedOption<T, N1>>)
        Writes data to out indicating that a T is niched.