Struct notmecab::Dict

pub struct Dict { /* fields omitted */ }

Implementations

impl Dict

pub fn load(
    sysdic: Blob,
    unkdic: Blob,
    matrix: Blob,
    unkchar: Blob
) -> Result<Dict, &'static str>

Load the sys.dic, unk.dic, matrix.bin, and char.bin files into memory and prepare the data stored in them to be used by the parser.

Returns a Dict or, on error, a string describing the problem that prevented the Dict from being created.

Only supports UTF-8 mecab dictionaries with a version number of 0x66.

Ensures that sys.dic and matrix.bin have compatible connection matrix sizes.
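
A minimal loading sketch, assuming the standard mecab data file names and that Blob::open is the crate's file-reading constructor (check the Blob type's documentation for the exact API):

    use notmecab::{Blob, Dict};

    // The four mecab data files; Blob::open is assumed here to read a
    // file into memory.
    let sysdic = Blob::open("data/sys.dic").unwrap();
    let unkdic = Blob::open("data/unk.dic").unwrap();
    let matrix = Blob::open("data/matrix.bin").unwrap();
    let unkchar = Blob::open("data/char.bin").unwrap();

    // On failure, load returns a &'static str describing the problem.
    let mut dict = Dict::load(sysdic, unkdic, matrix, unkchar).expect("failed to load dictionary");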

pub fn load_user_dictionary(
    &mut self,
    userdic: Blob
) -> Result<(), &'static str>

Load a user dictionary whose entries are lines of comma-separated fields.

The first four fields are the surface, left context ID, right context ID, and cost of the token.

Everything past the fourth comma is treated as pure text and is the token's feature string. It is itself normally a list of comma-separated fields with the same format as the feature strings of the main mecab dictionary.
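
As an illustrative sketch, one made-up entry and how it would be loaded; the context IDs, cost, and feature fields below are placeholders whose real values depend on the main dictionary in use:

    // A user dictionary line: surface, left ID, right ID, cost, feature string.
    // All numbers here are illustrative placeholders:
    //
    //     ねこ,1285,1285,5000,名詞,一般,*,*,*,*,ねこ,ネコ,ネコ
    //
    let userdic = Blob::open("data/userdic.csv").unwrap(); // hypothetical file
    dict.load_user_dictionary(userdic).expect("failed to load user dictionary");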

pub fn read_feature_string(&self, token: &LexerToken) -> &str

Returns the feature string belonging to a LexerToken.
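
A sketch of typical use; the feature string's layout is dictionary-defined (for IPA-style dictionaries it is itself a comma-separated list beginning with the part of speech):

    let (tokens, _cost) = dict.tokenize("これを持っていけ").unwrap();
    for token in &tokens {
        // Print each token's dictionary-defined feature string.
        println!("{}", dict.read_feature_string(token));
    }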

pub fn read_feature_string_by_source(
    &self,
    kind: TokenType,
    offset: u32
) -> &str

Returns the feature string stored at the given offset for the given token source. Calling this with values not taken from a real LexerToken is unsupported behavior.

pub fn prepare_fast_matrix_cache(
    &mut self,
    fast_left_edges: Vec<u16>,
    fast_right_edges: Vec<u16>
)

Optional feature for applications that need to use as little memory as possible without accessing disk constantly. "Undocumented". May be removed at any time for any reason.

Does nothing if prepare_full_matrix_cache has already been called.

pub fn prepare_full_matrix_cache(&mut self)

Load the entire connection matrix into memory. Suitable for small dictionaries, but is actually SLOWER than using prepare_fast_matrix_cache properly for extremely large dictionaries, like modern versions of unidic. "Undocumented".

Overrides prepare_fast_matrix_cache if it has been called before.
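
A sketch of the tradeoff; the edge ID vectors for the fast cache are placeholders that a real application would derive from knowing which connection matrix rows and columns it actually touches:

    // Small dictionary: cache the entire connection matrix.
    dict.prepare_full_matrix_cache();

    // Extremely large dictionary: cache only the hot edges instead.
    // The ID lists here are illustrative placeholders.
    // dict.prepare_fast_matrix_cache(vec![0, 1, 2], vec![0, 1, 2]);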

pub fn tokenize(
    &self,
    text: &str
) -> Result<(Vec<LexerToken>, i64), TokenizeError>

Tokenizes a string by creating a lattice of possible tokens over it and finding the lowest-cost path through that lattice.

See Dict::tokenize_with_cache for more details.
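
A minimal one-shot sketch:

    let (tokens, cost) = dict.tokenize("これを持っていけ").unwrap();
    println!("{} tokens on the lowest-cost path, total cost {}", tokens.len(), cost);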

pub fn tokenize_with_cache(
    &self,
    cache: &mut Cache,
    text: &str,
    output: &mut Vec<LexerToken>
) -> Result<i64, TokenizeError>

Tokenizes a string by creating a lattice of possible tokens over it and finding the lowest-cost path through that lattice.

If successful, the contents of output will be replaced with a list of tokens and the total cost of the tokenization will be returned. If unsuccessful, output will be cleared and a TokenizeError will be returned.

The dictionary itself defines what tokens exist, how they appear in the string, their costs, and the costs of their possible connections.

It's possible for multiple paths to tie for the lowest cost. It's not defined which path is returned in that case.

If you'll be calling this method multiple times, reuse the same Cache object across invocations for increased efficiency.
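
A sketch of buffer reuse across calls, assuming Cache::new() is the constructor (check the Cache type's documentation):

    use notmecab::Cache;

    let mut cache = Cache::new(); // assumed constructor
    let mut tokens = Vec::new();
    for line in ["これを", "持っていけ"] {
        // Reusing cache and tokens avoids reallocating on every call.
        let cost = dict.tokenize_with_cache(&mut cache, line, &mut tokens).unwrap();
        println!("{}: {} tokens, cost {}", line, tokens.len(), cost);
    }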

pub fn set_space_stripping(&mut self, setting: bool) -> bool

Set whether the 0x20 whitespace stripping behavior is enabled. Returns the previous value of the setting.

Enabled by default.

When enabled, spaces are virtually added to the front of the next token/tokens during lattice construction. This has the effect of turning 0x20 whitespace sequences into forced separators without affecting connection costs, but makes it slightly more difficult to reconstruct the exact original text from the output of the parser.
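
For example, to tokenize text in which spaces must be kept as ordinary input, the setting can be disabled and later restored; the returned previous value exists for exactly this pattern:

    let previous = dict.set_space_stripping(false);
    // ... tokenize text where 0x20 spaces should not act as forced separators ...
    dict.set_space_stripping(previous); // restore the old behavior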

pub fn set_unk_forced_processing(&mut self, setting: bool) -> bool

Set whether support for forced unknown token processing is enabled. Returns the previous value of the setting.

Enabled by default.

When the parser's input string has locations where no entries can be found in the dictionary, the parser has to fill that location with unknown tokens. The unknown tokens are made by grouping up as many compatible characters as possible AND/OR grouping up every group of compatible characters from a length of 1 to a length of N. Whether either type of grouping is done (and how long the maximum prefix group is) is specified for each character in the unknown character data (usually char.bin).

The unknown character data can also specify that certain character types always trigger grouping into unknown tokens, even if the given location in the input string can be found in a normal dictionary. Disabling this setting will override that data and cause the lattice builder to ONLY create unknown tokens when nothing can be found in a normal dictionary.

If all unknown character processing at some problematic point in the input string fails for some reason, such as a defective unknown character data file, or one or both of the grouping modes being disabled, then that problematic point in the input string will create a single-character unknown token.

When enabled, the unknown character data's flag for forcing processing is observed. When disabled, it is ignored, and processing is never forced.

pub fn set_unk_greedy_grouping(&mut self, setting: bool) -> bool

Set whether greedy grouping behavior is enabled. Returns the previous value of the setting.

Enabled by default.

When enabled, problematic locations in the input string will (if specified in the unknown character data) be greedily grouped into an unknown token, covering all compatible characters.

Note that this does not prevent real words inside the grouping from being detected once the lattice constructor reaches them, which means that greedy grouping does not necessarily override prefix grouping; for some character types, the unknown character data will have both greedy grouping and prefix grouping enabled.

pub fn set_unk_prefix_grouping(&mut self, setting: bool) -> bool

Set whether prefix grouping behavior is enabled. Returns the previous value of the setting.

Enabled by default. See the documentation for the other set_unk_ functions for an explanation of what unknown token prefix grouping is.
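
As a sketch, disabling all three unknown-token behaviors restricts the parser to the single-character fallback described under set_unk_forced_processing:

    // With everything disabled, spots not covered by the dictionary fall
    // back to single-character unknown tokens.
    dict.set_unk_forced_processing(false);
    dict.set_unk_greedy_grouping(false);
    dict.set_unk_prefix_grouping(false);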

Auto Trait Implementations

impl !RefUnwindSafe for Dict

impl Send for Dict

impl Sync for Dict

impl Unpin for Dict

impl !UnwindSafe for Dict

Blanket Implementations

impl<T> Any for T where
    T: 'static + ?Sized

impl<T> Borrow<T> for T where
    T: ?Sized

impl<T> BorrowMut<T> for T where
    T: ?Sized

impl<T> From<T> for T

impl<T, U> Into<U> for T where
    U: From<T>, 

impl<T, U> TryFrom<U> for T where
    U: Into<T>, 

type Error = Infallible

The type returned in the event of a conversion error.

impl<T, U> TryInto<U> for T where
    U: TryFrom<T>, 

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.