pub struct TokenizerBuilder<'al, 'sw, A> { /* private fields */ }
Structure to build a tokenizer with custom settings.

To use the default settings, use the Tokenize implementation on &str directly.

Example

use fst::Set;

use charabia::TokenizerBuilder;

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// create the builder.
let mut builder = TokenizerBuilder::new();

// create a set of stop words.
let stop_words = Set::from_iter(["the"].iter()).unwrap();

// configure the stop words.
builder.stop_words(&stop_words);

// build the tokenizer with the configured settings.
let tokenizer = builder.build();

Implementations

Create a TokenizerBuilder with default settings.

If you don’t plan to set stop_words, prefer TokenizerBuilder::default.

Configure the words that will be classified as TokenKind::StopWord.

Arguments
  • stop_words - a Set of the words to classify as stop words.
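To illustrate what classifying a word as a stop word means, here is a minimal, std-only sketch. It is not charabia's implementation: `Kind` and `classify` are hypothetical names, a `HashSet` stands in for the `fst::Set`, and splitting on whitespace with lowercasing stands in for real segmentation and normalization.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a token kind; not charabia's `TokenKind`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Kind {
    Word,
    StopWord,
}

/// Split `text` on whitespace and tag each word as `StopWord` when its
/// lowercased form is present in `stop_words`, and as `Word` otherwise.
pub fn classify<'t>(text: &'t str, stop_words: &HashSet<&str>) -> Vec<(&'t str, Kind)> {
    text.split_whitespace()
        .map(|word| {
            // Lowercasing is a simplification of real normalization.
            let kind = if stop_words.contains(word.to_lowercase().as_str()) {
                Kind::StopWord
            } else {
                Kind::Word
            };
            (word, kind)
        })
        .collect()
}
```

With a set containing only "the", calling `classify("The quick fox", &stop_words)` tags "The" as `Kind::StopWord` and the other two words as `Kind::Word`.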

Enable or disable the creation of char_map.

Arguments
  • create_char_map - a bool that indicates whether a char_map should be created.

Configure which languages can be used for which script.

Arguments
  • allow_list - a HashMap that associates each script with the languages to which autodetection is limited for that script.
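The effect of an allow list on autodetection can be modelled as a filter over candidate languages. The sketch below assumes a detector that yields scored candidates; `Script`, `Language`, and `pick_language` are hypothetical and only mirror the shape of the documented argument, not charabia's actual detection code.

```rust
use std::collections::HashMap;

/// Hypothetical script identifiers, for illustration only.
#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)]
pub enum Script {
    Latin,
    Cyrillic,
}

/// Hypothetical language identifiers, for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Language {
    Eng,
    Fra,
    Deu,
}

/// Return the highest-scoring candidate language. When `allow_list` has an
/// entry for `script`, only the languages in that entry are considered;
/// scripts without an entry are left unrestricted.
pub fn pick_language(
    script: Script,
    candidates: &[(Language, f64)],
    allow_list: Option<&HashMap<Script, Vec<Language>>>,
) -> Option<Language> {
    let allowed = allow_list.and_then(|map| map.get(&script));
    candidates
        .iter()
        .filter(|(lang, _)| allowed.map_or(true, |langs| langs.contains(lang)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(lang, _)| *lang)
}
```

In this model, restricting `Script::Latin` to French and German means an English candidate is never returned for Latin text, even when it has the highest score.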

Build the configured Tokenizer.
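The configure-then-build flow can be sketched with a small std-only builder. `MiniBuilder` and `MiniTokenizer` are hypothetical types written for this illustration; the assumption that each setter takes `&mut self` and returns `&mut Self` for chaining is modelled on the example above and is not a statement of charabia's exact signatures.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified tokenizer produced by `MiniBuilder`.
pub struct MiniTokenizer<'sw> {
    stop_words: Option<&'sw HashSet<String>>,
    create_char_map: bool,
}

impl<'sw> MiniTokenizer<'sw> {
    /// True when `word` is in the stop-word set the builder was given.
    pub fn is_stop_word(&self, word: &str) -> bool {
        self.stop_words.map_or(false, |set| set.contains(word))
    }

    /// True when the builder enabled char_map creation.
    pub fn creates_char_map(&self) -> bool {
        self.create_char_map
    }
}

/// Hypothetical builder that borrows its stop words for the lifetime `'sw`.
#[derive(Default)]
pub struct MiniBuilder<'sw> {
    stop_words: Option<&'sw HashSet<String>>,
    create_char_map: bool,
}

impl<'sw> MiniBuilder<'sw> {
    /// Start from default settings: no stop words, no char_map.
    pub fn new() -> Self {
        Self::default()
    }

    /// Borrow the stop-word set; it must outlive the built tokenizer.
    pub fn stop_words(&mut self, stop_words: &'sw HashSet<String>) -> &mut Self {
        self.stop_words = Some(stop_words);
        self
    }

    /// Enable or disable char_map creation.
    pub fn create_char_map(&mut self, create_char_map: bool) -> &mut Self {
        self.create_char_map = create_char_map;
        self
    }

    /// Copy the current settings into a tokenizer.
    pub fn build(&self) -> MiniTokenizer<'sw> {
        MiniTokenizer {
            stop_words: self.stop_words,
            create_char_map: self.create_char_map,
        }
    }
}
```

Because the builder only borrows the stop-word set, the set has to be created before the builder is configured and kept alive for as long as the tokenizer is used, which is the role the `'sw` lifetime plays in this sketch.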

Trait Implementations

Returns the “default value” for a type.
