Struct charabia::TokenizerBuilder

pub struct TokenizerBuilder<'tb, A> { /* private fields */ }

Structure to build a tokenizer with custom settings.

To use the default settings, use the Tokenize implementation on &str directly.

§Example

use fst::Set;

use charabia::TokenizerBuilder;

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// create the builder.
let mut builder = TokenizerBuilder::new();

// create a set of stop words.
let stop_words: Set<Vec<u8>> = Set::from_iter(["the"].iter()).unwrap();

// configure stop words.
builder.stop_words(&stop_words);

// build the tokenizer.
let tokenizer = builder.build();

Implementations§

impl<'tb, A> TokenizerBuilder<'tb, A>

pub fn new() -> TokenizerBuilder<'tb, A>

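Create a TokenizerBuilder with default settings.

If you don’t plan to set stop_words, prefer TokenizerBuilder::default, which is implemented for TokenizerBuilder<'_, Vec<u8>> and therefore fixes the type parameter A for you.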
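The difference between a generic constructor and a single Default implementation can be sketched with a stand-in type. MiniBuilder and the two helper functions below are hypothetical and are not part of charabia; they only illustrate why a generic `new` needs the type parameter pinned down while `default` does not.

```rust
// Hypothetical stand-in for a builder that is generic over its stop-word type.
pub struct MiniBuilder<A> {
    pub stop_words: Option<Vec<A>>,
}

impl<A> MiniBuilder<A> {
    // Generic constructor: `A` is only known once the caller pins it down.
    pub fn new() -> Self {
        MiniBuilder { stop_words: None }
    }
}

// A single `Default` impl fixes `A` to `Vec<u8>`.
impl Default for MiniBuilder<Vec<u8>> {
    fn default() -> Self {
        MiniBuilder::new()
    }
}

// With `new`, something must determine `A`: here, the declared return type.
pub fn with_new() -> MiniBuilder<Vec<u8>> {
    MiniBuilder::new()
}

// With `default`, no annotation is written: the only `Default` impl
// available for `MiniBuilder` determines `A`.
pub fn with_default() -> usize {
    let builder = MiniBuilder::default();
    builder.stop_words.map_or(0, |words| words.len())
}
```

In this sketch, calling `MiniBuilder::new()` in a context that never mentions `A` would be rejected by the compiler for lack of type information, which is the situation the advice above avoids.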

impl<'tb, A: AsRef<[u8]>> TokenizerBuilder<'tb, A>

pub fn stop_words(&mut self, stop_words: &'tb Set<A>) -> &mut Self

Configure the words that will be classified as TokenKind::StopWord.

§Arguments
  • stop_words - a Set of the words to classify as stop words.
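As a rough illustration of classification by set membership, the sketch below uses only the standard library. The Kind enum and classify function are hypothetical stand-ins, a HashSet replaces the fst Set, and the comparison against a lowercased form is an assumption of this sketch rather than a statement about how charabia matches stop words.

```rust
use std::collections::HashSet;

// Hypothetical, reduced stand-in for a token kind.
#[derive(Debug, PartialEq)]
pub enum Kind {
    Word,
    StopWord,
}

// Tags each word as a stop word when its lowercased form is in the set.
// (Lowercasing before lookup is an assumption made for this sketch.)
pub fn classify<'a>(words: &[&'a str], stop_words: &HashSet<&str>) -> Vec<(&'a str, Kind)> {
    words
        .iter()
        .map(|word| {
            let lowered = word.to_lowercase();
            let kind = if stop_words.contains(lowered.as_str()) {
                Kind::StopWord
            } else {
                Kind::Word
            };
            (*word, kind)
        })
        .collect()
}
```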

pub fn separators(&mut self, separators: &'tb [&'tb str]) -> &mut Self

Configure the strings that will be used to separate words and that will be classified as TokenKind::Separator.

§Arguments
  • separators - a slice of str to classify as separators.
§Example
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];

// configure separators.
builder.separators(&separators);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
  &output,
  &["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);

pub fn words_dict(&mut self, words: &'tb [&'tb str]) -> &mut Self

Configure the words that will be segmented before any other segmentation.

This dictionary overrides the segmentation of the listed words: the tokenizer finds all occurrences of these words before any language-based segmentation. If some of the words are also in the stop words list or in the separators list, they will be categorized as TokenKind::StopWord or as TokenKind::Separator as well.

§Arguments
  • words - a slice of str.
§Example
use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];

// configure words.
builder.words_dict(&words);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";

let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
  &output,
  &["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);
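The idea of matching dictionary words before any other segmentation can be sketched with the standard library alone. The function split_on_dictionary below is hypothetical and is not charabia's implementation; it keeps every dictionary occurrence as one segment and hands back the text in between untouched, where a real tokenizer would go on to segment it further.

```rust
/// Splits `text` so that each occurrence of a dictionary entry is one segment.
/// Text between entries is returned as-is, to be segmented by a later pass.
pub fn split_on_dictionary<'a>(text: &'a str, words: &[&str]) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Earliest match in the remaining text; the longest entry wins on ties.
        let mut best: Option<(usize, usize)> = None; // (start, length) in bytes
        for word in words.iter().filter(|w| !w.is_empty()) {
            if let Some(start) = rest.find(*word) {
                let better = match best {
                    None => true,
                    Some((s, l)) => start < s || (start == s && word.len() > l),
                };
                if better {
                    best = Some((start, word.len()));
                }
            }
        }
        match best {
            Some((start, len)) => {
                if start > 0 {
                    segments.push(&rest[..start]);
                }
                segments.push(&rest[start..start + len]);
                rest = &rest[start + len..];
            }
            None => {
                // No dictionary word left: keep the remainder as one segment.
                segments.push(rest);
                rest = "";
            }
        }
    }
    segments
}
```

Unlike the tokenizer in the example above, this sketch does not split " Tolkien. " any further; it only shows the dictionary pass.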

pub fn create_char_map(&mut self, create_char_map: bool) -> &mut Self

Enable or disable the creation of char_map.

§Arguments
  • create_char_map - a bool that indicates whether a char_map should be created.
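To give an idea of what a char map can record, here is a minimal sketch. It assumes that a char map pairs, for each original character, its byte length with the byte length of its normalized form, so that positions in the normalized text can be traced back to the original. The normalizer and both function names are hypothetical and are not charabia's API.

```rust
// Hypothetical normalizer for this sketch only: lowercases ASCII letters
// and strips the diacritic from a few accented letters.
fn normalize_char(c: char) -> String {
    match c {
        '\u{e9}' | '\u{e8}' | '\u{ea}' => "e".to_string(), // e with acute, grave, circumflex
        '\u{ef}' | '\u{ee}' => "i".to_string(),            // i with diaeresis, circumflex
        other => other.to_ascii_lowercase().to_string(),
    }
}

/// Returns the normalized text and, per original character, the pair
/// (original byte length, normalized byte length).
pub fn normalize_with_char_map(text: &str) -> (String, Vec<(u8, u8)>) {
    let mut normalized = String::new();
    let mut char_map = Vec::new();
    for c in text.chars() {
        let n = normalize_char(c);
        char_map.push((c.len_utf8() as u8, n.len() as u8));
        normalized.push_str(&n);
    }
    (normalized, char_map)
}
```

Building such a map costs time and memory for every token, which is presumably why its creation can be switched off when the mapping back to the original text is not needed.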

pub fn lossy_normalization(&mut self, lossy: bool) -> &mut Self

Enable or disable the lossy normalization.

A lossy normalization is a kind of normalization that could change the meaning in some way. Removing diacritics is considered lossy; for instance, in French the word maïs (corn) will be normalized as mais (but) which changes the meaning.

§Arguments
  • lossy - a bool that enables or disables the lossy normalization.
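The maïs/mais collision described above can be reproduced with a small standalone sketch. The functions strip_diacritics and normalize are hypothetical, and the hand-written mapping covers only a few French letters; it stands in for real Unicode-based normalization and is not how charabia implements it.

```rust
// Replaces a few accented French letters with their base letter.
// A hand-written table used only for this sketch.
pub fn strip_diacritics(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            '\u{e0}' | '\u{e2}' => 'a',                       // a with grave, circumflex
            '\u{e9}' | '\u{e8}' | '\u{ea}' | '\u{eb}' => 'e', // e with acute, grave, circumflex, diaeresis
            '\u{ef}' | '\u{ee}' => 'i',                       // i with diaeresis, circumflex
            '\u{f4}' => 'o',                                  // o with circumflex
            '\u{f9}' | '\u{fb}' | '\u{fc}' => 'u',            // u with grave, circumflex, diaeresis
            other => other,
        })
        .collect()
}

// Applies the lossy step only when `lossy` is true.
pub fn normalize(word: &str, lossy: bool) -> String {
    if lossy {
        strip_diacritics(word)
    } else {
        word.to_string()
    }
}
```

With the lossy step enabled, two distinct words end up with the same normalized form; with it disabled, they stay distinguishable.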

pub fn allow_list( &mut self, allow_list: &'tb HashMap<Script, Vec<Language>> ) -> &mut Self

Configure which languages can be used for which script.

§Arguments
  • allow_list - a HashMap that associates each script with the languages that autodetection is limited to for that script.
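How such a map might restrict detection can be sketched as follows. The Script and Language enums here are small hypothetical stand-ins for the real types, and restrict_candidates is not a charabia function. Leaving the candidates untouched when a script has no entry is an assumption of this sketch.

```rust
use std::collections::HashMap;

// Hypothetical, reduced stand-ins for the real enums.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    French,
    German,
    Russian,
}

/// Keeps only the candidate languages allowed for `script`.
/// A script without an entry is left unrestricted (assumption of this sketch).
pub fn restrict_candidates(
    script: Script,
    candidates: &[Language],
    allow_list: &HashMap<Script, Vec<Language>>,
) -> Vec<Language> {
    match allow_list.get(&script) {
        Some(allowed) => candidates
            .iter()
            .copied()
            .filter(|language| allowed.contains(language))
            .collect(),
        None => candidates.to_vec(),
    }
}
```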

pub fn build(&mut self) -> Tokenizer<'_>

Build the configured Tokenizer.

pub fn into_tokenizer(self) -> Tokenizer<'tb>

Build the configured Tokenizer, consuming self.

This method allows the tokenizer builder to be dropped without having to drop the Tokenizer itself.
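The practical difference between the two signatures, build(&mut self) -> Tokenizer<'_> and into_tokenizer(self) -> Tokenizer<'tb>, shows up when a tokenizer has to outlive its builder, for example when it is returned from a function. The sketch below mirrors only the lifetimes of those signatures; MiniBuilder, MiniTokenizer and make_tokenizer are hypothetical stand-ins, not charabia's types.

```rust
// Hypothetical stand-ins that mirror only the lifetimes of the signatures.
pub struct MiniBuilder<'tb> {
    separators: &'tb [&'tb str],
}

pub struct MiniTokenizer<'a> {
    separators: &'a [&'a str],
}

impl<'tb> MiniBuilder<'tb> {
    pub fn new(separators: &'tb [&'tb str]) -> Self {
        MiniBuilder { separators }
    }

    // Borrows the builder: the result cannot outlive `self`.
    pub fn build(&mut self) -> MiniTokenizer<'_> {
        MiniTokenizer { separators: self.separators }
    }

    // Consumes the builder: the result only depends on the data borrowed for 'tb.
    pub fn into_tokenizer(self) -> MiniTokenizer<'tb> {
        MiniTokenizer { separators: self.separators }
    }
}

impl MiniTokenizer<'_> {
    pub fn is_separator(&self, s: &str) -> bool {
        self.separators.iter().any(|sep| *sep == s)
    }
}

pub fn make_tokenizer<'tb>(separators: &'tb [&'tb str]) -> MiniTokenizer<'tb> {
    let builder = MiniBuilder::new(separators);
    // Returning `builder.build()` would be rejected by the borrow checker,
    // because that value borrows the local `builder`. `into_tokenizer`
    // moves the builder instead, so the result can leave this function.
    builder.into_tokenizer()
}
```

When the builder and the tokenizer live in the same scope, the borrowing variant is sufficient and lets the builder be reused afterwards.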

Trait Implementations§

impl Default for TokenizerBuilder<'_, Vec<u8>>

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations§

impl<'tb, A> Freeze for TokenizerBuilder<'tb, A>

impl<'tb, A> RefUnwindSafe for TokenizerBuilder<'tb, A>
where A: RefUnwindSafe,

impl<'tb, A> Send for TokenizerBuilder<'tb, A>
where A: Sync,

impl<'tb, A> Sync for TokenizerBuilder<'tb, A>
where A: Sync,

impl<'tb, A> Unpin for TokenizerBuilder<'tb, A>

impl<'tb, A> UnwindSafe for TokenizerBuilder<'tb, A>
where A: RefUnwindSafe,

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.