Struct charabia::TokenizerBuilder

pub struct TokenizerBuilder<'tb, A> { /* private fields */ }

Structure to build a tokenizer with custom settings.

To use the default settings, use the Tokenize implementation on &str directly.

§Example

use fst::Set;

use charabia::TokenizerBuilder;

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// create the builder.
let mut builder = TokenizerBuilder::new();

// create a set of stop words.
let stop_words: Set<Vec<u8>> = Set::from_iter(["the"].iter()).unwrap();

// configure stop words.
builder.stop_words(&stop_words);

// build the tokenizer.
let tokenizer = builder.build();

Implementations§

impl<'tb, A> TokenizerBuilder<'tb, A>

pub fn new() -> TokenizerBuilder<'tb, A>

Create a TokenizerBuilder with default settings. If you don’t plan to set stop_words, prefer TokenizerBuilder::default.

impl<'tb, A: AsRef<[u8]>> TokenizerBuilder<'tb, A>

pub fn stop_words(&mut self, stop_words: &'tb Set<A>) -> &mut Self

Configure the words that will be classified as TokenKind::StopWord.

§Arguments

stop_words - a Set of the words to classify as stop words.
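As a rough illustration of what classifying stop words means, the following std-only sketch tags whitespace-separated words with a simplified kind. `Kind`, `classify`, the use of `BTreeSet` in place of `fst::Set`, and the case-insensitive comparison are all assumptions made for this sketch; it is not charabia's implementation.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for a token kind; not charabia's `TokenKind`.
#[derive(Debug, PartialEq)]
enum Kind {
    Word,
    StopWord,
}

/// Tag each whitespace-separated word of `text`.
/// Assumption of this sketch: words are compared in lowercase.
fn classify<'a>(text: &'a str, stop_words: &BTreeSet<String>) -> Vec<(&'a str, Kind)> {
    text.split_whitespace()
        .map(|word| {
            let kind = if stop_words.contains(&word.to_lowercase()) {
                Kind::StopWord
            } else {
                Kind::Word
            };
            (word, kind)
        })
        .collect()
}

fn main() {
    let mut stop_words = BTreeSet::new();
    stop_words.insert("the".to_string());

    for (word, kind) in classify("The quick fox", &stop_words) {
        println!("{}: {:?}", word, kind);
    }
}
```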

pub fn separators(&mut self, separators: &'tb [&'tb str]) -> &mut Self

Configure the strings that will be used to separate words and that will be classified as TokenKind::Separator.

§Arguments

separators - a slice of str to classify as separators.

§Example

use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];

// configure separators.
builder.separators(&separators);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
  &output,
  &["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);

pub fn words_dict(&mut self, words: &'tb [&'tb str]) -> &mut Self

Configure the words that will be segmented before any other segmentation.

This words dictionary overrides the segmentation of these words: the tokenizer finds all occurrences of these words before any language-based segmentation. If some of the words are also in the stop_words list or in the separators list, they will be categorized as TokenKind::StopWord or as TokenKind::Separator as well.

§Arguments

words - a slice of str.

§Example

use charabia::TokenizerBuilder;

// create the builder.
let mut builder = TokenizerBuilder::default();

// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];

// configure words.
builder.words_dict(&words);

// build the tokenizer.
let tokenizer = builder.build();

// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";

let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
  &output,
  &["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);
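The "dictionary first" behaviour can be pictured as a two-pass split. The std-only sketch below is only an illustration under stated assumptions: `cut` and `segment` are hypothetical helpers, the second pass is a plain separator split standing in for the language-based segmentation, and none of this is charabia's actual code.

```rust
/// Cut `text` at every occurrence of one of `patterns`, preferring the longest
/// pattern at each position. Chunks that are a pattern are flagged `true`.
fn cut<'a>(text: &'a str, patterns: &[&str]) -> Vec<(bool, &'a str)> {
    let mut chunks = Vec::new();
    let (mut start, mut i) = (0, 0);
    while i < text.len() {
        let rest = &text[i..];
        let hit = patterns
            .iter()
            .filter(|p| !p.is_empty() && rest.starts_with(**p))
            .map(|p| p.len())
            .max();
        if let Some(len) = hit {
            if start < i {
                chunks.push((false, &text[start..i]));
            }
            chunks.push((true, &text[i..i + len]));
            i += len;
            start = i;
        } else {
            // Advance by one whole character to stay on a char boundary.
            i += rest.chars().next().map_or(1, |c| c.len_utf8());
        }
    }
    if start < text.len() {
        chunks.push((false, &text[start..]));
    }
    chunks
}

/// Pass 1 isolates dictionary words; pass 2 splits only the remaining text.
fn segment<'a>(text: &'a str, words: &[&str], separators: &[&str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for (is_dictionary_word, chunk) in cut(text, words) {
        if is_dictionary_word {
            // Dictionary words are never split further.
            out.push(chunk);
        } else {
            out.extend(cut(chunk, separators).into_iter().map(|(_, piece)| piece));
        }
    }
    out
}

fn main() {
    let words = ["J. R. R.", "Dr.", "J. K."];
    let separators = [" ", ". "];
    let text = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";
    println!("{:?}", segment(text, &words, &separators));
    // Without the dictionary, "Dr." is broken at the ". " separator.
    println!("{:?}", segment("Dr. Seuss", &[], &separators));
}
```

Because the dictionary pass runs first, "J. R. R." stays whole in this sketch even though ". " is also listed as a separator.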

pub fn create_char_map(&mut self, create_char_map: bool) -> &mut Self

Enable or disable the creation of char_map.

§Arguments

create_char_map - a bool that indicates whether a char_map should be created.
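This page does not describe what the char_map holds. The std-only sketch below assumes it records, for each original character, its byte length before and after normalization, so that positions in the normalized text can be mapped back to the original. That reading, the toy normalization, and the helper names `normalize_with_char_map` and `original_len` are assumptions of this sketch, not confirmed details of the library.

```rust
/// Toy normalization for this sketch only: lowercase ASCII letters and map
/// 'é' (U+00E9) to 'e'. Returns the normalized text and, per original
/// character, the pair (original byte length, normalized byte length).
fn normalize_with_char_map(original: &str) -> (String, Vec<(u8, u8)>) {
    let mut normalized = String::new();
    let mut char_map = Vec::new();
    for c in original.chars() {
        let n = match c {
            '\u{e9}' => 'e',
            other => other.to_ascii_lowercase(),
        };
        normalized.push(n);
        char_map.push((c.len_utf8() as u8, n.len_utf8() as u8));
    }
    (normalized, char_map)
}

/// Translate a byte length measured in the normalized text back to the
/// corresponding byte length in the original text.
fn original_len(char_map: &[(u8, u8)], normalized_len: usize) -> usize {
    let mut seen = 0;
    let mut original = 0;
    for &(orig_bytes, norm_bytes) in char_map {
        if seen >= normalized_len {
            break;
        }
        seen += norm_bytes as usize;
        original += orig_bytes as usize;
    }
    original
}

fn main() {
    let (normalized, char_map) = normalize_with_char_map("Caf\u{e9}s");
    println!("{} {:?}", normalized, char_map);
    // The first 4 normalized bytes ("cafe") cover 5 original bytes ("Café").
    println!("{}", original_len(&char_map, 4));
}
```

Skipping the map when such back-references are not needed avoids one allocation per token in this sketch, which is presumably why the option exists.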

pub fn lossy_normalization(&mut self, lossy: bool) -> &mut Self

Enable or disable the lossy normalization.

A lossy normalization is a kind of normalization that could change the meaning in some way. Removing diacritics is considered lossy; for instance, in French the word maïs (corn) will be normalized as mais (but) which changes the meaning.

§Arguments

lossy - a bool that enables or disables the lossy normalization.
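The maïs/mais case can be reproduced with a minimal std-only sketch. The `normalize` and `strip_diacritic` helpers and the small character table are hypothetical and cover only a few Latin letters; the library's own normalizers are not shown here.

```rust
/// Map a handful of accented Latin letters to their base letter.
/// The table is deliberately tiny; it only serves the example.
fn strip_diacritic(c: char) -> char {
    match c {
        '\u{e0}' | '\u{e2}' | '\u{e4}' => 'a',            // à â ä
        '\u{e8}' | '\u{e9}' | '\u{ea}' | '\u{eb}' => 'e', // è é ê ë
        '\u{ee}' | '\u{ef}' => 'i',                       // î ï
        '\u{f4}' | '\u{f6}' => 'o',                       // ô ö
        '\u{f9}' | '\u{fb}' | '\u{fc}' => 'u',            // ù û ü
        other => other,
    }
}

/// Apply the lossy step only when `lossy` is true.
fn normalize(word: &str, lossy: bool) -> String {
    if lossy {
        word.chars().map(strip_diacritic).collect()
    } else {
        word.to_string()
    }
}

fn main() {
    let word = "ma\u{ef}s"; // "maïs" (corn)
    println!("{}", normalize(word, true)); // diacritic removed
    println!("{}", normalize(word, false)); // left untouched
}
```

With the lossy step enabled, a search for "mais" would also match "maïs" in this sketch, which is convenient for recall but merges two different words.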

pub fn allow_list( &mut self, allow_list: &'tb HashMap<Script, Vec<Language>> ) -> &mut Self

Configure which languages can be used for which script.

§Arguments

allow_list - a HashMap that associates each script with the languages that autodetection may select for it.
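To show the shape of such a map, here is a std-only sketch with stand-in `Script` and `Language` enums (not charabia's types) and a hypothetical `restrict` helper. Leaving scripts without an entry unrestricted is an assumption of the sketch.

```rust
use std::collections::HashMap;

/// Stand-in enums for this sketch; the real crate defines its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Latin,
    Cj,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Eng,
    Fra,
    Cmn,
    Jpn,
}

/// Keep only the candidate languages that the allow list permits for `script`.
/// Assumption: a script without an entry is not restricted.
fn restrict(
    script: Script,
    candidates: &[Language],
    allow_list: &HashMap<Script, Vec<Language>>,
) -> Vec<Language> {
    match allow_list.get(&script) {
        Some(allowed) => candidates
            .iter()
            .copied()
            .filter(|language| allowed.contains(language))
            .collect(),
        None => candidates.to_vec(),
    }
}

fn main() {
    let mut allow_list = HashMap::new();
    allow_list.insert(Script::Cj, vec![Language::Jpn]);

    // Only Japanese remains a candidate for the Cj script.
    println!("{:?}", restrict(Script::Cj, &[Language::Cmn, Language::Jpn], &allow_list));
    // Latin has no entry, so both candidates are kept.
    println!("{:?}", restrict(Script::Latin, &[Language::Eng, Language::Fra], &allow_list));
}
```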

pub fn build(&mut self) -> Tokenizer<'_>

Build the configured Tokenizer.

pub fn into_tokenizer(self) -> Tokenizer<'tb>

Build the configured Tokenizer, consuming self.

This method makes it possible to drop the tokenizer builder without having to drop the Tokenizer itself.
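The difference between the two signatures comes down to lifetimes: `Tokenizer<'_>` is tied to the borrow of the builder, while `Tokenizer<'tb>` is tied only to the data the builder refers to. The std-only sketch below mirrors the two signatures with stand-in `Builder` and `Tokenizer` types; their fields and the `make_tokenizer` helper are hypothetical.

```rust
/// Stand-in for the tokenizer: it only borrows its settings.
struct Tokenizer<'a> {
    separators: &'a [&'a str],
}

/// Stand-in for the builder.
struct Builder<'tb> {
    separators: &'tb [&'tb str],
}

impl<'tb> Builder<'tb> {
    /// Borrows the builder: the returned tokenizer cannot outlive it.
    fn build(&mut self) -> Tokenizer<'_> {
        Tokenizer { separators: self.separators }
    }

    /// Consumes the builder: the returned tokenizer depends only on `'tb` data.
    fn into_tokenizer(self) -> Tokenizer<'tb> {
        Tokenizer { separators: self.separators }
    }
}

/// Returning a tokenizer from a function that owns the builder.
fn make_tokenizer<'tb>(separators: &'tb [&'tb str]) -> Tokenizer<'tb> {
    let builder = Builder { separators };
    // `builder.build()` would not compile here: its result borrows the local
    // `builder`, which is dropped when this function returns.
    builder.into_tokenizer()
}

fn main() {
    let separators = [" ", ", "];

    // `build` works while the builder stays alive in the same scope.
    let mut builder = Builder { separators: &separators };
    let borrowed = builder.build();
    println!("{}", borrowed.separators.len());

    // `into_tokenizer` lets the tokenizer escape the builder's scope.
    let owned = make_tokenizer(&separators);
    println!("{}", owned.separators.len());
}
```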

Trait Implementations§

impl Default for TokenizerBuilder<'_, Vec<u8>>

fn default() -> Self

Returns the “default value” for a type.

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

Auto Trait Implementations§

impl<'tb, A> Freeze for TokenizerBuilder<'tb, A>

impl<'tb, A> RefUnwindSafe for TokenizerBuilder<'tb, A>
where A: RefUnwindSafe,

impl<'tb, A> Send for TokenizerBuilder<'tb, A>
where A: Sync,

impl<'tb, A> Sync for TokenizerBuilder<'tb, A>
where A: Sync,

impl<'tb, A> Unpin for TokenizerBuilder<'tb, A>

impl<'tb, A> UnwindSafe for TokenizerBuilder<'tb, A>
where A: RefUnwindSafe,
