Struct charabia::TokenizerBuilder
pub struct TokenizerBuilder<'tb, A> { /* private fields */ }
Structure to build a tokenizer with custom settings.
To use the default settings, call the Tokenize implementation on &str directly.
§Example
use fst::Set;
use charabia::TokenizerBuilder;
// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
// create the builder.
let mut builder = TokenizerBuilder::new();
// create a set of stop words.
let stop_words: Set<Vec<u8>> = Set::from_iter(["the"].iter()).unwrap();
// configure stop words.
builder.stop_words(&stop_words);
// build the tokenizer.
let tokenizer = builder.build();
Implementations§
impl<'tb, A> TokenizerBuilder<'tb, A>
pub fn new() -> TokenizerBuilder<'tb, A>
Create a TokenizerBuilder with default settings.
If you don’t plan to set stop_words, prefer TokenizerBuilder::default.
impl<'tb, A: AsRef<[u8]>> TokenizerBuilder<'tb, A>
pub fn stop_words(&mut self, stop_words: &'tb Set<A>) -> &mut Self
Configure the words that will be classified as TokenKind::StopWord.
§Arguments
stop_words - a Set of the words to classify as stop words.
pub fn separators(&mut self, separators: &'tb [&'tb str]) -> &mut Self
Configure the strings that will be used to separate words and classified as TokenKind::Separator.
§Arguments
separators - a slice of str to classify as separators.
§Example
use charabia::TokenizerBuilder;
// create the builder.
let mut builder = TokenizerBuilder::default();
// create a custom list of separators.
let separators = [" ", ", ", ". ", "?", "!"];
// configure separators.
builder.separators(&separators);
// build the tokenizer; the text is passed to segment_str below.
let tokenizer = builder.build();
// text to tokenize.
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
&output,
&["The", " ", "quick", " ", "(\"brown\")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?", " ", "Brr", ", ", "it's", " ", "29.3°F", "!"]
);
pub fn words_dict(&mut self, words: &'tb [&'tb str]) -> &mut Self
Configure the words that will be segmented before any other segmentation.
This dictionary overrides the segmentation of these words:
the tokenizer finds all occurrences of these words before any language-based segmentation.
If some of the words are also in the stop_words list or in the separators list,
they will be categorized as TokenKind::StopWord or TokenKind::Separator as well.
§Arguments
words - a slice of str.
§Example
use charabia::TokenizerBuilder;
// create the builder.
let mut builder = TokenizerBuilder::default();
// create a custom list of words.
let words = ["J. R. R.", "Dr.", "J. K."];
// configure the words dictionary.
builder.words_dict(&words);
// build the tokenizer; the text is passed to segment_str below.
let tokenizer = builder.build();
// text to tokenize.
let orig = "J. R. R. Tolkien. J. K. Rowling. Dr. Seuss";
let output: Vec<_> = tokenizer.segment_str(orig).collect();
assert_eq!(
&output,
&["J. R. R.", " ", "Tolkien", ". ", "J. K.", " ", "Rowling", ". ", "Dr.", " ", "Seuss"]
);
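The idea of finding dictionary words before any other segmentation can be sketched with the standard library alone. This is a minimal illustration, not charabia's implementation: split_on_dictionary is a hypothetical helper, and the choice of the longest match at each position is an assumption made for the sketch. The text between dictionary words is returned untouched, as it would be handed to a later segmentation pass.

```rust
/// Splits `text` so that every occurrence of a dictionary word becomes its own
/// piece; the text between occurrences is left whole for a later pass.
/// Hypothetical helper, not part of charabia.
fn split_on_dictionary<'t>(text: &'t str, words: &[&str]) -> Vec<&'t str> {
    let mut pieces = Vec::new();
    let mut pending = 0; // start of the text that has not been emitted yet
    let mut pos = 0;
    while pos < text.len() {
        let rest = &text[pos..];
        // Assumption: prefer the longest dictionary word starting here.
        let hit = words
            .iter()
            .filter(|w| !w.is_empty() && rest.starts_with(**w))
            .max_by_key(|w| w.len());
        match hit {
            Some(word) => {
                if pending < pos {
                    pieces.push(&text[pending..pos]);
                }
                pieces.push(&text[pos..pos + word.len()]);
                pos += word.len();
                pending = pos;
            }
            None => {
                // Advance by one whole character to stay on a UTF-8 boundary.
                pos += rest.chars().next().map_or(1, char::len_utf8);
            }
        }
    }
    if pending < text.len() {
        pieces.push(&text[pending..]);
    }
    pieces
}

fn main() {
    let words = ["J. R. R.", "Dr.", "J. K."];
    let pieces = split_on_dictionary("J. R. R. Tolkien. J. K. Rowling. Dr. Seuss", &words);
    println!("{:?}", pieces);
}
```

Because the dictionary pass runs first, the periods inside "J. R. R." are never seen by a separator-based pass, which is why the entry survives as a single piece.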
pub fn create_char_map(&mut self, create_char_map: bool) -> &mut Self
Enable or disable the creation of char_map.
§Arguments
create_char_map - a bool that indicates whether a char_map should be created.
pub fn lossy_normalization(&mut self, lossy: bool) -> &mut Self
Enable or disable lossy normalization.
A lossy normalization is one that may change the meaning of the text.
Removing diacritics is considered lossy; for instance, in French the word maïs (corn) is normalized to mais (but), which changes the meaning.
§Arguments
lossy - a bool that enables or disables lossy normalization.
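Why diacritic removal counts as lossy can be shown with a small standard-library sketch. The strip_diacritics function and its tiny character table are hypothetical and for illustration only; charabia's own normalizers are not reproduced here.

```rust
/// Replaces a handful of accented Latin letters with their base letter.
/// The table is deliberately tiny and only serves the illustration.
fn strip_diacritics(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            'à' | 'â' | 'ä' => 'a',
            'é' | 'è' | 'ê' | 'ë' => 'e',
            'î' | 'ï' => 'i',
            'ô' | 'ö' => 'o',
            'ù' | 'û' | 'ü' => 'u',
            'ç' => 'c',
            other => other,
        })
        .collect()
}

fn main() {
    // Two distinct French words collapse to the same normalized form,
    // so the original word cannot be recovered afterwards.
    assert_eq!(strip_diacritics("maïs"), "mais");
    assert_eq!(strip_diacritics("mais"), "mais");
}
```

Once both words map to the same string, a search for one also matches the other, which is the trade-off this setting controls.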
pub fn allow_list( &mut self, allow_list: &'tb HashMap<Script, Vec<Language>> ) -> &mut Self
Configure which languages can be used for which script.
§Arguments
allow_list - a HashMap that associates each script with the languages autodetection is limited to.
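The effect of an allow list on autodetection can be sketched as a filter over detection candidates. Everything below is an assumption made for illustration: the Script and Language enums are stand-ins for charabia's types, pick_language is a hypothetical helper, the scores are made up, and treating a script without an entry as unrestricted is a choice of this sketch rather than documented behavior.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for charabia's `Script` and `Language` enums.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Eng,
    Fra,
    Rus,
}

/// Picks the best-scoring candidate among the languages allowed for `script`.
/// Assumption: a script with no entry in the allow list is left unrestricted.
fn pick_language(
    script: Script,
    candidates: &[(Language, f32)],
    allow_list: &HashMap<Script, Vec<Language>>,
) -> Option<Language> {
    let allowed = allow_list.get(&script);
    candidates
        .iter()
        .filter(|(lang, _)| allowed.map_or(true, |langs| langs.contains(lang)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(lang, _)| *lang)
}

fn main() {
    let mut allow_list = HashMap::new();
    allow_list.insert(Script::Latin, vec![Language::Fra]);

    // English scores higher, but only French is allowed for Latin script here.
    let candidates = [(Language::Eng, 0.6), (Language::Fra, 0.4)];
    println!("{:?}", pick_language(Script::Latin, &candidates, &allow_list));

    // Cyrillic has no entry, so the sketch leaves it unrestricted.
    println!("{:?}", pick_language(Script::Cyrillic, &[(Language::Rus, 0.9)], &allow_list));
}
```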
pub fn into_tokenizer(self) -> Tokenizer<'tb>
Build the configured Tokenizer, consuming self.
This method makes it possible to drop the tokenizer builder without having to drop the Tokenizer itself.
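The difference between a borrowing build and a consuming conversion can be sketched with std::borrow::Cow. The Options, Builder, and Splitter types below are hypothetical stand-ins, and this is only one way such an API could be structured; it is not taken from charabia's source.

```rust
use std::borrow::Cow;

// Hypothetical settings type standing in for the builder's private fields.
#[derive(Clone, Default)]
struct Options {
    separators: Vec<char>,
}

#[derive(Default)]
struct Builder {
    options: Options,
}

// The splitter either borrows the builder's options or owns them.
struct Splitter<'o> {
    options: Cow<'o, Options>,
}

impl Builder {
    fn separators(&mut self, separators: &[char]) -> &mut Self {
        self.options.separators = separators.to_vec();
        self
    }

    /// Borrows from the builder, so the builder must outlive the result.
    fn build(&self) -> Splitter<'_> {
        Splitter { options: Cow::Borrowed(&self.options) }
    }

    /// Consumes the builder, so the result no longer depends on it.
    fn into_splitter(self) -> Splitter<'static> {
        Splitter { options: Cow::Owned(self.options) }
    }
}

impl Splitter<'_> {
    fn split<'t>(&self, text: &'t str) -> Vec<&'t str> {
        text.split(|c: char| self.options.separators.contains(&c))
            .filter(|word| !word.is_empty())
            .collect()
    }
}

/// Returning `builder.build()` here would not compile, because the result
/// would borrow a local variable; `into_splitter` moves the options out instead.
fn make_splitter() -> Splitter<'static> {
    let mut builder = Builder::default();
    builder.separators(&[' ', ',']);
    builder.into_splitter()
}

fn main() {
    let splitter = make_splitter();
    println!("{:?}", splitter.split("one, two three"));

    // The borrowing variant works as long as the builder stays alive.
    let mut builder = Builder::default();
    builder.separators(&['-']);
    println!("{:?}", builder.build().split("a-b"));
}
```

The consuming variant is the one to reach for when the configured value has to be returned from a function or stored in a struct that should not also hold the builder.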
Trait Implementations§
Auto Trait Implementations§
impl<'tb, A> Freeze for TokenizerBuilder<'tb, A>
impl<'tb, A> RefUnwindSafe for TokenizerBuilder<'tb, A> where
A: RefUnwindSafe,
impl<'tb, A> Send for TokenizerBuilder<'tb, A> where
A: Sync,
impl<'tb, A> Sync for TokenizerBuilder<'tb, A> where
A: Sync,
impl<'tb, A> Unpin for TokenizerBuilder<'tb, A>
impl<'tb, A> UnwindSafe for TokenizerBuilder<'tb, A> where
A: RefUnwindSafe,
Blanket Implementations§
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true.
Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self> otherwise.