pub struct WordPieceTokenizer { /* private fields */ }
A WordPiece tokenizer (BERT-style).
WordPiece is a subword tokenization algorithm that greedily matches the
longest prefix of a word from the vocabulary, using a continuation prefix
(typically "##") for non-initial subwords.
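The greedy longest-prefix matching described above can be sketched in a few lines. This is a minimal standalone illustration, not the crate's implementation: the function name `wordpiece_split` and its parameters are hypothetical, and mapping a word with no matching piece to the unknown token follows BERT's convention, which is assumed here rather than confirmed for this type.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of scirs2_text): split one word into
/// WordPiece tokens by greedy longest-match-first lookup.
fn wordpiece_split(
    word: &str,
    vocab: &HashMap<String, u32>,
    prefix: &str,
    unk: &str,
) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest remaining substring first, then shrink it.
        while end > start {
            let body: String = chars[start..end].iter().collect();
            // Non-initial pieces are looked up with the continuation prefix.
            let candidate = if start == 0 {
                body
            } else {
                format!("{prefix}{body}")
            };
            if vocab.contains_key(&candidate) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // Assumption: if no piece matches, the whole word becomes `unk`.
            None => return vec![unk.to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashMap<String, u32> = ["[UNK]", "hel", "##lo"]
        .iter()
        .enumerate()
        .map(|(i, t)| (t.to_string(), i as u32))
        .collect();
    println!("{:?}", wordpiece_split("hello", &vocab, "##", "[UNK]"));
    println!("{:?}", wordpiece_split("hellp", &vocab, "##", "[UNK]"));
}
```

With this vocabulary, "hello" splits into "hel" and "##lo"; if "hello" itself were in the vocabulary, the longest match would win and the word would stay whole.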
Example
use scirs2_text::tokenizer::WordPieceTokenizer;
use std::collections::HashMap;
let mut vocab = HashMap::new();
vocab.insert("[UNK]".to_string(), 0);
vocab.insert("hello".to_string(), 1);
vocab.insert("world".to_string(), 2);
vocab.insert("hel".to_string(), 3);
vocab.insert("##lo".to_string(), 4);
let tokenizer = WordPieceTokenizer::new(vocab);
let tokens = tokenizer.tokenize("hello world");
assert!(tokens.contains(&"hello".to_string()) || tokens.contains(&"hel".to_string()));
Implementations
impl WordPieceTokenizer
pub fn new(vocab: HashMap<String, u32>) -> Self
Create a new WordPiece tokenizer from a vocabulary map.
The vocabulary must contain at least [UNK]. The continuation prefix
defaults to "##" and max word length defaults to 200.
pub fn with_max_word_len(self, max_len: usize) -> Self
Set the maximum word length.
pub fn with_unk_token(self, unk: &str) -> Self
Set the unknown token string.
pub fn with_continuing_prefix(self, prefix: &str) -> Self
Set the continuation prefix (default "##").
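The three `with_*` methods take `self` by value and return `Self`, so they can be chained onto a constructor. The sketch below shows that consuming-builder shape with a hypothetical stand-in type, `SketchConfig`, so that it compiles without the crate. Its fields are assumptions for illustration, and "[UNK]" as the default unknown token is also assumed; only the "##" prefix and the 200 word-length default come from the documentation of `new`.

```rust
/// Hypothetical stand-in with the same method shapes as the documented
/// `with_*` setters; it is not the crate's type.
#[derive(Debug, Clone)]
struct SketchConfig {
    max_word_len: usize,
    unk_token: String,
    continuing_prefix: String,
}

impl SketchConfig {
    fn new() -> Self {
        Self {
            max_word_len: 200,               // documented default
            unk_token: "[UNK]".to_string(),  // assumed default
            continuing_prefix: "##".to_string(), // documented default
        }
    }

    // Each setter consumes the value, updates one field, and hands it back.
    fn with_max_word_len(mut self, max_len: usize) -> Self {
        self.max_word_len = max_len;
        self
    }

    fn with_unk_token(mut self, unk: &str) -> Self {
        self.unk_token = unk.to_string();
        self
    }

    fn with_continuing_prefix(mut self, prefix: &str) -> Self {
        self.continuing_prefix = prefix.to_string();
        self
    }
}

fn main() {
    let config = SketchConfig::new()
        .with_max_word_len(100)
        .with_unk_token("<unk>")
        .with_continuing_prefix("@@");
    println!("{config:?}");
}
```

Given the signatures listed here, the same chain should apply to the real type, along the lines of `WordPieceTokenizer::new(vocab).with_max_word_len(100)`. Because each setter consumes `self`, the result has to be rebound or chained rather than called on an existing binding for its side effect.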
pub fn from_vocab_file(path: &Path) -> Result<Self>
Load a WordPiece vocabulary from a text file (one token per line).
Token IDs are assigned sequentially starting from 0.
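The file format described above (one token per line, IDs counted from 0) can be illustrated with a small parser over in-memory text. This is a sketch under stated assumptions: `parse_vocab` is a hypothetical helper, not a function of this crate, and how the real loader treats blank lines, surrounding whitespace, or duplicate tokens is not documented here.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of scirs2_text): build a token -> ID map
/// from vocabulary text, using each line's zero-based index as its ID.
fn parse_vocab(text: &str) -> HashMap<String, u32> {
    text.lines()
        .enumerate()
        .map(|(id, line)| (line.to_string(), id as u32))
        .collect()
}

fn main() {
    // Contents a vocabulary file might hold, one token per line.
    let text = "[UNK]\nhello\nworld\n##lo\n";
    let vocab = parse_vocab(text);
    assert_eq!(vocab["[UNK]"], 0);
    assert_eq!(vocab["hello"], 1);
    assert_eq!(vocab["##lo"], 3);
    println!("{} tokens loaded", vocab.len());
}
```

A consequence of this scheme is that reordering lines in the file changes every token's ID, so a vocabulary file should be kept in the same order as the one used to train any model that consumes those IDs.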
Trait Implementations
impl Clone for WordPieceTokenizer
fn clone(&self) -> WordPieceTokenizer
Returns a duplicate of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for WordPieceTokenizer
Auto Trait Implementations
impl Freeze for WordPieceTokenizer
impl RefUnwindSafe for WordPieceTokenizer
impl Send for WordPieceTokenizer
impl Sync for WordPieceTokenizer
impl Unpin for WordPieceTokenizer
impl UnsafeUnpin for WordPieceTokenizer
impl UnwindSafe for WordPieceTokenizer
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pointable for T
impl<SS, SP> SupersetOf<SS> for SP
where
    SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.