pub struct WordPieceTokenizer { /* private fields */ }
WordPiece tokenizer (used by BERT).
WordPiece is similar to BPE but uses a different scoring criterion: instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data. Every subword except the first in a word is prefixed with “##”.
§Algorithm
- Initialize the vocabulary with all characters in the corpus
- Score each adjacent pair by: freq(ab) / (freq(a) * freq(b))
- Merge the pair with the highest score
- Repeat until the target vocabulary size is reached
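The scoring step can be sketched as follows. This is a minimal illustration of the formula above, not the crate's actual implementation: `split_word` and `best_pair` are hypothetical helpers, and breaking ties by lexicographic order is an assumption made only to keep the result deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical helper: split a word into its initial symbols.
/// The first character stays bare; later characters get the "##" prefix.
fn split_word(word: &str) -> Vec<String> {
    word.chars()
        .enumerate()
        .map(|(i, c)| if i == 0 { c.to_string() } else { format!("##{c}") })
        .collect()
}

/// Hypothetical helper: return the adjacent pair with the highest
/// score freq(ab) / (freq(a) * freq(b)), or None if no pair exists.
fn best_pair(words: &[Vec<String>]) -> Option<(String, String)> {
    let mut unit_freq: HashMap<&str, f64> = HashMap::new();
    let mut pair_freq: HashMap<(&str, &str), f64> = HashMap::new();

    for word in words {
        for sym in word {
            *unit_freq.entry(sym.as_str()).or_insert(0.0) += 1.0;
        }
        for pair in word.windows(2) {
            let key = (pair[0].as_str(), pair[1].as_str());
            *pair_freq.entry(key).or_insert(0.0) += 1.0;
        }
    }

    let mut best: Option<((&str, &str), f64)> = None;
    for (&(a, b), &f) in &pair_freq {
        let score = f / (unit_freq[a] * unit_freq[b]);
        // Assumption: ties are broken by the lexicographically smaller pair.
        let better = match best {
            None => true,
            Some((bp, bs)) => score > bs || (score == bs && (a, b) < bp),
        };
        if better {
            best = Some(((a, b), score));
        }
    }
    best.map(|((a, b), _)| (a.to_string(), b.to_string()))
}
```

For the words ab, ab, cb, cd, the pair (c, ##d) scores 1 / (2 * 1) = 0.5 and beats the more frequent pair (a, ##b), which scores 2 / (2 * 3) ≈ 0.33. Plain frequency-based BPE would merge (a, ##b) first.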
§Examples
use aprender::text::tokenize::WordPieceTokenizer;
let corpus = vec!["playing", "played", "player", "plays"];
let tokenizer = WordPieceTokenizer::train(&corpus, 50).expect("train");
let tokens = tokenizer.encode("playing").expect("encode");
assert!(!tokens.is_empty());
§References
- Wu et al. (2016): Google’s Neural Machine Translation System
- Devlin et al. (2019): BERT: Pre-training of Deep Bidirectional Transformers
Implementations§
impl WordPieceTokenizer
pub fn train(corpus: &[&str], vocab_size: usize) -> Result<Self, AprenderError>
Train a WordPiece tokenizer on the given corpus.
§Arguments
- corpus - Slice of text documents to train on
- vocab_size - Target vocabulary size
§Examples
use aprender::text::tokenize::WordPieceTokenizer;
let corpus = vec!["unbelievable", "believable", "believe"];
let tokenizer = WordPieceTokenizer::train(&corpus, 100).expect("train");
pub fn from_vocab(vocab: HashMap<String, u32>) -> Self
Create a tokenizer from a pre-built vocabulary.
pub fn encode(&self, text: &str) -> Result<Vec<u32>, AprenderError>
Encode text to token IDs using greedy longest-match-first.
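Greedy longest-match-first can be sketched for a single word as below. `encode_word` is a hypothetical standalone helper, not the crate's method. Returning `None` for an unmatched word is an assumption; the real `encode` may instead map such words to an unknown token or return an error.

```rust
use std::collections::HashMap;

/// Hypothetical helper: split one word into vocabulary IDs by repeatedly
/// taking the longest prefix of the remaining characters that is in `vocab`.
/// Pieces after the first are looked up with the "##" prefix.
fn encode_word(word: &str, vocab: &HashMap<String, u32>) -> Option<Vec<u32>> {
    let chars: Vec<char> = word.chars().collect();
    let mut ids = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate from the right until it matches.
        while end > start {
            let piece: String = chars[start..end].iter().collect();
            let candidate = if start == 0 { piece } else { format!("##{piece}") };
            if let Some(&id) = vocab.get(&candidate) {
                found = Some(id);
                break;
            }
            end -= 1;
        }
        // Assumption: an unmatched remainder makes the whole word fail.
        ids.push(found?);
        start = end;
    }
    Some(ids)
}
```

With a vocabulary containing play and ##ing, the word playing is split into those two pieces, because play is the longest matching prefix and ##ing covers the remainder.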
pub fn decode(&self, ids: &[u32]) -> Result<String, AprenderError>
Decode token IDs back to text.
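One way decoding could work, given the “##” convention, is sketched below. `decode_ids` is a hypothetical standalone helper. Joining words with a single space and returning `None` for unknown IDs are both assumptions; the crate's `decode` may behave differently.

```rust
use std::collections::HashMap;

/// Hypothetical helper: map IDs back to tokens and glue "##" pieces
/// onto the preceding token. Tokens without "##" start a new word.
fn decode_ids(ids: &[u32], vocab: &HashMap<String, u32>) -> Option<String> {
    // Invert the token -> id map so IDs can be looked up.
    let id_to_token: HashMap<u32, &str> =
        vocab.iter().map(|(t, &id)| (id, t.as_str())).collect();

    let mut out = String::new();
    for id in ids {
        // Assumption: an ID missing from the vocabulary fails the decode.
        let token = *id_to_token.get(id)?;
        match token.strip_prefix("##") {
            Some(rest) => out.push_str(rest),
            None => {
                if !out.is_empty() {
                    out.push(' ');
                }
                out.push_str(token);
            }
        }
    }
    Some(out)
}
```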
pub fn vocab_size(&self) -> usize
Get vocabulary size.
Trait Implementations§
impl Clone for WordPieceTokenizer
fn clone(&self) -> WordPieceTokenizer
Returns a duplicate of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for WordPieceTokenizer
Auto Trait Implementations§
impl Freeze for WordPieceTokenizer
impl RefUnwindSafe for WordPieceTokenizer
impl Send for WordPieceTokenizer
impl Sync for WordPieceTokenizer
impl Unpin for WordPieceTokenizer
impl UnwindSafe for WordPieceTokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> CloneToUninit for T where T: Clone
unsafe fn clone_to_uninit(&self, dest: *mut u8)
🔬 This is a nightly-only experimental API. (clone_to_uninit)
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.