pub struct BertTokenizer { /* private fields */ }
BERT-style tokenizer combining basic tokenization and WordPiece subword segmentation.
Special tokens:
- [CLS] (classification): prepended to every encoded sequence
- [SEP] (separator): appended after each segment
- [MASK] (masking): placeholder for masked-language-model pre-training
- [PAD] (padding): used to fill sequences to a target length
- [UNK] (unknown): substituted for tokens not present in the vocabulary
Example
use std::collections::HashMap;
use scirs2_text::tokenizers::bert::BertTokenizer;

let mut vocab: HashMap<String, u32> = HashMap::new();
for (i, tok) in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                 "hello", "world", "##ing", "play", "##ed"].iter().enumerate() {
    vocab.insert(tok.to_string(), i as u32);
}
let tokenizer = BertTokenizer::new(vocab, true);
let ids = tokenizer.encode("Hello World").unwrap();
assert_eq!(ids[0], tokenizer.cls_token_id());

Implementations
impl BertTokenizer
pub fn new(vocab: HashMap<String, u32>, lowercase: bool) -> Self
Build a BertTokenizer from a token → id vocabulary map.
All five special tokens ([PAD], [UNK], [CLS], [SEP], [MASK])
are inserted into the vocabulary if absent.
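As a rough illustration of "inserted if absent", the sketch below adds each missing special token under the next unused ID. The helper name `ensure_special_tokens` and the ID-assignment rule (one past the largest existing ID) are assumptions made for this example; the page does not say which IDs the crate assigns.

```rust
use std::collections::HashMap;

const SPECIAL_TOKENS: [&str; 5] = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"];

/// Hypothetical helper (not part of the crate's API): add any missing
/// special token to `vocab`, leaving existing entries untouched.
fn ensure_special_tokens(vocab: &mut HashMap<String, u32>) {
    // Assumption: new tokens get IDs starting one past the largest ID present
    // (or 0 when the map is empty).
    let mut next_id = vocab.values().copied().max().map_or(0, |m| m + 1);
    for tok in SPECIAL_TOKENS.iter() {
        if !vocab.contains_key(*tok) {
            vocab.insert(tok.to_string(), next_id);
            next_id += 1;
        }
    }
}
```

Tokens that are already present keep their original IDs, so a vocabulary that lists all five special tokens passes through unchanged.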
pub fn from_vocab_file(path: &str) -> Result<Self>
Load a tokenizer from a vocab.txt file (one token per line; line
index = token ID, 0-based).
Returns an error if the file cannot be read or if the resulting vocabulary is missing required special tokens after auto-insertion.
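The "line index = token ID" convention can be sketched with the standard library alone. `parse_vocab` and `load_vocab` are hypothetical names used only for this illustration, and the handling of trailing whitespace is an assumption; the crate's own parsing may differ.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;

/// Hypothetical helper: map each line of a vocab.txt body to its 0-based
/// line index. Trailing whitespace (such as '\r') is trimmed; blank lines
/// still consume an index so later IDs stay aligned with the file.
fn parse_vocab(contents: &str) -> HashMap<String, u32> {
    contents
        .lines()
        .enumerate()
        .map(|(i, line)| (line.trim_end().to_string(), i as u32))
        .collect()
}

/// Hypothetical helper: read a vocab file from disk and parse it.
fn load_vocab(path: &str) -> io::Result<HashMap<String, u32>> {
    Ok(parse_vocab(&fs::read_to_string(path)?))
}
```

If a token appears on more than one line, this sketch keeps the ID of the last occurrence.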
pub fn with_max_len(self, max_len: usize) -> Self
Override the maximum sequence length (default 512).
pub fn cls_token_id(&self) -> u32
Returns the [CLS] token ID.
pub fn sep_token_id(&self) -> u32
Returns the [SEP] token ID.
pub fn pad_token_id(&self) -> u32
Returns the [PAD] token ID.
pub fn mask_token_id(&self) -> u32
Returns the [MASK] token ID.
pub fn unk_token_id(&self) -> u32
Returns the [UNK] token ID.
pub fn vocab_size(&self) -> usize
Returns the vocabulary size.
pub fn vocab(&self) -> &HashMap<String, u32>
Return a reference to the full token → id vocabulary map.
pub fn tokenize(&self, text: &str) -> Vec<String>
Tokenize text into a list of subword strings.
Applies basic tokenization (whitespace + punctuation split, optional
lowercasing) followed by WordPiece subword segmentation. Unknown
characters/words map to "[UNK]".
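WordPiece segmentation is commonly implemented as greedy longest-match-first. The following is a minimal sketch of that general technique for a single word, not the crate's implementation; the function name `wordpiece` and the use of a `HashSet<&str>` vocabulary are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Split one word into WordPiece tokens using greedy longest-match-first.
/// Pieces after the first carry the "##" continuation prefix. If any part
/// of the word cannot be matched, the whole word becomes "[UNK]".
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining substring first, then shrink from the right.
        let mut end = chars.len();
        let mut found = None;
        while start < end {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate.insert_str(0, "##");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}
```

With a vocabulary containing `play`, `##ing`, and `##ed`, the word `playing` splits into `play` and `##ing`, while a word with no matching pieces collapses to a single `[UNK]`.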
pub fn encode(&self, text: &str) -> Result<Vec<u32>>
Encode a single text segment as [CLS] tokens [SEP].
Returns the flat sequence of token IDs. Use encode_pair for
two-segment inputs (e.g. question + context).
pub fn encode_pair(&self, text_a: &str, text_b: &str) -> Result<(Vec<u32>, Vec<u32>)>
Encode a pair of text segments (e.g. sentence A and sentence B).
Layout: [CLS] A-tokens [SEP] B-tokens [SEP]
Returns (token_ids, token_type_ids) where token_type_ids[i] is 0
for the first segment and 1 for the second.
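The layout and segment IDs described above can be sketched on already-converted token IDs. `build_pair` is a hypothetical helper, and assigning `[CLS]` and the first `[SEP]` to segment 0 and the final `[SEP]` to segment 1 follows the usual BERT convention; it is an assumption about this crate rather than something the page states.

```rust
/// Hypothetical helper: assemble `[CLS] A [SEP] B [SEP]` from two ID slices
/// and return `(token_ids, token_type_ids)`.
fn build_pair(a: &[u32], b: &[u32], cls: u32, sep: u32) -> (Vec<u32>, Vec<u32>) {
    let mut ids = Vec::with_capacity(a.len() + b.len() + 3);
    ids.push(cls);
    ids.extend_from_slice(a);
    ids.push(sep);
    // Everything up to and including the first [SEP] is treated as segment 0.
    let first_len = ids.len();
    ids.extend_from_slice(b);
    ids.push(sep);
    // The remainder, including the final [SEP], is treated as segment 1.
    let mut type_ids = vec![0u32; first_len];
    type_ids.resize(ids.len(), 1);
    (ids, type_ids)
}
```

Both returned vectors always have the same length, which is what downstream model inputs typically require.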
pub fn encode_single(&self, text: &str, max_length: usize, padding: bool, truncation: bool) -> Result<BertEncoding>
Build a single BertEncoding for text, with optional padding and
truncation to max_length.
If padding is true, short sequences are padded with [PAD] to
reach max_length. If truncation is true, long sequences are
trimmed (preserving [CLS] and [SEP]).
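A minimal sketch of the padding and truncation rules, operating on a sequence that is already wrapped in `[CLS]` ... `[SEP]`. The helper name `fit_to_length` and the returned attention mask are assumptions for illustration; the fields of `BertEncoding` are not shown on this page.

```rust
/// Hypothetical helper: fit `ids` to `max_length`.
/// Returns `(input_ids, attention_mask)`, where the mask is 1 for real
/// tokens and 0 for padding.
fn fit_to_length(
    mut ids: Vec<u32>,
    max_length: usize,
    padding: bool,
    truncation: bool,
    pad: u32,
) -> (Vec<u32>, Vec<u32>) {
    // Truncate from the right but keep the trailing [SEP]; [CLS] stays
    // because it is the first element. Needs room for both specials.
    if truncation && ids.len() > max_length && max_length >= 2 {
        let sep = ids[ids.len() - 1];
        ids.truncate(max_length - 1);
        ids.push(sep);
    }
    let mut mask = vec![1u32; ids.len()];
    // Pad short sequences up to max_length.
    if padding && ids.len() < max_length {
        ids.resize(max_length, pad);
        mask.resize(max_length, 0);
    }
    (ids, mask)
}
```

When both flags are false the sequence passes through unchanged, so the caller decides whether a fixed length is enforced.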
pub fn encode_batch(&self, texts: &[&str], max_length: usize, padding: bool, truncation: bool) -> Result<BatchEncoding>
Encode a batch of texts with consistent sequence length.
When padding is true, all sequences in the batch are padded to the
length of the longest sequence or to max_length, whichever is smaller.
When truncation is true, sequences exceeding max_length are truncated.
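The batch padding target can be sketched as the smaller of the longest sequence length and `max_length`. `pad_batch` is a hypothetical helper working on already-encoded ID sequences; it handles padding only and leaves longer sequences untouched, on the assumption that truncation is applied as a separate step.

```rust
/// Hypothetical helper: pad every sequence in `batch` to
/// `min(longest sequence length, max_length)` using `pad`.
fn pad_batch(batch: &[Vec<u32>], max_length: usize, pad: u32) -> Vec<Vec<u32>> {
    let longest = batch.iter().map(|seq| seq.len()).max().unwrap_or(0);
    let target = longest.min(max_length);
    batch
        .iter()
        .map(|seq| {
            let mut row = seq.clone();
            // Only grow short rows; rows longer than the target are not
            // shortened here.
            if row.len() < target {
                row.resize(target, pad);
            }
            row
        })
        .collect()
}
```

Padding to the batch maximum rather than always to `max_length` avoids spending computation on padding positions when every input in the batch is short.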
pub fn decode(&self, ids: &[u32]) -> String
Decode a sequence of token IDs back to a human-readable string.
Special tokens ([CLS], [SEP], [PAD], [MASK]) are skipped.
WordPiece continuation tokens (prefixed with ##) are merged directly
onto the preceding piece without a space.
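The merging rule can be sketched on token strings, that is, after IDs have been mapped back to tokens. `join_pieces` is a hypothetical helper written for this illustration and is not the crate's decoder.

```rust
/// Hypothetical helper: join WordPiece tokens into readable text.
/// Skips [CLS], [SEP], [PAD], and [MASK]; glues "##" continuation pieces
/// onto the previous piece; separates all other tokens with one space.
fn join_pieces(tokens: &[&str]) -> String {
    const SKIPPED: [&str; 4] = ["[CLS]", "[SEP]", "[PAD]", "[MASK]"];
    let mut out = String::new();
    for tok in tokens {
        if SKIPPED.contains(tok) {
            continue;
        }
        if let Some(rest) = tok.strip_prefix("##") {
            // Continuation piece: append without a space.
            out.push_str(rest);
        } else {
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(tok);
        }
    }
    out
}
```

Note that `[UNK]` is not in the skipped list above, matching the description, so unknown tokens remain visible in the decoded text.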
pub fn convert_token_to_id(&self, token: &str) -> Option<u32>
Convert a token string to its ID (exposed for testing and downstream use).
pub fn convert_id_to_token(&self, id: u32) -> Option<&str>
Convert a token ID to its string representation.
Trait Implementations
impl Clone for BertTokenizer
fn clone(&self) -> BertTokenizer
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
Auto Trait Implementations
impl Freeze for BertTokenizer
impl RefUnwindSafe for BertTokenizer
impl Send for BertTokenizer
impl Sync for BertTokenizer
impl Unpin for BertTokenizer
impl UnsafeUnpin for BertTokenizer
impl UnwindSafe for BertTokenizer
Blanket Implementations
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pointable for T
impl<SS, SP> SupersetOf<SS> for SP where SS: SubsetOf<SP>
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.