pub struct ByteLevelBpeTokenizer {
pub vocab: HashMap<String, u32>,
pub id_to_token: Vec<String>,
pub merges: Vec<(String, String)>,
pub byte_encoder: HashMap<u8, char>,
pub byte_decoder: HashMap<char, u8>,
}
GPT-2-style byte-level BPE tokenizer.
Every input byte is first mapped to a unique Unicode character via the
GPT-2 byte→unicode table, so the BPE algorithm operates on sequences of
Unicode characters rather than on raw bytes. Because every byte has a
mapping, encoding is lossless and no [UNK] token is needed for arbitrary
UTF-8 input.
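The byte→unicode table described above can be sketched as follows. This is a minimal illustration of the well-known GPT-2 construction, not this crate's source: the function names `build_byte_tables`, `bytes_to_unicode`, and `unicode_to_bytes` are hypothetical. Printable Latin-1 bytes map to themselves, and the remaining bytes are shifted to code points starting at U+0100, which is why the space byte becomes Ġ (U+0120).

```rust
use std::collections::HashMap;

/// Build a GPT-2-style byte→unicode table and its inverse.
/// (Hypothetical helper; the crate's own construction is not shown in the docs.)
fn build_byte_tables() -> (HashMap<u8, char>, HashMap<char, u8>) {
    let mut encoder = HashMap::with_capacity(256);
    let mut decoder = HashMap::with_capacity(256);
    // Non-printable bytes are assigned code points 256, 257, ... in byte order.
    let mut next_extra = 256u32;
    for b in 0u8..=255 {
        let printable = matches!(b, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF);
        let code = if printable {
            b as u32
        } else {
            let c = next_extra;
            next_extra += 1;
            c
        };
        let ch = char::from_u32(code).expect("all assigned code points are valid chars");
        encoder.insert(b, ch);
        decoder.insert(ch, b);
    }
    (encoder, decoder)
}

/// Map raw bytes to their unicode stand-ins.
fn bytes_to_unicode(bytes: &[u8], encoder: &HashMap<u8, char>) -> String {
    bytes.iter().map(|b| encoder[b]).collect()
}

/// Invert the mapping; returns None if a char is not in the table.
fn unicode_to_bytes(s: &str, decoder: &HashMap<char, u8>) -> Option<Vec<u8>> {
    s.chars().map(|c| decoder.get(&c).copied()).collect()
}
```

Since the table is a bijection over all 256 byte values, mapping a string's bytes forward and back recovers the original bytes exactly.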
Fields§
vocab: HashMap<String, u32>
token string → integer id
id_to_token: Vec<String>
integer id → token string
merges: Vec<(String, String)>
ordered merge rules (left_piece, right_piece)
byte_encoder: HashMap<u8, char>
byte → unicode char
byte_decoder: HashMap<char, u8>
unicode char → byte (inverse of byte_encoder)
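To illustrate what "ordered merge rules" means for the merges field, here is a minimal sketch of one simple way such rules could be applied to a single pre-tokenised word. The function `apply_merges` is hypothetical and assumes the rules are applied one after another in list order; the crate's actual encoding routine is not documented on this page and may differ.

```rust
/// Apply ordered merge rules to one word, starting from single characters.
/// (Hypothetical sketch; assumes rules are applied sequentially in list order.)
fn apply_merges(word: &str, merges: &[(String, String)]) -> Vec<String> {
    // Start with one piece per character.
    let mut pieces: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    for (left, right) in merges {
        let mut i = 0;
        while i + 1 < pieces.len() {
            if pieces[i] == *left && pieces[i + 1] == *right {
                // Fuse the adjacent pair into a single piece.
                pieces[i] = format!("{}{}", left, right);
                pieces.remove(i + 1);
            }
            i += 1;
        }
    }
    pieces
}
```

With the rules (l, l), (h, e), (he, ll) applied in that order, the word hello goes from h e l l o to h e ll o, then he ll o, then hell o.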
Implementations§
impl ByteLevelBpeTokenizer
pub fn train(texts: &[&str], config: ByteLevelBpeConfig) -> Self
Train a new ByteLevelBpeTokenizer from raw text slices.
Pre-tokenises on whitespace boundaries and prepends Ġ (U+0120) to
every word that is not at the beginning of the pre-tokenised
sequence.
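The pre-tokenisation rule described above can be sketched like this. The function `pre_tokenize` is a hypothetical stand-in, and it assumes that runs of whitespace are collapsed by `split_whitespace`; the crate may treat repeated or leading whitespace differently.

```rust
/// Split on whitespace and prefix every word except the first with Ġ (U+0120).
/// (Hypothetical sketch of the documented rule, not the crate's implementation.)
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .enumerate()
        .map(|(i, word)| {
            if i == 0 {
                word.to_string()
            } else {
                format!("\u{0120}{}", word)
            }
        })
        .collect()
}
```

Under these assumptions, "hello big world" becomes the three words hello, Ġbig, and Ġworld, so the marker records where a space preceded a word.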
impl ByteLevelBpeTokenizer
pub fn save_vocab(&self, vocab_path: &str, merges_path: &str) -> Result<()>
Save vocabulary (HuggingFace JSON format) and merge rules to separate files.
The vocab file is a JSON object mapping token strings to integer IDs.
The merges file contains one merge rule per line: left right.
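The merges file layout described above (one `left right` pair per line) can be sketched with two small helpers. `merges_to_text` and `parse_merges` are hypothetical names, and the sketch makes no attempt to handle header or comment lines that some merges files carry.

```rust
/// Serialise merge rules, one "left right" pair per line.
/// (Hypothetical helper illustrating the documented file layout.)
fn merges_to_text(merges: &[(String, String)]) -> String {
    merges
        .iter()
        .map(|(left, right)| format!("{} {}\n", left, right))
        .collect()
}

/// Parse the same layout back into ordered merge rules.
/// Lines without a space separator (for example blank lines) are skipped.
fn parse_merges(text: &str) -> Vec<(String, String)> {
    text.lines()
        .filter_map(|line| {
            let (left, right) = line.split_once(' ')?;
            Some((left.to_string(), right.to_string()))
        })
        .collect()
}
```

Splitting on a single space is unambiguous here as long as pieces come from the byte→unicode table, because the space byte is represented as Ġ and so never appears literally inside a piece.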
pub fn load(vocab_path: &str, merges_path: &str) -> Result<Self>
Load a tokenizer from a HuggingFace-format vocab JSON and merges text file.
pub fn vocab_size(&self) -> usize
Return the vocabulary size.
pub fn id_to_token(&self, id: u32) -> Option<&str>
Look up the token string for an ID.
pub fn token_to_id(&self, token: &str) -> Option<u32>
Look up the ID for a token string.
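The two lookups mirror the vocab and id_to_token fields. As a minimal sketch of how such a pair of accessors could behave, the hypothetical `TokenIndex` type below stands in for the tokenizer and assumes that IDs are dense indices into the id_to_token vector; it is not the crate's code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in holding the two lookup tables.
struct TokenIndex {
    vocab: HashMap<String, u32>,
    id_to_token: Vec<String>,
}

impl TokenIndex {
    /// Assign IDs 0, 1, 2, ... in the order the tokens are given.
    fn from_tokens(tokens: &[&str]) -> Self {
        let id_to_token: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
        let vocab = id_to_token
            .iter()
            .enumerate()
            .map(|(i, t)| (t.clone(), i as u32))
            .collect();
        TokenIndex { vocab, id_to_token }
    }

    /// ID → token; None when the ID is out of range.
    fn id_to_token(&self, id: u32) -> Option<&str> {
        self.id_to_token.get(id as usize).map(String::as_str)
    }

    /// Token → ID; None when the token is not in the vocabulary.
    fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }

    /// Number of entries in the vocabulary.
    fn vocab_size(&self) -> usize {
        self.id_to_token.len()
    }
}
```

Both lookups return an Option, so callers can tell an unknown token or out-of-range ID apart from a valid result without panicking.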
Trait Implementations§
impl Clone for ByteLevelBpeTokenizer
fn clone(&self) -> ByteLevelBpeTokenizer
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
Auto Trait Implementations§
impl Freeze for ByteLevelBpeTokenizer
impl RefUnwindSafe for ByteLevelBpeTokenizer
impl Send for ByteLevelBpeTokenizer
impl Sync for ByteLevelBpeTokenizer
impl Unpin for ByteLevelBpeTokenizer
impl UnsafeUnpin for ByteLevelBpeTokenizer
impl UnwindSafe for ByteLevelBpeTokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
impl<T> Pointable for T
impl<SS, SP> SupersetOf<SS> for SP
where
    SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its
superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.