Struct rust_tokenizers::vocab::BpePairVocab
pub struct BpePairVocab {
    pub values: HashMap<(String, String), i64>,
}
Byte pair Encoding Vocab
BPE vocab containing the merges (dictionary of pairs with their priority) used to merge pairs together. This vocabulary element is used on BPE tokenizers such as GPT2 or RoBERTa. This vocabulary is not meant to be used directly, but rather as part of a BPE Tokenizer.
Fields§
§values: HashMap<(String, String), i64>
Implementations§
impl BpePairVocab
pub fn from_file<P: AsRef<Path>>( path: P ) -> Result<BpePairVocab, TokenizerError>
Create a new BpePairVocab from a flat file containing merges in the format first_element second_element. The indices are implied by the line position of each pair in the merges file. The first line needs to be a header and is skipped.
Example
use rust_tokenizers::vocab::{BpePairVocab, Vocab};
let path = "path/to/file";
let bpe_vocab = BpePairVocab::from_file(path);
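The file format described above can be illustrated with a standard-library-only sketch. This is not the crate's implementation: parse_merges is a hypothetical helper, and it assumes that ranks start at 0 on the first line after the header and that lines with fewer than two elements are ignored.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of rust_tokenizers): builds a merges map
/// from the text of a merges file.
fn parse_merges(contents: &str) -> HashMap<(String, String), i64> {
    let mut values = HashMap::new();
    // Assumption: the first pair after the header gets rank 0.
    let mut rank: i64 = 0;
    // Skip the header line, then read one pair per line.
    for line in contents.lines().skip(1) {
        let mut parts = line.split_whitespace();
        // Assumption: lines with fewer than two elements are ignored.
        if let (Some(first), Some(second)) = (parts.next(), parts.next()) {
            values.insert((first.to_string(), second.to_string()), rank);
            rank += 1;
        }
    }
    values
}
```

The resulting map has the same shape as the values field, so each pair's id reflects its position in the file.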
pub fn from_sentencepiece_file<P: AsRef<Path>>( path: P ) -> Result<BpePairVocab, TokenizerError>
Create a new BpePairVocab from a SentencePiece file containing a BPE model.
Example
use rust_tokenizers::vocab::{BpePairVocab, Vocab};
let path = "path/to/spiece.model";
let bpe_vocab = BpePairVocab::from_sentencepiece_file(path);
pub fn byte_pair_to_id(&self, byte_pair: &BpePairRef<'_>) -> Option<&i64>
Gets the id of a “byte pair” in the merges vocab. Returns an optional index for the pair if it is found in the vocabulary.
Example
use rust_tokenizers::vocab::{BpePairRef, BpePairVocab, Vocab};
let path = "path/to/file";
let bpe_vocab = BpePairVocab::from_file(path).unwrap();
let query = BpePairRef {
    byte_1: &"won".to_string(),
    byte_2: &"derful".to_string(),
};
let id = bpe_vocab.byte_pair_to_id(&query);
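To show how such pair ids can drive merging, here is a minimal sketch that works on a plain HashMap with the same shape as the values field. best_merge is a hypothetical helper, not a crate function, and it assumes that a lower id means a higher merge priority, which this page does not state explicitly.

```rust
use std::collections::HashMap;

/// Hypothetical helper: returns the position and id of the adjacent pair
/// with the lowest id, or None if no adjacent pair is in the merges map.
fn best_merge(
    symbols: &[&str],
    values: &HashMap<(String, String), i64>,
) -> Option<(usize, i64)> {
    symbols
        .windows(2) // every adjacent pair of symbols
        .enumerate()
        .filter_map(|(pos, pair)| {
            values
                .get(&(pair[0].to_string(), pair[1].to_string()))
                .map(|&id| (pos, id))
        })
        // Assumption: the lowest id is the highest-priority merge.
        .min_by_key(|&(_, id)| id)
}
```

A BPE tokenizer would typically repeat this selection, merging the chosen pair each time, until no adjacent pair is found in the vocabulary.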
Trait Implementations§
impl Clone for BpePairVocab
fn clone(&self) -> BpePairVocab
Returns a copy of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
Auto Trait Implementations§
impl RefUnwindSafe for BpePairVocab
impl Send for BpePairVocab
impl Sync for BpePairVocab
impl Unpin for BpePairVocab
impl UnwindSafe for BpePairVocab
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.