pub struct BpeDataset {
    pub tokens: Vec<usize>,
    pub tokenizer: BpeTokenizer,
}
BPE-tokenized dataset. Each token is a subword averaging roughly 4-5 characters, so a 128-position context window covers roughly 500-600 characters of text, rather than the 128 it would cover with one token per character.
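The coverage figure is simple arithmetic: context length times average token length. A minimal sketch of that estimate follows; `approx_chars_covered` is a hypothetical helper written for illustration and is not part of this crate.

```rust
/// Rough estimate of how many characters a context window spans.
/// `avg_chars_per_token` is an assumed average, not a measured value.
fn approx_chars_covered(context_len: usize, avg_chars_per_token: f64) -> f64 {
    context_len as f64 * avg_chars_per_token
}
```

With 128 positions this gives 512 characters at 4 characters per token and 640 at 5, in line with the rough range quoted in the description.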
Fields§
tokens: Vec<usize>
tokenizer: BpeTokenizer
Implementations§
impl BpeDataset
pub fn from_text(text: &str, target_vocab: usize) -> Self
Build from text with BPE tokenization.
target_vocab: number of subword tokens to learn (for example 512, 1024, or 2048).
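To make the role of a vocabulary target concrete, here is a toy byte-level BPE trainer. It is a sketch of the general technique, not the implementation behind `BpeTokenizer`. The function name `toy_bpe`, the 256-byte base vocabulary, the reading of `target_vocab` as the total vocabulary size, and the rule of stopping once no pair occurs at least twice are all assumptions made for this example.

```rust
use std::collections::HashMap;

/// Toy byte-level BPE trainer (illustrative only).
/// Assumptions: ids 0..=255 are raw bytes, and `target_vocab` is the
/// total vocabulary size (base bytes plus learned merges).
/// Returns the encoded token ids and the list of learned merges.
fn toy_bpe(text: &str, target_vocab: usize) -> (Vec<usize>, Vec<(usize, usize)>) {
    let mut tokens: Vec<usize> = text.bytes().map(|b| b as usize).collect();
    let mut merges: Vec<(usize, usize)> = Vec::new();

    while 256 + merges.len() < target_vocab {
        // Count every adjacent pair in the current token sequence.
        let mut counts: HashMap<(usize, usize), usize> = HashMap::new();
        for w in tokens.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
        // Most frequent pair that occurs at least twice; ties go to the
        // smallest pair so the result is deterministic.
        let best = counts
            .iter()
            .filter(|entry| *entry.1 >= 2)
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .map(|(&pair, _)| pair);
        let Some(pair) = best else { break };

        // Replace each non-overlapping occurrence with a fresh token id.
        let new_id = 256 + merges.len();
        let mut merged = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
                merged.push(new_id);
                i += 2;
            } else {
                merged.push(tokens[i]);
                i += 1;
            }
        }
        tokens = merged;
        merges.push(pair);
    }
    (tokens, merges)
}
```

Each merge adds one vocabulary entry and shortens the token sequence, which is why a larger vocabulary target generally lets a fixed-size context window cover more text.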
pub fn from_jsonl(path: &Path, target_vocab: usize) -> Result<Self>
Load from a JSONL file with BPE tokenization.
pub fn from_file(path: &Path, target_vocab: usize) -> Result<Self>
Load from a plain text file with BPE tokenization.
pub fn vocab_size(&self) -> usize
pub fn len(&self) -> usize
pub fn decode(&self, tokens: &[usize]) -> String
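The three methods above carry no description in the source. Going by names and signatures alone, they plausibly report the vocabulary size, the number of tokens in the dataset, and the text for a slice of token ids. The stand-in below sketches that reading; `ToyDataset` and its `vocab` field are hypothetical and say nothing about how `BpeDataset` or `BpeTokenizer` store their data.

```rust
/// Hypothetical stand-in for illustration only; not the crate's `BpeDataset`.
struct ToyDataset {
    tokens: Vec<usize>,
    /// Assumed representation: the subword string for each token id.
    vocab: Vec<String>,
}

impl ToyDataset {
    /// Number of distinct token ids.
    fn vocab_size(&self) -> usize {
        self.vocab.len()
    }

    /// Number of tokens in the dataset.
    fn len(&self) -> usize {
        self.tokens.len()
    }

    /// Concatenate the subword string for each id.
    /// Panics if an id is out of range for the vocabulary.
    fn decode(&self, tokens: &[usize]) -> String {
        tokens.iter().map(|&id| self.vocab[id].as_str()).collect()
    }
}
```

Under this reading, decoding a window of the dataset's own `tokens` recovers the corresponding stretch of source text.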
Auto Trait Implementations§
impl Freeze for BpeDataset
impl RefUnwindSafe for BpeDataset
impl Send for BpeDataset
impl Sync for BpeDataset
impl Unpin for BpeDataset
impl UnsafeUnpin for BpeDataset
impl UnwindSafe for BpeDataset
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.