pub struct ParallelRegexBPETokenizerConfig { /* private fields */ }
Configuration for training a regex-based BPE tokenizer in parallel with rayon.
This struct implements Trainable and produces a ParallelRegexBPETokenizer.
Implementations§
impl ParallelRegexBPETokenizerConfig
pub fn build(vocab_size: u32, pattern: Option<&str>) -> Result<Self, RegexBPETokenizerConfigError>
Create a new configuration for training a regex BPE tokenizer.
§Arguments
vocab_size - The desired vocabulary size (must be at least 256).
pattern - Optional custom regex pattern. If None, uses the GPT-4 split pattern.
§Returns
Ok(ParallelRegexBPETokenizerConfig) if the configuration is valid.
Err(RegexBPETokenizerConfigError) if the vocab size is too small or the pattern is invalid.
pub fn from_merges(merges: u32, pattern: Option<&str>) -> Result<Self, RegexBPETokenizerConfigError>
Create a new configuration from the number of merges instead of vocab size.
§Arguments
merges - The number of merge operations to perform.
pattern - Optional custom regex pattern.
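The relationship between the two constructors can be illustrated with a minimal, standard-library-only sketch. Everything here is a hypothetical stand-in, not the crate's code: `SketchConfig`, `SketchConfigError`, and `BASE_VOCAB` are invented names, and the rule `vocab_size = 256 + merges` is an assumption based on the usual byte-level BPE convention (one base token per byte value). Regex validation of the pattern is omitted because it would need an external regex library.

```rust
/// Hypothetical stand-in for the real config; field names are assumptions.
#[derive(Debug, PartialEq)]
pub struct SketchConfig {
    vocab_size: u32,
    /// `None` means "use the default split pattern".
    pattern: Option<String>,
}

/// Hypothetical error type; the real error enum may differ.
#[derive(Debug, PartialEq)]
pub enum SketchConfigError {
    VocabSizeTooSmall(u32),
    MergesOverflow(u32),
}

/// Assumed base vocabulary: one token per possible byte value.
const BASE_VOCAB: u32 = 256;

impl SketchConfig {
    /// Rejects vocabulary sizes below the 256 byte-level tokens.
    pub fn build(vocab_size: u32, pattern: Option<&str>) -> Result<Self, SketchConfigError> {
        if vocab_size < BASE_VOCAB {
            return Err(SketchConfigError::VocabSizeTooSmall(vocab_size));
        }
        Ok(Self {
            vocab_size,
            pattern: pattern.map(str::to_owned),
        })
    }

    /// Converts a merge count into a vocabulary size (assumed: 256 + merges),
    /// guarding against u32 overflow, then defers to `build`.
    pub fn from_merges(merges: u32, pattern: Option<&str>) -> Result<Self, SketchConfigError> {
        let vocab_size = BASE_VOCAB
            .checked_add(merges)
            .ok_or(SketchConfigError::MergesOverflow(merges))?;
        Self::build(vocab_size, pattern)
    }
}
```

Under these assumptions, `SketchConfig::from_merges(44, None)` and `SketchConfig::build(300, None)` produce the same configuration.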
Trait Implementations§
impl Deserializable for ParallelRegexBPETokenizerConfig
fn load(&self, path: &Path) -> Result<Self::Output, Error>
Loads a ParallelRegexBPETokenizer from a file.
The file must contain a header line with the regex pattern, followed by merge rules (one per line).
§Arguments
path - The path to load the tokenizer from. Must have a .stok extension.
§Returns
Ok(ParallelRegexBPETokenizer) if the tokenizer was loaded successfully.
Err(std::io::Error) if the file extension is invalid, reading fails, or the file format is invalid.
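The two checks described above (extension, then header line) might look roughly like the following sketch. The helper names `check_extension` and `split_header` and the chosen `ErrorKind` values are assumptions for illustration. The merge-rule syntax is not specified here, so the rule lines are returned unparsed.

```rust
use std::io::{Error, ErrorKind};
use std::path::Path;

/// Rejects any path whose extension is not `stok`.
/// (The error kind is an assumption; the real implementation may use another.)
pub fn check_extension(path: &Path) -> Result<(), Error> {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("stok") => Ok(()),
        _ => Err(Error::new(ErrorKind::InvalidInput, "expected a .stok file")),
    }
}

/// Splits file contents into the header line (the regex pattern) and the
/// remaining non-empty lines (merge rules, left unparsed in this sketch).
pub fn split_header(contents: &str) -> Result<(&str, Vec<&str>), Error> {
    let mut lines = contents.lines();
    let pattern = lines
        .next()
        .filter(|line| !line.is_empty())
        .ok_or_else(|| Error::new(ErrorKind::InvalidData, "missing pattern header"))?;
    let rules: Vec<&str> = lines.filter(|line| !line.is_empty()).collect();
    Ok((pattern, rules))
}
```

A caller would run `check_extension` first, read the file with `std::fs::read_to_string`, and then pass the contents to `split_header`.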
type Output = ParallelRegexBPETokenizer
The tokenizer type produced by loading.
impl Trainable for ParallelRegexBPETokenizerConfig
type Output = ParallelRegexBPETokenizer
The tokenizer type produced by training.
type TrainingError = Infallible
The error type for training. Because it is Infallible, training cannot return an error.
fn train(&self, dataset: &str) -> Result<ParallelRegexBPETokenizer, Self::TrainingError>
Trains a tokenizer on a given dataset to a target vocabulary size.
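For orientation, the core loop of byte-level BPE training can be sketched as follows. This is a generic, sequential illustration and not this crate's implementation: it skips the regex pre-splitting and the rayon parallelism, and the function names and the tie-breaking rule (smallest pair wins) are assumptions.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Returns the most frequent adjacent pair; ties go to the smallest pair
/// so the result is deterministic (an assumption of this sketch).
pub fn most_frequent_pair(ids: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for window in ids.windows(2) {
        *counts.entry((window[0], window[1])).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(pair, count)| (count, Reverse(pair)))
        .map(|(pair, _)| pair)
}

/// Replaces every non-overlapping occurrence of `pair` with `new_id`,
/// scanning left to right.
pub fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Learns merges until `vocab_size` is reached or no pairs remain.
/// New token ids start at 256, after the byte-level base tokens.
pub fn train_sketch(text: &str, vocab_size: u32) -> Vec<((u32, u32), u32)> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for new_id in 256..vocab_size {
        match most_frequent_pair(&ids) {
            Some(pair) => {
                ids = merge(&ids, pair, new_id);
                merges.push((pair, new_id));
            }
            None => break,
        }
    }
    merges
}
```

With the input `"aaabdaaabac"` and a target vocabulary of 259, this sketch learns three merges: `(97, 97) -> 256`, `(97, 98) -> 257`, and `(256, 257) -> 258`.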
Auto Trait Implementations§
impl Freeze for ParallelRegexBPETokenizerConfig
impl RefUnwindSafe for ParallelRegexBPETokenizerConfig
impl Send for ParallelRegexBPETokenizerConfig
impl Sync for ParallelRegexBPETokenizerConfig
impl Unpin for ParallelRegexBPETokenizerConfig
impl UnwindSafe for ParallelRegexBPETokenizerConfig
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.