#[non_exhaustive]
pub struct TokenizerConfig {
    pub add_bos: bool,
    pub add_eos: bool,
    pub bos_token_id: u32,
    pub eos_token_id: u32,
    pub unk_token_id: u32,
    pub pad_token_id: u32,
    pub max_length: Option<usize>,
    pub byte_level_decode: bool,
}
Configuration knobs for an OxiTokenizer.
Marked #[non_exhaustive] so that new optional knobs can be added in
future minor releases without breaking downstream code. Inside this crate,
struct literals with ..Default::default() continue to work.
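A minimal sketch of how code outside the crate can still build and inspect a config despite #[non_exhaustive]: start from Default::default() and assign the public fields, and always destructure with a trailing `..`. The struct below is a local stand-in that mirrors the documented fields so the example compiles on its own; the derived Default and Debug, the helper names build_config and describe, and the concrete token ID and length are assumptions for illustration, not the crate's actual defaults or API.

```rust
// Local stand-in mirroring the documented fields. In real code, import the
// crate's own TokenizerConfig instead. The derived Default (false / 0 / None)
// is an assumption and may differ from the crate's real defaults.
#[derive(Clone, Debug, Default)]
#[non_exhaustive]
pub struct TokenizerConfig {
    pub add_bos: bool,
    pub add_eos: bool,
    pub bos_token_id: u32,
    pub eos_token_id: u32,
    pub unk_token_id: u32,
    pub pad_token_id: u32,
    pub max_length: Option<usize>,
    pub byte_level_decode: bool,
}

/// Hypothetical helper: builds a config without a struct literal, which is
/// the pattern that keeps compiling when new fields are added upstream.
pub fn build_config() -> TokenizerConfig {
    let mut cfg = TokenizerConfig::default();
    cfg.add_bos = true;
    cfg.bos_token_id = 1; // hypothetical ID, depends on the vocabulary
    cfg.max_length = Some(512); // hypothetical limit
    cfg
}

/// Hypothetical helper: destructures with a trailing `..`, which a
/// non-exhaustive struct requires in external crates.
pub fn describe(cfg: &TokenizerConfig) -> String {
    let TokenizerConfig { add_bos, add_eos, max_length, .. } = cfg;
    format!("bos={add_bos} eos={add_eos} max_length={max_length:?}")
}
```

Because the stand-in lives in the same crate as the helpers, the compiler would also accept a struct literal here; against the real type from a downstream crate, only the default-then-assign form shown in build_config compiles.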
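Fields (Non-exhaustive)
This struct is marked as non-exhaustive. Non-exhaustive structs could have additional fields added in future. Therefore, non-exhaustive structs cannot be constructed in external crates using the traditional Struct { .. } syntax; cannot be matched against without a wildcard ..; and struct update syntax will not work.
add_bos: bool
Whether to prepend a BOS (beginning-of-sequence) token.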
add_eos: bool
Whether to append an EOS (end-of-sequence) token.
bos_token_id: u32
Token ID used for BOS.
eos_token_id: u32
Token ID used for EOS.
unk_token_id: u32
Token ID used for unknown tokens (fallback).
pad_token_id: u32
Token ID used for padding.
max_length: Option<usize>
Optional maximum output length (tokens are truncated, not padded).
byte_level_decode: bool
When true, the decoder applies the GPT-2 bytes ↔ unicode inverse
map to every token string before emitting bytes (see
crate::hf_format). When false, the legacy Ġ-stripping path is
used (same behaviour as 0.1.x).
from_json_file / OxiTokenizer::from_hf_tokenizer_json set this to
true automatically; hand-built configs default to false for
backwards compatibility.
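To illustrate what the byte-level path undoes, here is a minimal sketch of the inverse GPT-2 bytes ↔ unicode map. It follows the published GPT-2 table (printable bytes keep their own code point, all others are shifted to U+0100 and up in byte order); the function names gpt2_unicode_to_byte and token_to_bytes are hypothetical and are not taken from crate::hf_format, whose actual implementation may differ.

```rust
use std::collections::HashMap;

/// Build the inverse of the GPT-2 byte-to-unicode table (char -> original byte).
pub fn gpt2_unicode_to_byte() -> HashMap<char, u8> {
    let mut table = HashMap::with_capacity(256);
    let mut shifted = 0u32; // number of non-printable bytes remapped so far
    for byte in 0u32..=255 {
        // Printable bytes map to the code point with the same value.
        let keeps_own_code_point = (33..=126).contains(&byte)
            || (161..=172).contains(&byte)
            || (174..=255).contains(&byte);
        let code_point = if keeps_own_code_point {
            byte
        } else {
            // Every other byte is shifted to U+0100 and up, in byte order.
            let cp = 256 + shifted;
            shifted += 1;
            cp
        };
        let ch = char::from_u32(code_point).expect("code points below U+0144 are valid");
        table.insert(ch, byte as u8);
    }
    table
}

/// Hypothetical helper: map one token string back to raw bytes.
/// Characters that are not in the table are skipped in this sketch.
pub fn token_to_bytes(token: &str, table: &HashMap<char, u8>) -> Vec<u8> {
    token.chars().filter_map(|ch| table.get(&ch).copied()).collect()
}
```

With this table, token_to_bytes("Ġhello", &table) yields the bytes of " hello", since Ġ (U+0120) is the stand-in for the space byte 0x20. Multi-byte UTF-8 sequences are recovered as well: "Ã©" maps back to the two bytes 0xC3 0xA9, which decode to "é". This is the case that a plain Ġ-stripping path does not cover.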
Trait Implementations
impl Clone for TokenizerConfig
fn clone(&self) -> TokenizerConfig
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.