pub struct NativeTokenizerBridge { /* private fields */ }
Bridge between the inference engine and OxiTokenizer.
Provides encode/decode/chat-format operations backed by the project-native
pure Rust BPE tokenizer. The bridge is Send + Sync and holds no mutable
state after construction.
Implementations
impl NativeTokenizerBridge
pub fn new(tokenizer: OxiTokenizer) -> Self
Create a bridge wrapping the provided OxiTokenizer, with no chat
template.
Use NativeTokenizerBridge::with_chatml if you need ChatML
formatting (e.g. for Qwen3 models).
pub fn char_level_fallback() -> Self
Create a minimal char-level fallback tokenizer.
This uses OxiTokenizer::char_level_stub with a 512-token vocabulary
and attaches no chat template. Useful for unit tests and smoke-checks
where a real vocab file is not required.
pub fn with_chatml(tokenizer: OxiTokenizer) -> Self
Create a bridge with a ChatML template pre-configured.
This is the correct constructor for Qwen3 / OxiBonsai models, which
use the <|im_start|>role\ncontent<|im_end|> format.
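The message layout named above can be sketched as a standalone function. This is only an illustration of the `<|im_start|>role\ncontent<|im_end|>` shape: `render_chatml` is a hypothetical helper, not part of the crate, and the newline placed after each `<|im_end|>` is an assumption rather than something this page states.

```rust
/// Render (role, content) pairs in the ChatML shape described above.
/// Hypothetical helper for illustration; not the crate's implementation.
fn render_chatml(messages: &[(&str, &str)]) -> String {
    let mut out = String::new();
    for (role, content) in messages {
        out.push_str("<|im_start|>");
        out.push_str(role);
        out.push('\n');
        out.push_str(content);
        out.push_str("<|im_end|>");
        // Assumption: a single newline separates consecutive messages.
        out.push('\n');
    }
    out
}
```

Calling `render_chatml(&[("user", "Hello!")])` yields one block that opens with `<|im_start|>user`; the real template may add further pieces, such as a trailing assistant prompt.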
pub fn char_level_fallback_with_chatml() -> Self
Create a char-level fallback tokenizer with a ChatML template.
Convenience constructor that combines char_level_fallback and
with_chatml — handy for tests that exercise the chat-formatting
path without a real vocab file.
pub fn from_json(
    vocab_json: &str,
    merges_json: &str,
    config: TokenizerConfig,
) -> Result<Self, NativeTokenizerError>
Create a bridge from a JSON-serialized vocabulary and merge table, using the supplied configuration.
vocab_json: { "token": id, … }
merges_json: [["a", "b"], …] (highest-priority merge first)
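To make "highest-priority merge first" concrete, here is a minimal sketch of how a BPE merge table is conventionally applied: a pair's rank is its index in the list, and the lowest-ranked adjacent pair is merged repeatedly until none applies. `apply_merges` is a hypothetical helper; how OxiTokenizer actually applies its merge table is not described on this page.

```rust
/// Apply an ordered merge table to a sequence of symbols.
/// `merges` is ordered highest-priority first, so a pair's rank is its index.
/// Hypothetical illustration of standard BPE merging, not the crate's code.
fn apply_merges(mut symbols: Vec<String>, merges: &[(&str, &str)]) -> Vec<String> {
    loop {
        // Find the adjacent pair with the lowest rank (ties go to the leftmost).
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                merges
                    .iter()
                    .position(|&(a, b)| a == pair[0] && b == pair[1])
                    .map(|rank| (rank, i))
            })
            .min();

        match best {
            Some((_, i)) => {
                // Fuse symbols[i] and symbols[i + 1] into one symbol.
                let right = symbols.remove(i + 1);
                symbols[i].push_str(&right);
            }
            // No remaining pair appears in the merge table.
            None => return symbols,
        }
    }
}
```

With merges `[("l", "o"), ("lo", "w")]`, the symbols `l`, `o`, `w` collapse to the single symbol `low`. The linear scan over `merges` keeps the sketch short; a real tokenizer would normally use a hash map from pair to rank.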
pub fn encode(&self, text: &str) -> Result<Vec<u32>, NativeTokenizerError>
Encode a text string into a sequence of token IDs.
Delegates directly to OxiTokenizer::encode.
pub fn decode(&self, ids: &[u32]) -> Result<String, NativeTokenizerError>
Decode a sequence of token IDs back into a UTF-8 string.
Special tokens (BOS, EOS, PAD, UNK) are silently skipped.
Unknown IDs produce \u{FFFD} (replacement character).
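The two decoding rules above (special tokens skipped, unknown IDs replaced by U+FFFD) can be illustrated with a toy vocabulary. This is a simplified sketch: `decode_sketch`, the map-based vocabulary, and the plain `String` return type are all assumptions made for the example, and the real method returns a `Result` and works on the tokenizer's own tables.

```rust
use std::collections::{HashMap, HashSet};

/// Decode IDs against a toy vocabulary, following the rules stated above.
/// Hypothetical stand-in; not the crate's implementation.
fn decode_sketch(ids: &[u32], vocab: &HashMap<u32, &str>, special: &HashSet<u32>) -> String {
    let mut out = String::new();
    for id in ids {
        // Special tokens (BOS, EOS, PAD, UNK) contribute nothing to the output.
        if special.contains(id) {
            continue;
        }
        match vocab.get(id) {
            Some(piece) => out.push_str(piece),
            // IDs missing from the vocabulary become the replacement character.
            None => out.push('\u{FFFD}'),
        }
    }
    out
}
```

In this sketch an ID is checked against the special set before the vocabulary lookup, so a special token never reaches the output even though it has a vocabulary entry.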
pub fn decode_token(&self, id: u32) -> Result<String, NativeTokenizerError>
Decode a single token ID to its string representation.
pub fn encode_batch(
    &self,
    texts: &[&str],
) -> Result<Vec<Vec<u32>>, NativeTokenizerError>
Encode a batch of texts, returning one Vec<u32> per input.
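A batch call with this return shape is commonly written by mapping a per-text encoder over the inputs and collecting into a single `Result`. The sketch below shows that idiom with a hypothetical ASCII-only stand-in encoder (`encode_ascii`) and `String` errors. Whether the real method stops at the first failing input, as `collect` does here, is an assumption.

```rust
/// Hypothetical stand-in encoder: maps each ASCII character to its code point
/// and rejects anything else.
fn encode_ascii(text: &str) -> Result<Vec<u32>, String> {
    text.chars()
        .map(|c| {
            if c.is_ascii() {
                Ok(c as u32)
            } else {
                Err(format!("unsupported character: {c:?}"))
            }
        })
        .collect()
}

/// Encode every text, producing one Vec<u32> per input in input order.
/// Collecting into Result returns the first error encountered, if any.
fn encode_batch_sketch(texts: &[&str]) -> Result<Vec<Vec<u32>>, String> {
    texts.iter().map(|text| encode_ascii(text)).collect()
}
```

The output vector has the same length and order as `texts`, so `result[i]` always corresponds to `texts[i]`.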
pub fn format_chat(
    &self,
    messages: &[(&str, &str)],
) -> Result<String, NativeTokenizerError>
Format a list of (role, content) pairs into a single prompt string
using the configured chat template.
Returns NativeTokenizerError::NoChatTemplate if no template was
provided at construction time.
Example
use oxibonsai_runtime::native_tokenizer::NativeTokenizerBridge;
let bridge = NativeTokenizerBridge::char_level_fallback_with_chatml();
let prompt = bridge
.format_chat(&[("user", "Hello!")])
.expect("format_chat should succeed");
assert!(prompt.contains("<|im_start|>user"));
pub fn vocab_size(&self) -> usize
Return the total vocabulary size.
pub fn bos_id(&self) -> u32
Return the BOS token ID from the underlying tokenizer configuration.
pub fn eos_id(&self) -> u32
Return the EOS token ID from the underlying tokenizer configuration.
pub fn is_special(&self, id: u32) -> bool
Return true if the given token ID is a special token (BOS/EOS/PAD/UNK).
pub fn inner(&self) -> &OxiTokenizer
Return a reference to the underlying OxiTokenizer.
pub fn chat_template(&self) -> Option<&ChatTemplate>
Return a reference to the configured ChatTemplate, if any.
Auto Trait Implementations
impl Freeze for NativeTokenizerBridge
impl RefUnwindSafe for NativeTokenizerBridge
impl Send for NativeTokenizerBridge
impl Sync for NativeTokenizerBridge
impl Unpin for NativeTokenizerBridge
impl UnsafeUnpin for NativeTokenizerBridge
impl UnwindSafe for NativeTokenizerBridge
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.