Module tokenizer

Tokenizer infrastructure for Hydra model.

Provides a unified trait for tokenization, with multiple backend implementations.

§Example

use m2m::inference::{HydraTokenizer, Llama3Tokenizer, TokenizerType};

// Load Llama 3 tokenizer from file
let tokenizer = Llama3Tokenizer::from_file("./models/hydra/tokenizer.json")?;

// Encode text to token IDs
let tokens = tokenizer.encode("Hello, world!")?;

// Get vocab size (128K for Llama 3)
assert_eq!(tokenizer.vocab_size(), 128000);

Structs§

FallbackTokenizer
Fallback byte-level tokenizer.
HydraByteTokenizer
Byte-level tokenizer that matches Hydra’s training tokenizer.
Llama3Tokenizer
Llama 3 tokenizer using HuggingFace Tokenizers library.
TiktokenTokenizer
OpenAI tiktoken-based tokenizer.
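The byte-level backends above (HydraByteTokenizer, FallbackTokenizer) can be sketched in a few lines: each UTF-8 byte maps to one token ID, so the effective vocabulary has exactly 256 entries. All names in this sketch are illustrative, not the crate's actual API.

```rust
/// Minimal byte-level tokenizer sketch (hypothetical, not the crate's type).
struct ByteTokenizer;

impl ByteTokenizer {
    /// One token per byte: the token ID equals the byte value (0..=255).
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }

    /// Reassemble the bytes; any invalid UTF-8 is replaced with U+FFFD.
    fn decode(&self, tokens: &[u32]) -> String {
        let bytes: Vec<u8> = tokens.iter().map(|&t| t as u8).collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }

    /// Exactly one entry per possible byte value.
    fn vocab_size(&self) -> usize {
        256
    }
}
```

The appeal of a byte-level fallback is that it can never fail: every string round-trips through encode/decode without needing a vocabulary file on disk.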

Enums§

TokenizerType
Tokenizer type identifier.

Constants§

MAX_SEQUENCE_LENGTH
Maximum sequence length for Hydra input.
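A constant like this is typically used to clamp encoded sequences before they reach the model. The page does not show the constant's actual value, so the `2048` below is only a placeholder for the sketch, and `clamp_to_max` is a hypothetical helper.

```rust
/// Placeholder value; the real MAX_SEQUENCE_LENGTH is defined by the crate.
const MAX_SEQUENCE_LENGTH: usize = 2048;

/// Clamp a token sequence to the model's maximum input length.
fn clamp_to_max(mut tokens: Vec<u32>) -> Vec<u32> {
    tokens.truncate(MAX_SEQUENCE_LENGTH);
    tokens
}
```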

Traits§

HydraTokenizer
Trait for tokenizers used by Hydra model.

Functions§

boxed
Create a boxed tokenizer from a specific implementation.
load_tokenizer
Load the best available tokenizer for Hydra.
load_tokenizer_by_type
Load tokenizer by type.

Type Aliases§

BoxedTokenizer
Type-erased tokenizer for storing different tokenizer implementations.
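The interplay of `load_tokenizer` and `BoxedTokenizer` can be sketched as a fallback chain: prefer a tokenizer file on disk, otherwise fall back to a byte-level tokenizer, and hide the concrete backend behind a boxed trait object. The trait, struct, and function bodies below are illustrative assumptions, not the crate's implementation.

```rust
use std::path::Path;

/// Stand-in for the crate's tokenizer trait (hypothetical signatures).
trait Tokenizer {
    fn vocab_size(&self) -> usize;
}

/// Stands in for a file-backed tokenizer (e.g. a parsed tokenizer.json).
struct FileTokenizer {
    vocab: usize,
}

/// Stands in for the 256-entry byte-level fallback.
struct ByteFallback;

impl Tokenizer for FileTokenizer {
    fn vocab_size(&self) -> usize {
        self.vocab
    }
}

impl Tokenizer for ByteFallback {
    fn vocab_size(&self) -> usize {
        256
    }
}

/// Type-erased tokenizer, so callers need not know which backend loaded.
type BoxedTokenizer = Box<dyn Tokenizer>;

fn load_tokenizer(path: &str) -> BoxedTokenizer {
    if Path::new(path).exists() {
        // A real loader would parse the file; here we only model the
        // order in which backends are tried.
        Box::new(FileTokenizer { vocab: 128_000 })
    } else {
        Box::new(ByteFallback)
    }
}
```

Boxing behind `dyn Tokenizer` is what lets callers hold "whichever tokenizer happened to load" in one field, at the cost of dynamic dispatch on each call.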