Tokenizer infrastructure for the Hydra model.
Provides a unified trait for tokenization with multiple backend implementations:
- Llama3Tokenizer: HuggingFace Tokenizers format (Llama 3, Mistral, etc.)
- TiktokenTokenizer: OpenAI tiktoken format (cl100k, o200k)
- FallbackTokenizer: simple byte-level fallback
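To make the fallback behavior concrete, here is a minimal self-contained sketch of a byte-level fallback tokenizer: each UTF-8 byte maps directly to a token ID, giving a fixed 256-entry vocabulary. The name `ByteFallback` and these signatures are illustrative, not the crate's actual API.

```rust
// Illustrative byte-level fallback tokenizer (not the crate's actual type).
struct ByteFallback;

impl ByteFallback {
    /// Encode text by mapping each UTF-8 byte to its own token ID.
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }

    /// Decode token IDs back to text; IDs above 255 or invalid UTF-8 yield None.
    fn decode(&self, tokens: &[u32]) -> Option<String> {
        let bytes: Option<Vec<u8>> =
            tokens.iter().map(|&t| u8::try_from(t).ok()).collect();
        String::from_utf8(bytes?).ok()
    }

    /// One token per possible byte value.
    fn vocab_size(&self) -> usize {
        256
    }
}

fn main() {
    let tok = ByteFallback;
    let ids = tok.encode("Hi");
    assert_eq!(ids, vec![72, 105]); // ASCII bytes for 'H' and 'i'
    assert_eq!(tok.decode(&ids), Some("Hi".to_string()));
    assert_eq!(tok.vocab_size(), 256);
}
```

A byte-level scheme like this can never fail on unknown input, which is what makes it a safe last-resort backend when no trained tokenizer file is available.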
§Example
```rust
use m2m::inference::{HydraTokenizer, Llama3Tokenizer, TokenizerType};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the Llama 3 tokenizer from file
    let tokenizer = Llama3Tokenizer::from_file("./models/hydra/tokenizer.json")?;

    // Encode text to token IDs
    let tokens = tokenizer.encode("Hello, world!")?;

    // Get the vocab size (128K for Llama 3)
    assert_eq!(tokenizer.vocab_size(), 128000);
    Ok(())
}
```

§Structs
- FallbackTokenizer - Fallback byte-level tokenizer.
- HydraByteTokenizer - Byte-level tokenizer that matches Hydra's training tokenizer.
- Llama3Tokenizer - Llama 3 tokenizer using the HuggingFace Tokenizers library.
- TiktokenTokenizer - OpenAI tiktoken-based tokenizer.
§Enums
- TokenizerType - Tokenizer type identifier.
§Constants
- MAX_SEQUENCE_LENGTH - Maximum sequence length for Hydra input.
§Traits
- HydraTokenizer - Trait for tokenizers used by the Hydra model.
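A unified trait lets callers swap backends without changing call sites. The sketch below shows the general shape of such a trait; the real HydraTokenizer trait's exact method set and error type are assumptions, with only `encode` and `vocab_size` attested by the example above.

```rust
// Illustrative tokenizer trait (method set and error type are assumptions).
trait Tokenizer {
    /// Encode text into token IDs, or return an error message.
    fn encode(&self, text: &str) -> Result<Vec<u32>, String>;
    /// Total number of distinct token IDs.
    fn vocab_size(&self) -> usize;
}

/// Toy byte-level implementation used to exercise the trait.
struct ByteTok;

impl Tokenizer for ByteTok {
    fn encode(&self, text: &str) -> Result<Vec<u32>, String> {
        Ok(text.bytes().map(u32::from).collect())
    }
    fn vocab_size(&self) -> usize {
        256
    }
}

fn main() {
    let t = ByteTok;
    assert_eq!(t.encode("ab").unwrap(), vec![97, 98]); // ASCII 'a', 'b'
    assert_eq!(t.vocab_size(), 256);
}
```

Any of the concrete backends listed under Structs can then be passed wherever the trait bound is accepted.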
§Functions
- boxed - Create a boxed tokenizer from a specific implementation.
- load_tokenizer - Load the best available tokenizer for Hydra.
- load_tokenizer_by_type - Load a tokenizer by type.
§Type Aliases
- BoxedTokenizer - Type-erased tokenizer for storing different tokenizer implementations.
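Type erasure through a boxed trait object is the standard Rust pattern behind an alias like BoxedTokenizer. The following is a minimal sketch of that pattern with illustrative names; the actual trait and alias definitions in the crate may differ.

```rust
// Illustrative type-erasure pattern (names are not the crate's actual API).
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
}

struct ByteTok;

impl Tokenizer for ByteTok {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
}

/// Type-erased alias: any implementation can be stored behind one type.
type BoxedTok = Box<dyn Tokenizer>;

/// Box a concrete tokenizer, erasing its concrete type.
fn boxed<T: Tokenizer + 'static>(t: T) -> BoxedTok {
    Box::new(t)
}

fn main() {
    let tok: BoxedTok = boxed(ByteTok);
    assert_eq!(tok.encode("A"), vec![65]); // ASCII 'A'
}
```

This is what lets a struct hold "whatever tokenizer load_tokenizer picked" in a single field, at the cost of dynamic dispatch on each call.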