Tokenizer infrastructure for the Hydra model.
Provides a unified trait for tokenization with multiple backend implementations:
- Llama3Tokenizer: HuggingFace Tokenizers format (Llama 3, Mistral, etc.)
- TiktokenTokenizer: OpenAI tiktoken format (cl100k, o200k)
- FallbackTokenizer: simple byte-level fallback
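To make the fallback behavior concrete, here is a minimal self-contained sketch of a byte-level fallback tokenizer: each UTF-8 byte maps directly to a token ID, giving a fixed 256-entry vocabulary. The name `ByteFallback` and these signatures are illustrative, not the crate's actual API.

```rust
// Illustrative byte-level fallback tokenizer (not the crate's actual type).
struct ByteFallback;

impl ByteFallback {
    /// Encode text by mapping each UTF-8 byte to its own token ID.
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }

    /// Decode token IDs back to text; IDs above 255 or invalid UTF-8 yield None.
    fn decode(&self, tokens: &[u32]) -> Option<String> {
        let bytes: Option<Vec<u8>> =
            tokens.iter().map(|&t| u8::try_from(t).ok()).collect();
        String::from_utf8(bytes?).ok()
    }

    /// One token per possible byte value.
    fn vocab_size(&self) -> usize {
        256
    }
}

fn main() {
    let tok = ByteFallback;
    let ids = tok.encode("Hi");
    assert_eq!(ids, vec![72, 105]); // ASCII bytes for 'H' and 'i'
    assert_eq!(tok.decode(&ids), Some("Hi".to_string()));
    assert_eq!(tok.vocab_size(), 256);
}
```

A byte-level scheme like this can never fail on unknown input, which is what makes it a safe last-resort backend when no trained tokenizer file is available.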
§Example
```rust
use m2m::inference::{HydraTokenizer, Llama3Tokenizer, TokenizerType};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the Llama 3 tokenizer from file
    let tokenizer = Llama3Tokenizer::from_file("./models/hydra/tokenizer.json")?;

    // Encode text to token IDs
    let tokens = tokenizer.encode("Hello, world!")?;

    // Get the vocab size (128K for Llama 3)
    assert_eq!(tokenizer.vocab_size(), 128000);
    Ok(())
}
```

§Structs
- FallbackTokenizer - Fallback byte-level tokenizer.
- HydraByteTokenizer - Byte-level tokenizer that matches Hydra's training tokenizer.
- Llama3Tokenizer - Llama 3 tokenizer using the HuggingFace Tokenizers library.
- TiktokenTokenizer - OpenAI tiktoken-based tokenizer.
§Enums
- TokenizerType - Tokenizer type identifier.
§Constants
- MAX_SEQUENCE_LENGTH - Maximum sequence length for Hydra input.
§Traits
- HydraTokenizer - Trait for tokenizers used by the Hydra model.
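A unified trait lets callers swap backends without changing call sites. The sketch below shows the general shape of such a trait; the real HydraTokenizer trait's exact method set and error type are assumptions, with only `encode` and `vocab_size` attested by the example above.

```rust
// Illustrative tokenizer trait (method set and error type are assumptions).
trait Tokenizer {
    /// Encode text into token IDs, or return an error message.
    fn encode(&self, text: &str) -> Result<Vec<u32>, String>;
    /// Total number of distinct token IDs.
    fn vocab_size(&self) -> usize;
}

/// Toy byte-level implementation used to exercise the trait.
struct ByteTok;

impl Tokenizer for ByteTok {
    fn encode(&self, text: &str) -> Result<Vec<u32>, String> {
        Ok(text.bytes().map(u32::from).collect())
    }
    fn vocab_size(&self) -> usize {
        256
    }
}

fn main() {
    let t = ByteTok;
    assert_eq!(t.encode("ab").unwrap(), vec![97, 98]); // ASCII 'a', 'b'
    assert_eq!(t.vocab_size(), 256);
}
```

Any of the concrete backends listed under Structs can then be passed wherever the trait bound is accepted.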
§Functions
- boxed - Create a boxed tokenizer from a specific implementation.
- load_tokenizer - Load the best available tokenizer for Hydra.
- load_tokenizer_by_type - Load a tokenizer by type.
§Type Aliases
- BoxedTokenizer - Type-erased tokenizer for storing different tokenizer implementations.
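Type erasure through a boxed trait object is the standard Rust pattern behind an alias like BoxedTokenizer. The following is a minimal sketch of that pattern with illustrative names; the actual trait and alias definitions in the crate may differ.

```rust
// Illustrative type-erasure pattern (names are not the crate's actual API).
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
}

struct ByteTok;

impl Tokenizer for ByteTok {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
}

/// Type-erased alias: any implementation can be stored behind one type.
type BoxedTok = Box<dyn Tokenizer>;

/// Box a concrete tokenizer, erasing its concrete type.
fn boxed<T: Tokenizer + 'static>(t: T) -> BoxedTok {
    Box::new(t)
}

fn main() {
    let tok: BoxedTok = boxed(ByteTok);
    assert_eq!(tok.encode("A"), vec![65]); // ASCII 'A'
}
```

This is what lets a struct hold "whatever tokenizer load_tokenizer picked" in a single field, at the cost of dynamic dispatch on each call.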