Module tokenizers

Tokenizer implementations for various LLM models

This module provides the core tokenization functionality for supported LLM models.

§Architecture

The tokenization system uses a trait-based design for extensibility: model-specific backends (see the claude and openai modules) implement a common Tokenizer trait and are looked up through a central ModelRegistry.
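As a rough illustration of the trait-based design, the following standalone sketch pairs a ModelInfo struct with a Tokenizer trait and a toy whitespace-splitting backend. The WhitespaceTokenizer type and the exact method signatures here are hypothetical stand-ins, not the crate's real API; real backends use proper BPE encoders.

```rust
// Hypothetical sketch of the trait-based design; the crate's actual
// signatures and error types may differ.
#[derive(Debug, PartialEq)]
struct ModelInfo {
    name: String,
    encoding: String,
}

trait Tokenizer {
    // Count the tokens in `text`, or return an error message.
    fn count_tokens(&self, text: &str) -> Result<usize, String>;
    // Describe the model this tokenizer serves.
    fn get_model_info(&self) -> ModelInfo;
}

// Toy backend: whitespace word count stands in for real BPE tokenization.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn count_tokens(&self, text: &str) -> Result<usize, String> {
        Ok(text.split_whitespace().count())
    }
    fn get_model_info(&self) -> ModelInfo {
        ModelInfo {
            name: "whitespace".into(),
            encoding: "none".into(),
        }
    }
}

fn main() {
    let tok = WhitespaceTokenizer;
    assert_eq!(tok.count_tokens("Hello world").unwrap(), 2);
    assert_eq!(tok.get_model_info().encoding, "none");
    println!("{}", tok.count_tokens("Hello world").unwrap());
}
```

Because callers only depend on the trait, a new model backend can be added without touching existing code.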

§Example

use token_count::tokenizers::registry::ModelRegistry;

// Get the global model registry
let registry = ModelRegistry::global();

// Get a tokenizer for a specific model
let tokenizer = registry.get_tokenizer("gpt-4", false).unwrap();

// Count tokens
let count = tokenizer.count_tokens("Hello world").unwrap();
assert_eq!(count, 2);

// Get model information
let info = tokenizer.get_model_info();
assert_eq!(info.name, "gpt-4");
assert_eq!(info.encoding, "cl100k_base");

§Supported Models

Currently supports:

  • OpenAI models: GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, GPT-4o
  • Claude models: Claude 4.0-4.6 (Opus, Sonnet, Haiku variants)

See registry::ModelRegistry for model configuration and aliases.
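To show how alias handling in a model registry can work, here is a minimal self-contained sketch. The alias names and the resolve method are illustrative assumptions; consult registry::ModelRegistry for the crate's actual configuration.

```rust
use std::collections::HashMap;

// Hypothetical sketch of alias resolution; the real ModelRegistry
// API and alias table may differ.
struct ModelRegistry {
    // Maps an alias to its canonical model name.
    aliases: HashMap<&'static str, &'static str>,
}

impl ModelRegistry {
    fn new() -> Self {
        let mut aliases = HashMap::new();
        // Example entries only; not the crate's real alias table.
        aliases.insert("gpt-3.5", "gpt-3.5-turbo");
        aliases.insert("gpt4", "gpt-4");
        Self { aliases }
    }

    // Resolve an alias to its canonical name; unknown names pass through.
    fn resolve(&self, name: &str) -> String {
        self.aliases
            .get(name)
            .map(|s| s.to_string())
            .unwrap_or_else(|| name.to_string())
    }
}

fn main() {
    let registry = ModelRegistry::new();
    assert_eq!(registry.resolve("gpt4"), "gpt-4");
    assert_eq!(registry.resolve("gpt-4o"), "gpt-4o");
    println!("{}", registry.resolve("gpt-3.5"));
}
```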

Modules§

claude
Tokenizer implementation for Anthropic Claude models
openai
OpenAI tokenization using tiktoken-rs
registry
Model registry for managing supported models

Structs§

ModelInfo
Information about a tokenization model
TokenizationResult
Result of a tokenization operation

Enums§

TokenCount
Result of token counting, indicating whether the count is estimated or exact
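An enum like TokenCount typically distinguishes exact counts (from a real encoder) from estimates (e.g. heuristics for models without a public tokenizer). The variant names and helper methods below are a hypothetical sketch of that shape, not the crate's actual definition.

```rust
// Hypothetical shape of an exact-vs-estimated token count;
// the crate's real TokenCount may differ.
#[derive(Debug, PartialEq)]
enum TokenCount {
    Exact(usize),
    Estimated(usize),
}

impl TokenCount {
    // The numeric count, regardless of precision.
    fn value(&self) -> usize {
        match self {
            TokenCount::Exact(n) | TokenCount::Estimated(n) => *n,
        }
    }
    // Whether the count came from a real encoder rather than a heuristic.
    fn is_exact(&self) -> bool {
        matches!(self, TokenCount::Exact(_))
    }
}

fn main() {
    let count = TokenCount::Estimated(128);
    assert_eq!(count.value(), 128);
    assert!(!count.is_exact());
    println!("{}", count.value());
}
```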

Traits§

Tokenizer
Trait for tokenizing text with a specific model