Crate gemini_tokenizer

§gemini-tokenizer

Authoritative Gemini tokenizer for Rust, ported from the official Google Python SDK (v1.6.20).

All Gemini models (gemini-2.0-flash, gemini-2.5-pro, gemini-3-pro-preview, etc.) use the same tokenizer: the Gemma 3 SentencePiece model with a vocabulary of 262,144 tokens. This crate embeds that model and provides a fast, local tokenizer that matches the official Google Python SDK’s behavior.

§Quick Start

use gemini_tokenizer::LocalTokenizer;

let tokenizer = LocalTokenizer::new("gemini-2.5-pro").expect("failed to load tokenizer");

// Count tokens in plain text
let result = tokenizer.count_tokens("What is your name?", None);
assert_eq!(result.total_tokens, 5);

// Get individual token details
let result = tokenizer.compute_tokens("Hello, world!");
for info in &result.tokens_info {
    for (id, token) in info.token_ids.iter().zip(&info.tokens) {
        println!("id={}, token={:?}", id, token);
    }
}
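
Since LocalTokenizer::new returns a Result, an unknown model name can be handled without panicking. A minimal sketch; it assumes only that TokenizerError implements Display:

use gemini_tokenizer::LocalTokenizer;

// Fall back gracefully instead of panicking on an unknown model name.
match LocalTokenizer::new("gemini-2.5-pro") {
    Ok(tokenizer) => {
        let result = tokenizer.count_tokens("What is your name?", None);
        println!("{} tokens", result.total_tokens);
    }
    // Assumes TokenizerError implements Display, as any std::error::Error type does.
    Err(e) => eprintln!("failed to load tokenizer: {e}"),
}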

§Structured Content

The tokenizer also counts tokens in structured Gemini API content objects, matching the Google Python SDK’s _TextsAccumulator logic:

use gemini_tokenizer::{LocalTokenizer, Content, Part};

let tokenizer = LocalTokenizer::new("gemini-2.5-pro").expect("failed to load tokenizer");

let contents = vec![Content {
    role: Some("user".to_string()),
    parts: Some(vec![Part {
        text: Some("What is the weather in NYC?".to_string()),
        ..Default::default()
    }]),
}];

let result = tokenizer.count_tokens(contents.as_slice(), None);
assert!(result.total_tokens > 0);
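
The same count_tokens call can also factor tool declarations into the count via CountTokensConfig. The following is a sketch, not a confirmed API surface: the field names on CountTokensConfig, Tool, and FunctionDeclaration are assumptions mirroring the Gemini API structures that the types module is documented as mirroring, and the config-passing convention is likewise assumed.

use gemini_tokenizer::{LocalTokenizer, CountTokensConfig, Tool, FunctionDeclaration};

let tokenizer = LocalTokenizer::new("gemini-2.5-pro").expect("failed to load tokenizer");

// Field names here are assumptions; check the types module for the exact shape.
let config = CountTokensConfig {
    tools: Some(vec![Tool {
        function_declarations: Some(vec![FunctionDeclaration {
            name: Some("get_weather".to_string()),
            description: Some("Look up the current weather for a city.".to_string()),
            ..Default::default()
        }]),
        ..Default::default()
    }]),
    ..Default::default()
};

// Whether the config is passed by reference or by value is also an assumption.
let result = tokenizer.count_tokens("What is the weather in NYC?", Some(&config));
assert!(result.total_tokens > 0);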

Re-exports§

pub use accumulator::TextAccumulator;
pub use types::*;

Modules§

accumulator
Text accumulation logic ported from the official Google Python SDK.
types
Lightweight types mirroring the Google Gemini API structures needed for token counting.

Structs§

LocalTokenizer
The local Gemini tokenizer.

Enums§

TokenizerError
Errors that can occur when creating or using the tokenizer.

Constants§

MODEL_SHA256
The expected SHA-256 hash of the embedded SentencePiece model.
VOCAB_SIZE
The expected vocabulary size of the Gemma 3 tokenizer.

Functions§

supported_models
Returns the list of supported Gemini model names.
verify_model_hash
Verifies that the embedded model’s SHA-256 hash matches the expected value.
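
Taken together, these items support a quick startup sanity check. The sketch below rests on assumed signatures: supported_models is taken to yield printable model names and verify_model_hash to return bool, neither of which is confirmed by the one-line descriptions above.

use gemini_tokenizer::{supported_models, verify_model_hash, MODEL_SHA256, VOCAB_SIZE};

// Enumerate the model names the crate recognizes.
for model in supported_models() {
    println!("supported: {model}");
}

// Confirm the embedded SentencePiece model matches the pinned constants.
// (verify_model_hash is assumed to return bool; it may return a Result instead.)
assert!(verify_model_hash());
assert_eq!(VOCAB_SIZE, 262_144);
println!("embedded model sha256: {:?}", MODEL_SHA256);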