Crate gemini_tokenizer

§gemini-tokenizer

Authoritative Gemini tokenizer for Rust, ported from the official Google Python SDK (v1.6.20).

All Gemini models (gemini-2.0-flash, gemini-2.5-pro, gemini-3-pro-preview, etc.) use the same tokenizer: the Gemma 3 SentencePiece model with a vocabulary of 262,144 tokens. This crate embeds that model and provides a fast, local tokenizer that matches the official Google Python SDK’s behavior.

§Quick Start

use gemini_tokenizer::LocalTokenizer;

let tokenizer = LocalTokenizer::new("gemini-2.5-pro").expect("failed to load tokenizer");

// Count tokens in plain text
let result = tokenizer.count_tokens("What is your name?", None);
assert_eq!(result.total_tokens, 5);

// Get individual token details
let result = tokenizer.compute_tokens("Hello, world!");
for info in &result.tokens_info {
    for (id, token) in info.token_ids.iter().zip(&info.tokens) {
        println!("id={}, token={:?}", id, token);
    }
}
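
Since LocalTokenizer::new returns a Result, an unknown model name can be handled without panicking. A minimal sketch; it assumes only that TokenizerError implements Display:

use gemini_tokenizer::LocalTokenizer;

// Fall back gracefully instead of panicking on an unknown model name.
match LocalTokenizer::new("gemini-2.5-pro") {
    Ok(tokenizer) => {
        let result = tokenizer.count_tokens("What is your name?", None);
        println!("{} tokens", result.total_tokens);
    }
    // Assumes TokenizerError implements Display, as any std::error::Error type does.
    Err(e) => eprintln!("failed to load tokenizer: {e}"),
}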

§Structured Content

The tokenizer also counts tokens in structured Gemini API content objects, matching the Google Python SDK’s _TextsAccumulator logic:

use gemini_tokenizer::{LocalTokenizer, Content, Part};

let tokenizer = LocalTokenizer::new("gemini-2.5-pro").expect("failed to load tokenizer");

let contents = vec![Content {
    role: Some("user".to_string()),
    parts: Some(vec![Part {
        text: Some("What is the weather in NYC?".to_string()),
        ..Default::default()
    }]),
}];

let result = tokenizer.count_tokens(contents.as_slice(), None);
assert!(result.total_tokens > 0);
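
The same count_tokens call can also factor tool declarations into the count via CountTokensConfig. The following is a sketch, not a confirmed API surface: the field names on CountTokensConfig, Tool, and FunctionDeclaration are assumptions mirroring the Gemini API structures that the types module is documented as mirroring, and the config-passing convention is likewise assumed.

use gemini_tokenizer::{LocalTokenizer, CountTokensConfig, Tool, FunctionDeclaration};

let tokenizer = LocalTokenizer::new("gemini-2.5-pro").expect("failed to load tokenizer");

// Field names here are assumptions; check the types module for the exact shape.
let config = CountTokensConfig {
    tools: Some(vec![Tool {
        function_declarations: Some(vec![FunctionDeclaration {
            name: Some("get_weather".to_string()),
            description: Some("Look up the current weather for a city.".to_string()),
            ..Default::default()
        }]),
        ..Default::default()
    }]),
    ..Default::default()
};

// Whether the config is passed by reference or by value is also an assumption.
let result = tokenizer.count_tokens("What is the weather in NYC?", Some(&config));
assert!(result.total_tokens > 0);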

Re-exports§

pub use accumulator::TextAccumulator;
pub use types::*;

Modules§

accumulator
Text accumulation logic ported from the official Google Python SDK.
types
Lightweight types mirroring the Google Gemini API structures needed for token counting.

Structs§

LocalTokenizer
The local Gemini tokenizer.

Enums§

TokenizerError
Errors that can occur when creating or using the tokenizer.

Constants§

MODEL_SHA256
The expected SHA-256 hash of the embedded SentencePiece model.
VOCAB_SIZE
The expected vocabulary size of the Gemma 3 tokenizer.

Functions§

supported_models
Returns the list of supported Gemini model names.
verify_model_hash
Verifies that the embedded model’s SHA-256 hash matches the expected value.
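
Taken together, these items support a quick startup sanity check. The sketch below rests on assumed signatures: supported_models is taken to yield printable model names and verify_model_hash to return bool, neither of which is confirmed by the one-line descriptions above.

use gemini_tokenizer::{supported_models, verify_model_hash, MODEL_SHA256, VOCAB_SIZE};

// Enumerate the model names the crate recognizes.
for model in supported_models() {
    println!("supported: {model}");
}

// Confirm the embedded SentencePiece model matches the pinned constants.
// (verify_model_hash is assumed to return bool; it may return a Result instead.)
assert!(verify_model_hash());
assert_eq!(VOCAB_SIZE, 262_144);
println!("embedded model sha256: {:?}", MODEL_SHA256);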