gemini-tokenizer

Community-maintained Gemini tokenizer for Rust, ported from the official Google Python GenAI SDK (v1.6.20).

Disclaimer: This is an unofficial community port. It is not maintained or supported by Google.

Overview

All Gemini models — including gemini-2.0-flash, gemini-2.5-pro, gemini-2.5-flash, gemini-3-pro-preview, and others — use the same tokenizer: the Gemma 3 SentencePiece model with a vocabulary of 262,144 tokens.
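
Because every supported model name resolves to that same underlying model, token counts from this crate are identical no matter which name you pass. A quick sketch, using the API shown under Usage below:

use gemini_tokenizer::LocalTokenizer;

fn main() {
    // Any two supported model names load the same Gemma 3 tokenizer,
    // so their token counts always agree.
    let flash = LocalTokenizer::new("gemini-2.0-flash").expect("failed to load tokenizer");
    let pro = LocalTokenizer::new("gemini-2.5-pro").expect("failed to load tokenizer");
    assert_eq!(
        flash.count_tokens("What is your name?", None).total_tokens,
        pro.count_tokens("What is your name?", None).total_tokens,
    );
}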

This crate embeds that model directly in the binary (via include_bytes!) and provides a fast, local tokenizer that produces identical token counts to the official Google Python SDK. No network access or external files are needed at runtime.
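
Conceptually, the embedding is the standard include_bytes! pattern sketched below; the constant name is hypothetical and this is not the crate's actual internal code:

// Hypothetical sketch of embedding a model file at compile time.
// The path is resolved relative to the source file containing this line.
static MODEL_BYTES: &[u8] = include_bytes!("gemma3_cleaned_262144_v2.spiece.model");

fn main() {
    // The SentencePiece model travels inside the compiled binary itself,
    // so nothing needs to be downloaded or located on disk at runtime.
    println!("embedded model: {} bytes", MODEL_BYTES.len());
}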

Features

  • Python SDK parity — API mirrors the Python SDK's LocalTokenizer interface for familiarity.
  • Embedded model — The SentencePiece model is compiled into your binary. No runtime downloads.
  • Faithful port — Token counting logic is a direct port of the Python SDK's _TextsAccumulator class from local_tokenizer.py.
  • Structured content — Count tokens in function calls, function responses, tool declarations, and schemas, not just plain text.
  • Minimal dependencies — Only sentencepiece, serde, serde_json, and sha2.

Usage

Add to your Cargo.toml:

[dependencies]
gemini-tokenizer = "0.2"

Count tokens in text

use gemini_tokenizer::LocalTokenizer;

let tokenizer = LocalTokenizer::new("gemini-2.5-pro")
    .expect("failed to load tokenizer");

let result = tokenizer.count_tokens("What is your name?", None);
assert_eq!(result.total_tokens, 5);
println!("{}", result); // total_tokens=5

Compute individual tokens

use gemini_tokenizer::LocalTokenizer;

let tokenizer = LocalTokenizer::new("gemini-2.5-pro")
    .expect("failed to load tokenizer");

let result = tokenizer.compute_tokens("Hello, world!");
for info in &result.tokens_info {
    for (id, token) in info.token_ids.iter().zip(&info.tokens) {
        println!("id={}, token={:?}", id, token);
    }
}

Structured content (function calls, tools, schemas)

use gemini_tokenizer::{LocalTokenizer, Content, Part, CountTokensConfig,
    Tool, FunctionDeclaration, Schema};
use std::collections::HashMap;

let tokenizer = LocalTokenizer::new("gemini-2.5-pro")
    .expect("failed to load tokenizer");

// Content with text
let contents = vec![Content {
    role: Some("user".to_string()),
    parts: Some(vec![Part {
        text: Some("What is the weather?".to_string()),
        ..Default::default()
    }]),
}];

// Tool definitions via CountTokensConfig
let config = CountTokensConfig {
    tools: Some(vec![Tool {
        function_declarations: Some(vec![FunctionDeclaration {
            name: Some("get_weather".to_string()),
            description: Some("Gets the current weather for a city".to_string()),
            parameters: Some(Schema {
                schema_type: Some("OBJECT".to_string()),
                properties: Some({
                    let mut props = HashMap::new();
                    props.insert("city".to_string(), Schema {
                        schema_type: Some("STRING".to_string()),
                        description: Some("The city name".to_string()),
                        ..Default::default()
                    });
                    props
                }),
                required: Some(vec!["city".to_string()]),
                ..Default::default()
            }),
            response: None,
        }]),
    }]),
    ..Default::default()
};

let result = tokenizer.count_tokens(contents.as_slice(), Some(&config));
println!("Total tokens: {}", result.total_tokens);

With system instruction

use gemini_tokenizer::{LocalTokenizer, Content, Part, CountTokensConfig};

let tokenizer = LocalTokenizer::new("gemini-2.5-pro")
    .expect("failed to load tokenizer");

let contents = vec![Content {
    role: Some("user".to_string()),
    parts: Some(vec![Part {
        text: Some("Hello".to_string()),
        ..Default::default()
    }]),
}];

let config = CountTokensConfig {
    system_instruction: Some(Content {
        role: Some("system".to_string()),
        parts: Some(vec![Part {
            text: Some("You are a helpful assistant.".to_string()),
            ..Default::default()
        }]),
    }),
    ..Default::default()
};

let result = tokenizer.count_tokens(contents.as_slice(), Some(&config));
println!("{}", result);

How token counting works

The tokenizer extracts countable text from structured objects following the exact same rules as the Google Python SDK's _TextsAccumulator:

  • Text parts: the text string itself
  • Function calls: function name + all arg keys + all string arg values (recursive)
  • Function responses: function name + all response keys + all string response values (recursive)
  • Tool declarations: function name + description + recursive schema traversal
  • Schemas: format, description, enum values, required fields, property keys, nested schemas

Numbers, booleans, and null values in function arguments are not counted (matching the Python SDK behavior).

Each extracted text segment is tokenized independently and the counts are summed.
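
To make these rules concrete, here is a small standalone sketch of the same extraction applied to a function call's arguments, written against serde_json (already a dependency). It illustrates the documented rule; it is not the crate's internal code:

use serde_json::{json, Value};

// Collect keys and string values recursively; skip numbers, booleans, null.
fn collect_countable(value: &Value, out: &mut Vec<String>) {
    match value {
        Value::String(s) => out.push(s.clone()),
        Value::Array(items) => {
            for item in items {
                collect_countable(item, out);
            }
        }
        Value::Object(map) => {
            for (key, item) in map {
                out.push(key.clone()); // argument keys are counted
                collect_countable(item, out);
            }
        }
        _ => {} // numbers, booleans, and null contribute nothing
    }
}

fn main() {
    // For get_weather(city: "Paris", days: 3), the counted segments are the
    // function name, both keys, and the string "Paris"; the integer 3 is not.
    let args = json!({ "city": "Paris", "days": 3 });
    let mut segments = vec!["get_weather".to_string()];
    collect_countable(&args, &mut segments);
    assert_eq!(segments, ["get_weather", "city", "Paris", "days"]);
}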

Supported models

All supported models use the same underlying Gemma 3 tokenizer. The model name is validated against the same list used by the Python SDK:

  • gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
  • gemini-2.0-flash, gemini-2.0-flash-lite
  • gemini-2.5-pro-preview-06-05, gemini-2.5-pro-preview-05-06, gemini-2.5-pro-exp-03-25
  • gemini-live-2.5-flash
  • gemini-2.5-flash-preview-05-20, gemini-2.5-flash-preview-04-17
  • gemini-2.5-flash-lite-preview-06-17
  • gemini-2.0-flash-001, gemini-2.0-flash-lite-001
  • gemini-3-pro-preview

Use gemini_tokenizer::supported_models() to get the full list programmatically.
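
For example, assuming the function returns an iterable collection of model-name strings:

use gemini_tokenizer::supported_models;

fn main() {
    // Print every model name that LocalTokenizer::new will accept.
    for model in supported_models() {
        println!("{model}");
    }
}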

Provenance and attribution

This crate is a Rust port of the tokenization logic from the official Google Python GenAI SDK (v1.6.20), specifically:

  • Text accumulation: the _TextsAccumulator class from google/genai/local_tokenizer.py
  • Model mapping: _GEMINI_MODELS_TO_TOKENIZER_NAMES and _GEMINI_STABLE_MODELS_TO_TOKENIZER_NAMES from google/genai/_local_tokenizer_loader.py
  • Token-to-bytes conversion: _token_str_to_bytes and _parse_hex_byte from google/genai/local_tokenizer.py

The embedded SentencePiece model file (gemma3_cleaned_262144_v2.spiece.model) is from google/gemma_pytorch at commit 014acb7ac4563a5f77c76d7ff98f31b568c16508, with SHA-256 hash:

1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
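
If you want to verify a downloaded copy of the model file against this digest yourself, a minimal sketch using the sha2 crate (the path to the local copy is illustrative):

use sha2::{Digest, Sha256};

fn main() -> std::io::Result<()> {
    // Hash the local file and compare it to the published SHA-256 digest.
    let bytes = std::fs::read("gemma3_cleaned_262144_v2.spiece.model")?;
    let digest = Sha256::digest(&bytes);
    let hex: String = digest.iter().map(|b| format!("{b:02x}")).collect();
    assert_eq!(hex, "1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c");
    println!("model hash verified");
    Ok(())
}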

Both upstream projects are licensed under the Apache License, Version 2.0. See the NOTICE file for full attribution details.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

This crate contains code derived from googleapis/python-genai (Copyright 2025 Google LLC, Apache-2.0) and embeds a tokenizer model from google/gemma_pytorch (Copyright 2024 Google LLC, Apache-2.0).