# gemini-tokenizer
Community-maintained Gemini tokenizer for Rust, ported from the official Google Python GenAI SDK (v1.6.20).
> **Disclaimer:** This is an unofficial community port. It is not maintained or supported by Google.
## Overview
All Gemini models — including `gemini-2.0-flash`, `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-3-pro-preview`, and others — use the same tokenizer: the Gemma 3 SentencePiece model with a vocabulary of 262,144 tokens.
This crate embeds that model directly in the binary (via `include_bytes!`) and
provides a fast, local tokenizer that produces identical token counts to the
official Google Python SDK. No network access or external files are needed at
runtime.
## Features
- **Python SDK parity** — API mirrors the Python SDK's `LocalTokenizer` interface for familiarity.
- **Embedded model** — The SentencePiece model is compiled into your binary. No runtime downloads.
- **Faithful port** — Token counting logic is a direct port of the Python SDK's `_TextsAccumulator` class from `local_tokenizer.py`.
- **Structured content** — Count tokens in function calls, function responses, tool declarations, and schemas, not just plain text.
- **Minimal dependencies** — Only `sentencepiece`, `serde`, `serde_json`, and `sha2`.
## Usage
Add to your `Cargo.toml`:

```toml
[dependencies]
gemini-tokenizer = "0.2"
```
### Count tokens in text

```rust
use gemini_tokenizer::LocalTokenizer;

// Any supported Gemini model name selects the same underlying tokenizer.
let tokenizer = LocalTokenizer::new("gemini-2.5-flash")
    .expect("failed to load tokenizer");

let result = tokenizer.count_tokens("what is your name?", None);
assert_eq!(result.total_tokens, 5);
println!("total_tokens={}", result.total_tokens); // total_tokens=5
```
### Compute individual tokens

```rust
use gemini_tokenizer::LocalTokenizer;

let tokenizer = LocalTokenizer::new("gemini-2.5-flash")
    .expect("failed to load tokenizer");

let result = tokenizer.compute_tokens("hello world");
for info in &result.tokens_info {
    println!("{:?}", info);
}
```
### Structured content (function calls, tools, schemas)

```rust
use gemini_tokenizer::{Content, CountTokensConfig, FunctionDeclaration, LocalTokenizer, Tool};

let tokenizer = LocalTokenizer::new("gemini-2.5-flash")
    .expect("failed to load tokenizer");

// Content with text
let contents = vec![Content::from_text("what is the weather in Paris?")];

// Tool definitions via CountTokensConfig
let config = CountTokensConfig {
    tools: Some(vec![Tool {
        function_declarations: Some(vec![FunctionDeclaration {
            name: Some("get_weather".into()),
            description: Some("Get the current weather for a city.".into()),
            ..Default::default()
        }]),
        ..Default::default()
    }]),
    ..Default::default()
};

let result = tokenizer.count_tokens(&contents, Some(&config));
println!("total_tokens={}", result.total_tokens);
```
### With system instruction

```rust
use gemini_tokenizer::{Content, CountTokensConfig, LocalTokenizer};

let tokenizer = LocalTokenizer::new("gemini-2.5-flash")
    .expect("failed to load tokenizer");

let contents = vec![Content::from_text("what is your name?")];
let config = CountTokensConfig {
    system_instruction: Some(Content::from_text("You are a helpful assistant.")),
    ..Default::default()
};

let result = tokenizer.count_tokens(&contents, Some(&config));
println!("total_tokens={}", result.total_tokens);
```
## How token counting works
The tokenizer extracts countable text from structured objects following exactly
the same rules as the Google Python SDK's `_TextsAccumulator`:
| Content type | What gets counted |
|---|---|
| Text parts | The text string itself |
| Function calls | Function name + all arg keys + all string arg values (recursive) |
| Function responses | Function name + all response keys + all string response values (recursive) |
| Tool declarations | Function name + description + recursive schema traversal |
| Schemas | Format, description, enum values, required fields, property keys, nested schemas |
Numbers, booleans, and null values in function arguments are not counted (matching the Python SDK behavior).
Each extracted text segment is tokenized independently and the counts are summed.
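The extraction rule can be sketched in plain Rust. This is a hypothetical illustration, not the crate's actual code: a minimal JSON-like `Value` enum stands in for the real argument representation, and `collect_texts` mirrors the rule that keys and string values are gathered while numbers, booleans, and null are skipped.

```rust
// Minimal JSON-like value type standing in for the real argument representation.
enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Null,
    Object(Vec<(String, Value)>),
}

// Recursively collect countable text: every key, plus string values only.
fn collect_texts(value: &Value, out: &mut Vec<String>) {
    match value {
        Value::Str(s) => out.push(s.clone()),
        Value::Object(fields) => {
            for (key, val) in fields {
                out.push(key.clone()); // arg keys are counted
                collect_texts(val, out); // recurse into nested values
            }
        }
        // Numbers, booleans, and null contribute no text.
        Value::Num(_) | Value::Bool(_) | Value::Null => {}
    }
}

fn main() {
    let args = Value::Object(vec![
        ("city".into(), Value::Str("Paris".into())),
        ("days".into(), Value::Num(3.0)),
        ("details".into(), Value::Object(vec![
            ("units".into(), Value::Str("metric".into())),
        ])),
    ]);

    let mut texts = vec!["get_weather".to_string()]; // function name is counted too
    collect_texts(&args, &mut texts);
    println!("{:?}", texts);
    // ["get_weather", "city", "Paris", "days", "details", "units", "metric"]
}
```

Each string collected this way would then be tokenized independently and the per-segment counts summed, which is why key order and nesting depth do not affect the total.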
## Supported models
All supported models use the same underlying Gemma 3 tokenizer. The model name is validated against the same list used by the Python SDK:
- `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`
- `gemini-2.0-flash`, `gemini-2.0-flash-lite`
- `gemini-2.5-pro-preview-06-05`, `gemini-2.5-pro-preview-05-06`, `gemini-2.5-pro-exp-03-25`
- `gemini-live-2.5-flash`
- `gemini-2.5-flash-preview-05-20`, `gemini-2.5-flash-preview-04-17`
- `gemini-2.5-flash-lite-preview-06-17`
- `gemini-2.0-flash-001`, `gemini-2.0-flash-lite-001`
- `gemini-3-pro-preview`
Use `gemini_tokenizer::supported_models()` to get the full list programmatically.
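A minimal sketch of what validating against such a list looks like, assuming a simple membership check (the constant and function below are illustrative, not the crate's internals):

```rust
// Hypothetical sketch: model names are checked against a fixed allow-list
// before the tokenizer is constructed. The list here is abbreviated.
const SUPPORTED_MODELS: &[&str] = &[
    "gemini-2.5-pro",
    "gemini-2.5-flash",
    "gemini-3-pro-preview",
];

fn validate_model(name: &str) -> Result<(), String> {
    if SUPPORTED_MODELS.contains(&name) {
        Ok(())
    } else {
        Err(format!("unsupported model: {name}"))
    }
}

fn main() {
    assert!(validate_model("gemini-2.5-flash").is_ok());
    assert!(validate_model("unknown-model").is_err());
}
```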
## Provenance and attribution
This crate is a Rust port of the tokenization logic from the official Google Python GenAI SDK (v1.6.20), specifically:

- **Text accumulation** — the `_TextsAccumulator` class from `google/genai/local_tokenizer.py`
- **Model mapping** — `_GEMINI_MODELS_TO_TOKENIZER_NAMES` and `_GEMINI_STABLE_MODELS_TO_TOKENIZER_NAMES` from `google/genai/_local_tokenizer_loader.py`
- **Token-to-bytes conversion** — `_token_str_to_bytes` and `_parse_hex_byte` from `google/genai/local_tokenizer.py`
The embedded SentencePiece model file (`gemma3_cleaned_262144_v2.spiece.model`) is from
google/gemma_pytorch at commit
`014acb7ac4563a5f77c76d7ff98f31b568c16508`, with SHA-256 hash
`1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c`.
Both upstream projects are licensed under the Apache License, Version 2.0. See the NOTICE file for full attribution details.
## License
Licensed under the Apache License, Version 2.0. See LICENSE for details.
This crate contains code derived from googleapis/python-genai (Copyright 2025 Google LLC, Apache-2.0) and embeds a tokenizer model from google/gemma_pytorch (Copyright 2024 Google LLC, Apache-2.0).