pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub model: String,
    pub provider: String,
    pub cached_input_tokens: u64,
    pub cache_write_input_tokens: u64,
    pub cache_write_5m_input_tokens: u64,
    pub cache_write_1h_input_tokens: u64,
    pub stop_reason: Option<String>,
    pub raw_stop_reason: Option<String>,
    pub reasoning_tokens: u64,
}
Token usage from a single LLM call.
Normalized superset convention
input_tokens is the total number of input tokens processed; it is a
superset of cached_input_tokens and cache_write_input_tokens.
Downstream cost logic derives the “fresh” (non-cached) portion by
subtracting both cache counts from input_tokens. This matches OpenAI's
and Gemini's native reporting. Anthropic's API reports the three groups
as disjoint, so the Anthropic parser normalizes them by summing before
assigning input_tokens.
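The superset arithmetic can be sketched as follows. This is a minimal illustration, assuming a pared-down struct that mirrors only the three counts involved; CacheCounts and fresh_input_tokens are hypothetical names, not part of this crate.

```rust
/// Hypothetical mirror of the three TokenUsage fields that the superset
/// convention relates. Not the crate's real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CacheCounts {
    /// Total input tokens processed (superset of the two cache counts).
    pub input_tokens: u64,
    /// Cache-read tokens.
    pub cached_input_tokens: u64,
    /// Cache-write / creation tokens.
    pub cache_write_input_tokens: u64,
}

impl CacheCounts {
    /// Fresh (non-cached) input: the superset total minus both cache counts.
    /// Saturating subtraction keeps a malformed record (cache counts larger
    /// than the total) from panicking on underflow; it yields 0 instead.
    pub fn fresh_input_tokens(&self) -> u64 {
        self.input_tokens
            .saturating_sub(self.cached_input_tokens)
            .saturating_sub(self.cache_write_input_tokens)
    }
}
```

With 1,000 input tokens, 600 cache reads, and 100 cache writes, the fresh portion is 300. Clamping to 0 is a defensive choice for the sketch; the crate's own cost code may treat that case differently.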
Prompt-caching semantics per provider
- OpenAI: cached_input_tokens counts cache READS (billed at a discount, typically 0.1x base input). Cache writes are free per OpenAI's caching docs, so cache_write_input_tokens is always 0.
- Anthropic: the API returns three token groups, input_tokens (fresh, 1x), cache_read_input_tokens (0.1x), and cache_creation_input_tokens (1.25x at the default 5-minute TTL, 2.0x at the 1-hour TTL). The parser remaps these to the superset convention above. The per-TTL split is surfaced as cache_write_5m_input_tokens and cache_write_1h_input_tokens (parsed from usage.cache_creation.ephemeral_5m_input_tokens / ephemeral_1h_input_tokens); these two fields sum to cache_write_input_tokens. Akribes workflows opt into the 1h TTL via the extended-cache-ttl-2025-04-11 beta header, so this split matters for cost accounting (#1091).
- Gemini: only cache reads are reported; writes are not separately billed, so cache_write_input_tokens is always 0.
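A sketch of the Anthropic remapping, assuming the disjoint counts have already been read out of the response JSON (parsing omitted). AnthropicRawUsage, NormalizedCacheUsage, and normalize_anthropic are hypothetical names; the field names on AnthropicRawUsage follow the API fields cited above.

```rust
/// Hypothetical holder for Anthropic's disjoint usage counts, already
/// extracted from the response body.
#[derive(Debug, Clone, Copy, Default)]
pub struct AnthropicRawUsage {
    pub input_tokens: u64,                // fresh, billed at 1x
    pub cache_read_input_tokens: u64,     // billed at 0.1x
    pub cache_creation_input_tokens: u64, // billed at 1.25x (5m) or 2.0x (1h)
    pub ephemeral_5m_input_tokens: u64,   // usage.cache_creation.ephemeral_5m_input_tokens
    pub ephemeral_1h_input_tokens: u64,   // usage.cache_creation.ephemeral_1h_input_tokens
}

/// The cache-related subset of TokenUsage, in the superset convention.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NormalizedCacheUsage {
    pub input_tokens: u64,
    pub cached_input_tokens: u64,
    pub cache_write_input_tokens: u64,
    pub cache_write_5m_input_tokens: u64,
    pub cache_write_1h_input_tokens: u64,
}

/// Sum the three disjoint groups into a superset total, and carry the
/// per-TTL write split through unchanged.
pub fn normalize_anthropic(raw: AnthropicRawUsage) -> NormalizedCacheUsage {
    NormalizedCacheUsage {
        input_tokens: raw.input_tokens
            + raw.cache_read_input_tokens
            + raw.cache_creation_input_tokens,
        cached_input_tokens: raw.cache_read_input_tokens,
        cache_write_input_tokens: raw.cache_creation_input_tokens,
        cache_write_5m_input_tokens: raw.ephemeral_5m_input_tokens,
        cache_write_1h_input_tokens: raw.ephemeral_1h_input_tokens,
    }
}
```

For 50 fresh, 900 cache-read, and 300 cache-creation tokens (100 at 5m, 200 at 1h), the normalized input_tokens is 1,250, and subtracting both cache counts gives back the 50 fresh tokens.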
Fields
input_tokens: u64
Total input tokens processed (superset of the two cache counts).
output_tokens: u64
model: String
provider: String
cached_input_tokens: u64
Cache-READ tokens (billed at CACHE_READ_RATE, ~0.1x input).
cache_write_input_tokens: u64
Cache-WRITE / creation tokens (Anthropic only today; billed at
CACHE_WRITE_RATE, 1.25x input at 5m TTL or 2.0x at 1h TTL).
This is the total across both TTL buckets; the breakdown
lives on Self::cache_write_5m_input_tokens and
Self::cache_write_1h_input_tokens (#1091). Serialized with a
default (0) for backward compatibility with events predating this
field.
cache_write_5m_input_tokens: u64
Anthropic cache-WRITE tokens at the default 5-minute TTL,
parsed from usage.cache_creation.ephemeral_5m_input_tokens.
Subset of Self::cache_write_input_tokens — sums with
Self::cache_write_1h_input_tokens to the total. 0 on
providers that don’t report the per-TTL breakdown (OpenAI,
Gemini, mock) and for pre-#1091 events that omit the field.
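Because a pre-#1091 Anthropic event can carry a non-zero cache_write_input_tokens with both per-TTL fields at 0, a consumer cannot simply assert that the two buckets sum to the total. A hypothetical classifier that separates the three cases (TtlSplit and classify_ttl_split are illustrative names, not crate API):

```rust
/// Hypothetical outcome of checking the per-TTL write split.
#[derive(Debug, PartialEq, Eq)]
pub enum TtlSplit {
    /// The two buckets sum exactly to the total (including all-zero).
    Consistent,
    /// Writes were recorded but no per-TTL breakdown was reported.
    Unreported,
    /// Non-zero buckets that disagree with the total.
    Mismatch,
}

pub fn classify_ttl_split(total: u64, write_5m: u64, write_1h: u64) -> TtlSplit {
    if write_5m.checked_add(write_1h) == Some(total) {
        // Covers the normal case and the "no cache writes at all" case.
        TtlSplit::Consistent
    } else if write_5m == 0 && write_1h == 0 {
        // e.g. a pre-#1091 Anthropic event that omits the breakdown.
        TtlSplit::Unreported
    } else {
        // Buckets present but inconsistent: likely a parsing problem.
        TtlSplit::Mismatch
    }
}
```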
cache_write_1h_input_tokens: u64
Anthropic cache-WRITE tokens at the 1-hour TTL, parsed from
usage.cache_creation.ephemeral_1h_input_tokens. Subset of
Self::cache_write_input_tokens — sums with
Self::cache_write_5m_input_tokens to the total. 0 on
providers without per-TTL reporting (OpenAI, Gemini, mock) and
for pre-#1091 events that omit the field. The 1h-TTL bucket
bills at 2.0x base input vs. 1.25x for 5m — pricing::compute_cost
uses this split for accurate cost attribution (#1091).
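A sketch of TTL-aware cost attribution using the multipliers quoted in these docs (0.1x for reads, 1.25x for 5m writes, 2.0x for 1h writes). This is not pricing::compute_cost; UsageForCost, sketch_cost, the per-token f64 prices, and the fallback that bills unsplit writes at the default 5m rate are all assumptions made for illustration.

```rust
// Multipliers on the base input price, as stated in the TokenUsage docs.
// The constant names are illustrative, not the crate's own.
const CACHE_READ_MULT: f64 = 0.1;
const CACHE_WRITE_5M_MULT: f64 = 1.25;
const CACHE_WRITE_1H_MULT: f64 = 2.0;

/// Hypothetical subset of TokenUsage needed for costing.
#[derive(Debug, Clone, Copy)]
pub struct UsageForCost {
    pub input_tokens: u64, // superset total
    pub cached_input_tokens: u64,
    pub cache_write_input_tokens: u64,
    pub cache_write_5m_input_tokens: u64,
    pub cache_write_1h_input_tokens: u64,
    pub output_tokens: u64,
}

/// Cost in the same currency unit as the per-token prices.
pub fn sketch_cost(u: &UsageForCost, input_price: f64, output_price: f64) -> f64 {
    // Fresh portion under the superset convention.
    let fresh = u
        .input_tokens
        .saturating_sub(u.cached_input_tokens)
        .saturating_sub(u.cache_write_input_tokens);
    // Assumption: with no per-TTL split reported, bill every write at the
    // default 5m rate.
    let (w5m, w1h) = if u.cache_write_5m_input_tokens == 0 && u.cache_write_1h_input_tokens == 0 {
        (u.cache_write_input_tokens, 0)
    } else {
        (u.cache_write_5m_input_tokens, u.cache_write_1h_input_tokens)
    };
    let weighted_input = fresh as f64
        + CACHE_READ_MULT * u.cached_input_tokens as f64
        + CACHE_WRITE_5M_MULT * w5m as f64
        + CACHE_WRITE_1H_MULT * w1h as f64;
    input_price * weighted_input + output_price * u.output_tokens as f64
}
```

Floating-point prices keep the sketch short; a production ledger might prefer integer micro-units to avoid rounding drift.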
stop_reason: Option<String>
The provider-reported stop reason for the underlying call, when
known. Anthropic surfaces values like "end_turn", "max_tokens",
"tool_use", and "stop_sequence"; OpenAI uses "stop", "length", and
"tool_calls"; Gemini uses "STOP", "MAX_TOKENS", etc.
Carried alongside usage so the engine’s validation-failure path can
distinguish “model truncated mid-output” (max_tokens / length /
MAX_TOKENS) from “model finished cleanly but produced an
invalid shape” — see issue #320 / #321. None for providers that
don’t surface a stop reason or for paths that haven’t been threaded
(e.g. the mock provider). Serialized with #[serde(default)] so old
wire payloads that omit the field still deserialize.
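The truncation check described above might look like the following; StopCause and stop_cause are hypothetical names, not crate API. The canonical form also spells truncation as "max_tokens", so the same match works whether stop_reason holds the raw or the canonical value.

```rust
/// Hypothetical classification used when output fails validation.
#[derive(Debug, PartialEq, Eq)]
pub enum StopCause {
    /// The model hit its output limit mid-output.
    Truncated,
    /// Some other reported reason (clean finish, tool call, etc.).
    NotTruncated,
    /// No stop reason available (e.g. mock provider, unthreaded path).
    Unknown,
}

pub fn stop_cause(stop_reason: Option<&str>) -> StopCause {
    match stop_reason {
        // Anthropic / canonical, OpenAI, and Gemini truncation spellings.
        Some("max_tokens" | "length" | "MAX_TOKENS") => StopCause::Truncated,
        Some(_) => StopCause::NotTruncated,
        None => StopCause::Unknown,
    }
}
```

Given a TokenUsage named usage, call it as stop_cause(usage.stop_reason.as_deref()).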
Today this field carries the RAW provider value when the
parse_*_usage path produced the TokenUsage (the common case
for non-streamed calls). The usage_from_outcome rebuild path
(streaming + some retry paths) writes the OTel-canonical form
("stop" / "max_tokens" / "tool_use" / "content_filter" /
"other") because LlmCallOutcome only carries the canonical
form. Consumers that need a deterministic-by-provider raw value
should prefer Self::raw_stop_reason (#1077).
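To illustrate why the raw value matters, here is a hypothetical canonicalization that maps only the provider values mentioned in these docs onto the canonical vocabulary, plus a helper that prefers the raw value when it is set. canonicalize and preferred_stop_reason are illustrative names; the crate's real mapping may differ, and "content_filter" is omitted because the docs don't say which raw values feed it.

```rust
/// Hypothetical raw -> canonical mapping. Distinct raw reasons
/// ("STOP" vs "RECITATION", "end_turn" vs "stop_sequence") collapse
/// to the same canonical value, which is the lossiness described above.
pub fn canonicalize(raw: &str) -> &'static str {
    match raw {
        "end_turn" | "stop_sequence" | "stop" | "STOP" | "RECITATION" => "stop",
        "max_tokens" | "length" | "MAX_TOKENS" => "max_tokens",
        "tool_use" | "tool_calls" => "tool_use",
        _ => "other",
    }
}

/// Prefer the raw provider value; fall back to stop_reason, which may
/// already be canonical on streaming / rebuild paths.
pub fn preferred_stop_reason<'a>(raw: Option<&'a str>, stop: Option<&'a str>) -> Option<&'a str> {
    raw.or(stop)
}
```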
raw_stop_reason: Option<String>
Raw provider stop reason, never lossy-mapped to OTel canonical
form. Set to the same value as Self::stop_reason when the
parse_*_usage path produced the usage; None otherwise
(mock, streaming rebuilds via usage_from_outcome).
Bench / observability code that needs to distinguish Gemini’s
"STOP" from "RECITATION" (both collapse to "stop" under
the canonical mapping) or Anthropic’s "stop_sequence" from
"end_turn" should read this field. #1077.
reasoning_tokens: u64
Reasoning / thinking tokens: a SUBSET of Self::output_tokens,
not in addition. Captured from:
- OpenAI o-series + GPT-5: usage.completion_tokens_details.reasoning_tokens
- Anthropic extended thinking: usage.thinking_tokens (when present)
- Gemini with thinkingBudget set: usageMetadata.thoughtsTokenCount
0 when the model didn’t engage reasoning or the provider didn’t
surface the breakdown. #[serde(default)] keeps wire-compat with
pre-#322 events that omit the field entirely.
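Since reasoning_tokens is a subset of output_tokens, the visible (non-reasoning) output is a subtraction, never an addition. A minimal sketch with hypothetical helper names:

```rust
/// Visible output tokens: output_tokens already includes reasoning, so
/// subtract. Saturating subtraction guards against a provider reporting
/// more reasoning tokens than output tokens.
pub fn visible_output_tokens(output_tokens: u64, reasoning_tokens: u64) -> u64 {
    output_tokens.saturating_sub(reasoning_tokens)
}

/// Fraction of output spent on reasoning, or None when there was no output.
pub fn reasoning_share(output_tokens: u64, reasoning_tokens: u64) -> Option<f64> {
    if output_tokens == 0 {
        None
    } else {
        // Clamp so a malformed record can't report a share above 1.0.
        Some(reasoning_tokens.min(output_tokens) as f64 / output_tokens as f64)
    }
}
```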
Trait Implementations
impl Clone for TokenUsage
fn clone(&self) -> TokenUsage
Returns a duplicate of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.