ai_tokenopt

Full-spectrum, adaptive token optimization engine for LLM inference pipelines.

Compresses prompts, conversation history, RAG context, tool schemas, tool results, and output streams to minimise token usage while preserving response quality. Delivers a 40–60% reduction in both input and output tokens across typical multi-turn conversation flows.

Overview

Large Language Models have finite context windows. As conversations grow, you face a choice: truncate history (losing context) or exceed the budget (degrading quality). ai_tokenopt solves this with a multi-strategy, impact-ordered pipeline that adapts to token pressure in real time.

Compression tiers

Tier	Strategy	Information Loss
1	Lossless — whitespace normalisation, cross-turn RAG dedup	None
2	Extractive — key-sentence extraction from pruned messages	Minimal
3	LLM Fallback — rolling summary via your LLM backend	Low (semantic)

Additional strategies (all active by default)

Output token budgeting — dynamic max_tokens/num_predict from remaining budget
Tool schema compression — shortens descriptions; progressive stripping for already-seen tools
Dynamic tool selection — keyword-overlap relevance scoring, picks top-N tools per query
Tool result compression — extractive truncation of historical tool outputs (~100 tokens)
System prompt trimming — abbreviation and section removal under pressure
Conciseness injection — brevity directive appended to system prompt at >70% pressure
Sampling parameters — configurable repeat_penalty and presence_penalty forwarded to Ollama
Stream repetition detection — terminates degenerate output loops early
Impact-ordered pipeline — estimates savings per strategy, applies highest-impact first
Hardware auto-detection — auto-detects Ollama context window and hardware tier at startup
HuggingFace tokenizer — real BPE tokenisation with graceful heuristic fallback
Runtime prompt overrides — swap compiled-in templates at runtime from a directory
Prometheus metrics — tokenopt_tokens_saved_total, strategy usage counters, reduction ratio
Tracing spans — all major operations instrumented with #[instrument] for Jaeger/OTLP

Quick Start

[dependencies]
ai_tokenopt = "0.5"

Basic usage

use ai_tokenopt::{TokenOptimizer, TokenOptimizationConfig};
use ai_tokenopt::types::Conversation;

let optimizer = TokenOptimizer::new(TokenOptimizationConfig::default());

let mut conv = Conversation::with_system_prompt("You are a helpful assistant.");
conv.add_user_message("What's the weather like?");
conv.add_assistant_message("It's sunny and 22°C today.");
conv.add_user_message("Should I bring an umbrella tomorrow?");

// Optimise (tiers 1–2 only; no LLM required)
let result = optimizer.optimize_conversation(&mut conv, None).await?;
println!("Tokens before: {}", result.estimate_before.total);
println!("Tokens after:  {}", result.estimate_after.total);
// Pass recommended_max_tokens directly to Ollama as `num_predict`
if let Some(max_toks) = result.recommended_max_tokens {
    println!("Recommended max_tokens: {max_toks}");
}
println!("Optimization plan: {} steps, ~{} tokens estimated savings",
    result.plan.steps.len(), result.plan.total_estimated_savings());
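
How a recommended output limit could be derived from the remaining budget is sketched below. This is an illustration only, not the crate's implementation: the function name, its parameters, and the rule "remaining context, clamped by an optional hard cap" are assumptions made for the example.

```rust
/// Hypothetical helper: derives an output token limit from what is left of the
/// context window after the (already optimised) prompt, clamped by an optional cap.
fn recommended_max_tokens(
    context_window: usize,
    prompt_tokens: usize,
    cap: Option<usize>,
) -> Option<usize> {
    // No recommendation if the prompt already fills or exceeds the window.
    let remaining = context_window.checked_sub(prompt_tokens)?;
    if remaining == 0 {
        return None;
    }
    // Apply the hard cap (compare `output_max_tokens` in the configuration) if set.
    Some(cap.map_or(remaining, |c| remaining.min(c)))
}
```

With an 8192-token window, a 6000-token prompt and a cap of 512, this sketch yields `Some(512)`; without a cap it yields `Some(2192)`.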

With LLM-based summarisation (tier 3)

use ai_tokenopt::ports::SummarizationPort;
use ai_tokenopt::TokenOptError;
use async_trait::async_trait;

struct MyLlmBackend;

#[async_trait]
impl SummarizationPort for MyLlmBackend {
    async fn summarize(&self, system_prompt: &str, text: &str) -> Result<String, TokenOptError> {
        // Call your LLM here (e.g., OpenAI, Ollama, llama.cpp)
        Ok(my_llm_call(system_prompt, text).await?)
    }
}

// Pass to the optimizer for tier-3 compaction
let llm = MyLlmBackend { /* ... */ };
let result = optimizer.optimize_conversation(&mut conv, Some(&llm)).await?;

Tool Optimization

use ai_tokenopt::types::{ToolDefinition, ToolParameters, ParameterProperty};
use std::collections::HashMap;

let tools: Vec<ToolDefinition> = vec![/* your tool definitions */];

// Select and compress the most relevant tools for a query
let optimized_tools = optimizer.optimize_tools("What's the weather?", &tools);
// Returns only the tools relevant to the query, with compressed descriptions

Stream Repetition Detection

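// Assumes RepetitionState is exported from the same module as RepetitionDetector
use ai_tokenopt::stream::repetition::{RepetitionDetector, RepetitionState};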

let mut detector = RepetitionDetector::new(
    3,    // n-gram size
    0.3,  // threshold: 30% repetition triggers detection
);

// Feed output chunks as they arrive
for chunk in stream {
    match detector.feed(&chunk) {
        RepetitionState::Normal => { /* continue */ },
        RepetitionState::Warning(ratio) => { /* elevated repetition */ },
        RepetitionState::Degenerate => { /* abort stream */ break; },
    }
}
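
To give a feel for what an n-gram repetition ratio measures, here is a minimal standalone sketch. It is not the crate's `RepetitionDetector`: it works on whitespace-separated words and on a complete string rather than on streamed chunks, and the function name is hypothetical.

```rust
use std::collections::HashSet;

/// Fraction of word n-grams in `text` that already occurred earlier in `text`.
/// Illustrative only; the real detector may tokenise and window differently.
fn repetition_ratio(text: &str, n: usize) -> f64 {
    let words: Vec<&str> = text.split_whitespace().collect();
    if n == 0 || words.len() < n {
        return 0.0;
    }
    let total = words.len() - n + 1;
    let mut seen = HashSet::new();
    let mut repeated = 0usize;
    for gram in words.windows(n) {
        // `insert` returns false when the n-gram was already present.
        if !seen.insert(gram) {
            repeated += 1;
        }
    }
    repeated as f64 / total as f64
}
```

A ratio above the configured threshold (0.3 in the example above) would then be treated as degenerate output.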

Architecture

┌─────────────────────────────────────────────────┐
│                 TokenOptimizer                   │
│  (orchestrates all optimization components)      │
├──────────┬──────────┬───────────┬───────────────┤
│  Budget  │ History  │  Prompt   │    Tools      │
│ Planner  │Compactor │ Optimizer │  Optimizer    │
│          │          │           │               │
│ allocate │ lossless │ trim to   │ select top-N  │
│ budget   │ extract  │ budget    │ compress      │
│ per-     │ LLM      │ preserve  │ schemas       │
│ component│ fallback │ identity  │               │
└──────────┴──────────┴───────────┴───────────────┘
                        │
              ┌─────────┴─────────┐
              │  Token Estimator  │
              │  (chars ÷ 4       │
              │   heuristic)      │
              └───────────────────┘

Components

Component	Module	Purpose
`TokenOptimizer`	`optimizer`	Central orchestrator; impact-ordered pipeline
`TokenBudget`	`budget`	Allocates context window; adaptive rebalancing
`TokenEstimator`	`estimator`	Heuristic char÷4 counting; HF tokenizer backend
`HistoryCompactor`	`history::compactor`	Lossless → extractive → LLM three-tier compaction
`deduplicate_rag_across_turns`	`prompt::rag_cross_turn_dedup`	Decay-based cross-turn RAG dedup
`optimize_system_prompt`	`prompt::system_prompt`	Pressure-triggered trim + conciseness inject
`compress_old_tool_results`	`tools::result_truncator`	Extractive truncation of historical tool outputs
`ToolUsageTracker`	`tools::progressive`	Progressive schema stripping for seen tools
`RepetitionDetector`	`stream::repetition`	N-gram degenerate output detection
`OptimizationMetrics`	`metrics`	Prometheus counters and gauges
`HfTokenEstimator`	`estimator_hf`	HuggingFace `tokenizers` crate backend
`TemplateLoader`	`prompt::template_loader`	Runtime prompt template loading with fallback

Configuration

All fields are optional with sensible defaults. Deserialise from TOML/JSON via serde:

[token_optimization]
enabled = true                          # Master switch (default: true)
context_window_tokens = 8192            # Match your model's num_ctx (auto-detected at startup)
response_headroom_ratio = 0.25          # Fraction reserved for LLM output (default: 0.25)
compaction_trigger_ratio = 0.70         # Compact at this usage ratio (default: 0.70)
max_summary_tokens = 256                # Rolling summary token budget (default: 256)
system_prompt_budget_ratio = 0.15       # Fraction for system prompt (default: 0.15)
rag_budget_ratio = 0.15                 # Fraction for RAG context (default: 0.15)
repetition_detection_enabled = true     # Monitor output streams (default: true)
repetition_ngram_size = 3               # N-gram size for detection (default: 3)
repetition_threshold = 0.3              # Degenerate threshold (default: 0.3)
max_tools_per_request = 8               # Max tools per LLM call (default: 8)

# v2 features (enabled by default; options marked "default: none" stay unset until configured)
output_max_tokens = 512                 # Hard cap on recommended max_tokens (default: none)
frequency_penalty = 1.1                 # Ollama repeat_penalty (default: none)
presence_penalty = 0.6                  # Ollama presence_penalty (default: none)
progressive_tool_compression = true     # Strip seen tool schemas on repeats (default: true)
conciseness_pressure_threshold = 0.7    # Brevity directive trigger (default: 0.7)
tool_result_max_tokens = 100            # Max tokens for historical tool results (default: 100)
max_history_tokens = 4096               # Token-budget window for history (default: auto)
max_profile_prompt_tokens = 300         # Agent profile section budget (default: 300)
prompt_template_dir = "/etc/pisovereign/prompts"  # Runtime template overrides (default: none)

# HuggingFace tokenizer (requires `hf-tokenizer` feature, on by default)
tokenizer_model = "meta-llama/Llama-3.2-3B"  # Model ID or local path (default: none)

Types

The crate provides its own conversation types that work without any external dependencies:

use ai_tokenopt::types::{
    Conversation,       // Conversation with messages and optional context
    ChatMessage,        // A single message (role + content)
    MessageRole,        // User, Assistant, System, Tool
    ToolDefinition,     // Tool name, description, and parameter schema
    ToolParameters,     // JSON Schema parameters
    ParameterProperty,  // Individual parameter definition
};

These types are minimal and focused — they expose only the fields relevant to token optimization.

Feature Flags

Feature	Default	Description
`pisovereign`	off	Zero-cost integration with PiSovereign's `domain` and `application` crates. Re-exports domain types directly and provides the `TokenOptimizedInferencePort` decorator.
`hf-tokenizer`	on	HuggingFace `tokenizers` crate backend for high-accuracy token counting. Disabling reduces compile time and binary size.
`ollama`	off	Enables HTTP-based `OllamaSummarizationAdapter` for LLM-assisted compaction. Requires `reqwest`.

Impact-Ordered Pipeline

TokenOptimizer::optimize_conversation() executes strategies in descending impact order. High-gain, zero-latency steps run first; expensive LLM operations run only if still necessary:

1. Cross-turn RAG deduplication      ← removes verbatim repeated context blocks
2. Conciseness pressure injection     ← adds brief brevity directive to system prompt
3. Progressive tool compression       ← strips schemas for tools seen in recent turns
4. Historical tool result truncation  ← caps tool output tokens for old messages
5. System prompt trim                 ← abbreviates/removes low-priority sections
6. Extractive history compaction      ← sentence-scored summary + oldest-first prune
7. LLM summarisation fallback         ← async; only if all else insufficient

optimize_conversation_with_tools() additionally:

  ─ Tool relevance scoring & selection ← keyword-overlap ranking
  ─ Schema progressive compression     ← seen tools lose verbose descriptions

The resulting OptimizationResult includes an OptimizationPlan listing which steps actually fired and their estimated savings:

let result = optimizer.optimize_conversation(&mut conv, None).await?;
for step in &result.plan.steps {
    println!("{}: ~{} tokens saved", step.name, step.estimated_savings);
}
println!("Total savings: {}", result.plan.total_estimated_savings());
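
The idea of impact ordering can be illustrated with a small standalone sketch. It assumes each candidate step carries a precomputed savings estimate and that planning stops as soon as the excess over budget is covered; the `Step` type and `plan_steps` function are hypothetical and only mirror the field names used above.

```rust
/// Hypothetical stand-in for a plan step with an estimated saving.
struct Step {
    name: &'static str,
    estimated_savings: usize,
}

/// Picks steps in descending order of estimated savings until the
/// excess over `budget` is covered. Returns the chosen steps in order.
fn plan_steps(mut candidates: Vec<Step>, current_tokens: usize, budget: usize) -> Vec<Step> {
    let mut excess = current_tokens.saturating_sub(budget);
    // Highest estimated impact first.
    candidates.sort_by(|a, b| b.estimated_savings.cmp(&a.estimated_savings));
    let mut chosen = Vec::new();
    for step in candidates {
        if excess == 0 {
            break; // already within budget, skip the remaining steps
        }
        if step.estimated_savings == 0 {
            continue; // a step that saves nothing is never applied
        }
        excess = excess.saturating_sub(step.estimated_savings);
        chosen.push(step);
    }
    chosen
}
```

A conversation that is already within budget yields an empty plan, which matches the fast-path behaviour described under Performance.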

How It Works

Token Estimation

By default, token counts are estimated with the chars ÷ 4 heuristic (~85% accurate for BPE tokenizers on English text). Text with more than 30% non-ASCII bytes uses the more conservative chars ÷ 2.5 ratio. When the hf-tokenizer feature is enabled and a tokenizer_model is configured, HfTokenEstimator supplies exact tokenizer counts instead. Each message adds 4 overhead tokens for role markers.
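
A minimal sketch of such a heuristic is shown below. It is not the crate's `TokenEstimator`; the function names are hypothetical, and rounding up the result is an assumption made for the example.

```rust
/// Per-message overhead for role markers, as stated above.
const MESSAGE_OVERHEAD_TOKENS: usize = 4;

/// Illustrative heuristic estimator: chars / 4, or chars / 2.5 when more
/// than 30% of the bytes are non-ASCII. Rounds up (an assumption).
fn estimate_tokens(text: &str) -> usize {
    if text.is_empty() {
        return 0;
    }
    let total_bytes = text.len();
    let non_ascii_bytes = text.bytes().filter(|b| !b.is_ascii()).count();
    // Integer form of: non_ascii_bytes / total_bytes > 0.30
    let chars_per_token = if non_ascii_bytes * 10 > total_bytes * 3 {
        2.5
    } else {
        4.0
    };
    let chars = text.chars().count() as f64;
    (chars / chars_per_token).ceil() as usize
}

/// Estimate for one chat message: content tokens plus role-marker overhead.
fn estimate_message_tokens(content: &str) -> usize {
    estimate_tokens(content) + MESSAGE_OVERHEAD_TOKENS
}
```

Under these assumptions an 11-character English string such as "hello world" is estimated at 3 tokens, or 7 once the per-message overhead is added.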

Budget Allocation

The context window is split across components:

┌─────────────────────────────────────────┐
│ Context Window (e.g., 8192 tokens)      │
├──────────────┬──────────────────────────┤
│ Response     │ Available Budget          │
│ Headroom     ├─────┬──────┬──────┬──────┤
│ (25%)        │ Sys │ RAG  │ Hist │Tools │
│              │ 15% │ 15%  │ rest │ est. │
└──────────────┴─────┴──────┴──────┴──────┘

Adaptive rebalancing: if the system prompt fits in less than its allocated ratio, the surplus is redistributed to history before any compaction fires.
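
The split and the rebalancing rule can be worked through with a short sketch. It uses the default ratios and assumes the system-prompt and RAG ratios are taken from the budget left after response headroom, as the diagram suggests; the tools share is left out for brevity. `BudgetSplit` and `split_budget` are hypothetical names, not the crate's `TokenBudget` API.

```rust
#[derive(Debug, PartialEq)]
struct BudgetSplit {
    response_headroom: usize,
    system_prompt: usize,
    rag: usize,
    history: usize,
}

/// Splits a context window using the default ratios (25% headroom, 15% system
/// prompt, 15% RAG). Unused system prompt allocation flows to history.
fn split_budget(context_window: usize, system_prompt_actual: usize) -> BudgetSplit {
    let response_headroom = context_window * 25 / 100;
    let available = context_window - response_headroom;
    let system_alloc = available * 15 / 100;
    let rag = available * 15 / 100;
    // Adaptive rebalancing: only reserve what the system prompt really needs,
    // capped at its allocation.
    let system_prompt = system_prompt_actual.min(system_alloc);
    let history = available - system_prompt - rag;
    BudgetSplit { response_headroom, system_prompt, rag, history }
}
```

For an 8192-token window and a 200-token system prompt this gives 2048 tokens of headroom, 921 for RAG, and 5023 for history, compared with 4302 for history when the system prompt uses its full 921-token allocation.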

Compaction Strategy

When history exceeds its budget allocation:

1. Lossless: Collapse redundant whitespace in all messages
2. Extractive: Remove oldest non-system messages (preserving recent turns), extract key sentences scored by role, position, recency, and information density
3. LLM Fallback: If a SummarizationPort is provided, generate a concise summary of pruned messages

The rolling summary is stored in conversation.summary and prepended to the remaining history.
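
The first two tiers can be approximated in a few lines. The sketch below is a simplification, not the crate's `HistoryCompactor`: messages are plain strings, token counts use a rounded-up chars ÷ 4 estimate, and key-sentence extraction and the LLM summary are left out. All names are hypothetical.

```rust
/// Tier 1: collapse every run of whitespace into a single space.
fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Rough token total for a history (chars / 4, rounded up per message).
fn total_tokens(history: &[String]) -> usize {
    history.iter().map(|m| (m.chars().count() + 3) / 4).sum()
}

/// Tier 2 (pruning part only): remove the oldest messages until the history
/// fits `budget_tokens`, always keeping the last `keep_recent` messages.
/// Returns the pruned messages so they can be summarised afterwards.
fn prune_oldest(
    history: &mut Vec<String>,
    budget_tokens: usize,
    keep_recent: usize,
) -> Vec<String> {
    let mut pruned = Vec::new();
    while history.len() > keep_recent && total_tokens(history) > budget_tokens {
        pruned.push(history.remove(0));
    }
    pruned
}
```

The pruned messages returned here are what a key-sentence extractor or a `SummarizationPort` implementation would then condense into the rolling summary.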

Tool Selection

Tools are scored by keyword overlap between the user's query and each tool's name, description, and parameter names. Tools explicitly mentioned by name always rank highest. The top-N tools (configurable via max_tools_per_request) are returned with compressed descriptions.
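
A keyword-overlap ranking of this kind might look like the sketch below. It is illustrative only: the `Tool` struct is a stand-in for `ToolDefinition`, parameter names are omitted, and both the size of the name-mention bonus and the decision to drop tools with a score of zero are assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a tool definition (parameter names omitted).
struct Tool {
    name: &'static str,
    description: &'static str,
}

/// Lowercased alphanumeric words of `text`.
fn keywords(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Number of shared keywords, plus a large bonus when the query names the tool.
fn score(query: &str, tool: &Tool) -> usize {
    let query_words = keywords(query);
    let tool_words = keywords(&format!("{} {}", tool.name, tool.description));
    let overlap = query_words.intersection(&tool_words).count();
    let named = query.to_lowercase().contains(&tool.name.to_lowercase());
    overlap + if named { 1000 } else { 0 }
}

/// Returns up to `n` tools with a non-zero score, best first.
fn select_top_n<'a>(query: &str, tools: &'a [Tool], n: usize) -> Vec<&'a Tool> {
    let mut scored: Vec<(usize, &'a Tool)> =
        tools.iter().map(|t| (score(query, t), t)).collect();
    // Stable sort: tools with equal scores keep their input order.
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored
        .into_iter()
        .filter(|(s, _)| *s > 0)
        .take(n)
        .map(|(_, t)| t)
        .collect()
}
```

Note that this naive version counts common words such as "the" as overlap; a production scorer would likely filter stop words first.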

Progressive compression reduces schema verbosity for tools that appeared in recent turns: full schema on first use, parameters only on second, name only thereafter.
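
The three detail levels can be modelled with a simple usage counter, as in the sketch below. This is not the crate's `ToolUsageTracker`: the type names are hypothetical, and the sketch counts every appearance rather than only those in recent turns.

```rust
use std::collections::HashMap;

/// How much of a tool's schema to send on a given request.
#[derive(Debug, PartialEq)]
enum SchemaDetail {
    Full,
    ParametersOnly,
    NameOnly,
}

/// Hypothetical tracker that counts how often each tool has been sent.
#[derive(Default)]
struct UsageTracker {
    seen: HashMap<String, u32>,
}

impl UsageTracker {
    /// Records one more appearance of `tool` and returns the detail level
    /// for it: full schema first, parameters only second, name only after.
    fn next_detail(&mut self, tool: &str) -> SchemaDetail {
        let count = self.seen.entry(tool.to_string()).or_insert(0);
        *count += 1;
        match *count {
            1 => SchemaDetail::Full,
            2 => SchemaDetail::ParametersOnly,
            _ => SchemaDetail::NameOnly,
        }
    }
}
```

Each tool is tracked independently, so a tool that appears for the first time late in a conversation still gets its full schema.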

Runtime Prompt Template Overrides

The TemplateLoader resolves prompt templates at runtime from a configurable directory, falling back to the compiled-in defaults from YAML_PROMPTS:

use ai_tokenopt::TemplateLoader;

// Filesystem-first loader — set in config or pass None for compiled defaults
let loader = TemplateLoader::new(Some("/etc/pisovereign/prompts"));

// Returns the on-disk override when present, otherwise the built-in template
if let Some(template) = loader.load("summarize") {
    // use the resolved template (custom or built-in)
}
}

Template files must be named <name>.prompt.txt. Place them in the directory configured via prompt_template_dir to override built-in prompts without recompiling.
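
The filesystem-first lookup with a compiled-in fallback can be sketched using only the standard library. This is an illustration of the pattern, not the crate's `TemplateLoader`: the function name and the way the built-in text is passed in are assumptions.

```rust
use std::fs;
use std::path::Path;

/// Looks for `<name>.prompt.txt` in `dir`; falls back to `built_in` when no
/// directory is configured or the file cannot be read.
fn load_template(dir: Option<&Path>, name: &str, built_in: &str) -> String {
    let from_disk = dir
        .map(|d| d.join(format!("{name}.prompt.txt")))
        .and_then(|path| fs::read_to_string(path).ok());
    from_disk.unwrap_or_else(|| built_in.to_string())
}
```

Treating any read error as "no override" keeps a missing or unreadable file from breaking prompt generation, at the cost of hiding typos in file names.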

Performance

Zero allocations for conversations within budget (fast-path bypass)
O(n) token estimation (single pass over message bytes)
Optional tokenizer — the heuristic estimator works with any model out of the box
Async-ready — all optimization methods are async for LLM fallback compatibility
Instrumented — all optimizer methods emit tracing spans for observability

Benchmarks

Run the Criterion benchmark suite:

cargo bench -p ai_tokenopt

Five benchmark groups:

Group	Measures
`token_estimation`	Heuristic estimator throughput across text types and lengths
`budget_allocation`	Budget split across message counts (5–200 messages)
`tool_compression`	Keyword-overlap tool selection; schema compression (5–50 tools)
`history_compaction`	Full pipeline with forced compaction (10–200 messages)
`full_pipeline`	End-to-end `optimize_conversation` and `optimize_conversation_with_tools`

Results are written to target/criterion/ with HTML reports.

License

MIT — see LICENSE for details.

Contributing

This crate is part of the PiSovereign project. Contributions welcome via pull requests.

ai_tokenopt 0.5.6