ai_tokenopt
Full-spectrum, adaptive token optimization engine for LLM inference pipelines.
Compresses prompts, conversation history, RAG context, tool schemas, tool results, and output streams to minimise token usage while preserving response quality. Delivers 40–60% reduction in both input and output tokens across typical multi-turn conversation flows.
Overview
Large Language Models have finite context windows. As conversations grow, you face a choice:
truncate history (losing context) or exceed the budget (degrading quality). ai_tokenopt solves
this with a multi-strategy, impact-ordered pipeline that adapts to token pressure in real time.
Compression tiers
| Tier | Strategy | Information Loss |
|---|---|---|
| 1 | Lossless — whitespace normalisation, cross-turn RAG dedup | None |
| 2 | Extractive — key-sentence extraction from pruned messages | Minimal |
| 3 | LLM Fallback — rolling summary via your LLM backend | Low (semantic) |
Additional strategies (all active by default)
- Output token budgeting — dynamic
max_tokens/num_predictfrom remaining budget - Tool schema compression — shortens descriptions; progressive stripping for already-seen tools
- Dynamic tool selection — keyword-overlap relevance scoring, picks top-N tools per query
- Tool result compression — extractive truncation of historical tool outputs (~100 tokens)
- System prompt trimming — abbreviation and section removal under pressure
- Conciseness injection — brevity directive appended to system prompt at >70% pressure
- Sampling parameters — configurable
repeat_penaltyandpresence_penaltyforwarded to Ollama - Stream repetition detection — terminates degenerate output loops early
- Impact-ordered pipeline — estimates savings per strategy, applies highest-impact first
- Hardware auto-detection — auto-detects Ollama context window and hardware tier at startup
- HuggingFace tokenizer — real BPE tokenisation with graceful heuristic fallback
- Runtime prompt overrides — swap compiled-in templates at runtime from a directory
- Prometheus metrics —
tokenopt_tokens_saved_total, strategy usage counters, reduction ratio - Tracing spans — all major operations instrumented with
#[instrument]for Jaeger/OTLP
Quick Start
[]
= "0.5"
Basic usage
use ;
use Conversation;
let optimizer = new;
let mut conv = with_system_prompt;
conv.add_user_message;
conv.add_assistant_message;
conv.add_user_message;
// Optimise (tiers 1–2 only; no LLM required)
let result = optimizer.optimize_conversation.await?;
println!;
println!;
// Pass recommended_max_tokens directly to Ollama as `num_predict`
if let Some = result.recommended_max_tokens
println!;
With LLM-based summarisation (tier 3)
use SummarizationPort;
use TokenOptError;
use async_trait;
;
// Pass to the optimizer for tier-3 compaction
let llm = MyLlmBackend ;
let result = optimizer.optimize_conversation.await?;
Tool Optimization
use ;
use HashMap;
let tools: = vec!;
// Select and compress the most relevant tools for a query
let optimized_tools = optimizer.optimize_tools;
// Returns only the tools relevant to the query, with compressed descriptions
Stream Repetition Detection
use RepetitionDetector;
let mut detector = new;
// Feed output chunks as they arrive
for chunk in stream
Architecture
┌─────────────────────────────────────────────────┐
│ TokenOptimizer │
│ (orchestrates all optimization components) │
├──────────┬──────────┬───────────┬───────────────┤
│ Budget │ History │ Prompt │ Tools │
│ Planner │Compactor │ Optimizer │ Optimizer │
│ │ │ │ │
│ allocate │ lossless │ trim to │ select top-N │
│ budget │ extract │ budget │ compress │
│ per- │ LLM │ preserve │ schemas │
│ component│ fallback │ identity │ │
└──────────┴──────────┴───────────┴───────────────┘
│
┌─────────┴─────────┐
│ Token Estimator │
│ (chars ÷ 4 │
│ heuristic) │
└───────────────────┘
Components
| Component | Module | Purpose |
|---|---|---|
TokenOptimizer |
optimizer |
Central orchestrator; impact-ordered pipeline |
TokenBudget |
budget |
Allocates context window; adaptive rebalancing |
TokenEstimator |
estimator |
Heuristic char÷4 counting; HF tokenizer backend |
HistoryCompactor |
history::compactor |
Lossless → extractive → LLM three-tier compaction |
deduplicate_rag_across_turns |
prompt::rag_cross_turn_dedup |
Decay-based cross-turn RAG dedup |
optimize_system_prompt |
prompt::system_prompt |
Pressure-triggered trim + conciseness inject |
compress_old_tool_results |
tools::result_truncator |
Extractive truncation of historical tool outputs |
ToolUsageTracker |
tools::progressive |
Progressive schema stripping for seen tools |
RepetitionDetector |
stream::repetition |
N-gram degenerate output detection |
OptimizationMetrics |
metrics |
Prometheus counters and gauges |
HfTokenEstimator |
estimator_hf |
HuggingFace tokenizers crate backend |
TemplateLoader |
prompt::template_loader |
Runtime prompt template loading with fallback |
Configuration
All fields are optional with sensible defaults. Deserialise from TOML/JSON via serde:
[]
= true # Master switch (default: true)
= 8192 # Match your model's num_ctx (auto-detected at startup)
= 0.25 # Fraction reserved for LLM output (default: 0.25)
= 0.70 # Compact at this usage ratio (default: 0.70)
= 256 # Rolling summary token budget (default: 256)
= 0.15 # Fraction for system prompt (default: 0.15)
= 0.15 # Fraction for RAG context (default: 0.15)
= true # Monitor output streams (default: true)
= 3 # N-gram size for detection (default: 3)
= 0.3 # Degenerate threshold (default: 0.3)
= 8 # Max tools per LLM call (default: 8)
# v2 features (all enabled by default)
= 512 # Hard cap on recommended max_tokens (default: none)
= 1.1 # Ollama repeat_penalty (default: none)
= 0.6 # Ollama presence_penalty (default: none)
= true # Strip seen tool schemas on repeats (default: true)
= 0.7 # Brevity directive trigger (default: 0.7)
= 100 # Max tokens for historical tool results (default: 100)
= 4096 # Token-budget window for history (default: auto)
= 300 # Agent profile section budget (default: 300)
= "/etc/pisovereign/prompts" # Runtime template overrides (default: none)
# HuggingFace tokenizer (requires `hf-tokenizer` feature, on by default)
= "meta-llama/Llama-3.2-3B" # Model ID or local path (default: none)
Types
The crate provides its own conversation types that work without any external dependencies:
use ;
These types are minimal and focused — they expose only the fields relevant to token optimization.
Feature Flags
| Feature | Default | Description |
|---|---|---|
pisovereign |
off | Zero-cost integration with PiSovereign's domain and application crates. Re-exports domain types directly and provides the TokenOptimizedInferencePort decorator. |
hf-tokenizer |
on | HuggingFace tokenizers crate backend for high-accuracy token counting. Disabling reduces compile time and binary size. |
ollama |
off | Enables HTTP-based OllamaSummarizationAdapter for LLM-assisted compaction. Requires reqwest. |
Impact-Ordered Pipeline
TokenOptimizer::optimize_conversation() executes strategies in descending impact order. High-gain, zero-latency steps run first; expensive LLM operations run only if still necessary:
1. Cross-turn RAG deduplication ← removes verbatim repeated context blocks
2. Conciseness pressure injection ← adds brief brevity directive to system prompt
3. Progressive tool compression ← strips schemas for tools seen in recent turns
4. Historical tool result truncation ← caps tool output tokens for old messages
5. System prompt trim ← abbreviates/removes low-priority sections
6. Extractive history compaction ← sentence-scored summary + oldest-first prune
7. LLM summarisation fallback ← async; only if all else insufficient
optimize_conversation_with_tools() additionally:
─ Tool relevance scoring & selection ← keyword-overlap ranking
─ Schema progressive compression ← seen tools lose verbose descriptions
The resulting OptimizationResult includes an OptimizationPlan listing which steps
actually fired and their estimated savings:
let result = optimizer.optimize_conversation.await?;
for step in &result.plan.steps
println!;
How It Works
Token Estimation
Uses the chars ÷ 4 heuristic (~85% accurate for BPE tokenizers on English text). Non-ASCII-heavy text (>30% non-ASCII bytes) uses a more conservative chars ÷ 2.5 ratio. When the hf-tokenizer feature is enabled and a tokenizer_model is configured, the HfTokenEstimator is used for precise per-token counts. Each message adds 4 overhead tokens for role markers.
Budget Allocation
The context window is split across components:
┌─────────────────────────────────────────┐
│ Context Window (e.g., 8192 tokens) │
├──────────────┬──────────────────────────┤
│ Response │ Available Budget │
│ Headroom ├─────┬──────┬──────┬──────┤
│ (25%) │ Sys │ RAG │ Hist │Tools │
│ │ 15% │ 15% │ rest │ est. │
└──────────────┴─────┴──────┴──────┴──────┘
Adaptive rebalancing: if the system prompt fits in less than its allocated ratio, the surplus is redistributed to history before any compaction fires.
Compaction Strategy
When history exceeds its budget allocation:
- Lossless: Collapse redundant whitespace in all messages
- Extractive: Remove oldest non-system messages (preserving recent turns), extract key sentences scored by role, position, recency, and information density
- LLM Fallback: If a
SummarizationPortis provided, generate a concise summary of pruned messages
The rolling summary is stored in conversation.summary and prepended to the remaining history.
Tool Selection
Tools are scored by keyword overlap between the user's query and each tool's name, description, and parameter names. Tools explicitly mentioned by name always rank highest. The top-N tools (configurable via max_tools_per_request) are returned with compressed descriptions.
Progressive compression reduces schema verbosity for tools that appeared in recent turns: full schema on first use, parameters only on second, name only thereafter.
Runtime Prompt Template Overrides
The TemplateLoader resolves prompt templates at runtime from a configurable directory,
falling back to the compiled-in defaults from YAML_PROMPTS:
use TemplateLoader;
// Filesystem-first loader — set in config or pass None for compiled defaults
let loader = new;
// Returns Some(String) from disk, or the built-in template as fallback
if let Some = loader.load
Template files must be named <name>.prompt.txt. Place them in the directory configured
via prompt_template_dir to override built-in prompts without recompiling.
Performance
- Zero allocations for conversations within budget (fast-path bypass)
- O(n) token estimation (single pass over message bytes)
- Optional tokenizer — the heuristic estimator works with any model out of the box
- Async-ready — all optimization methods are
asyncfor LLM fallback compatibility - Instrumented — all optimizer methods emit
tracingspans for observability
Benchmarks
Run the Criterion benchmark suite:
Five benchmark groups:
| Group | Measures |
|---|---|
token_estimation |
Heuristic estimator throughput across text types and lengths |
budget_allocation |
Budget split across message counts (5–200 messages) |
tool_compression |
Semantic tool selection; schema compression (5–50 tools) |
history_compaction |
Full pipeline with forced compaction (10–200 messages) |
full_pipeline |
End-to-end optimize_conversation and optimize_with_tools |
Results are written to target/criterion/ with HTML reports.
License
MIT — see LICENSE for details.
Contributing
This crate is part of the PiSovereign project. Contributions welcome via pull requests.