# ai_tokenopt
**Full-spectrum, adaptive token optimization engine for LLM inference pipelines.**
Compresses prompts, conversation history, RAG context, tool schemas, tool results, and output
streams to minimise token usage while preserving response quality. Delivers **40–60% reduction**
in both input and output tokens across typical multi-turn conversation flows.
[](https://crates.io/crates/ai_tokenopt)
[](https://opensource.org/licenses/MIT)
---
## Overview
Large Language Models have finite context windows. As conversations grow, you face a choice:
truncate history (losing context) or exceed the budget (degrading quality). `ai_tokenopt` solves
this with a multi-strategy, impact-ordered pipeline that adapts to token pressure in real time.
### Compression tiers
| 1 | **Lossless** — whitespace normalisation, cross-turn RAG dedup | None |
| 2 | **Extractive** — key-sentence extraction from pruned messages | Minimal |
| 3 | **LLM Fallback** — rolling summary via your LLM backend | Low (semantic) |
### Additional strategies (all active by default)
- **Output token budgeting** — dynamic `max_tokens`/`num_predict` from remaining budget
- **Tool schema compression** — shortens descriptions; progressive stripping for already-seen tools
- **Dynamic tool selection** — keyword-overlap relevance scoring, picks top-N tools per query
- **Tool result compression** — extractive truncation of historical tool outputs (~100 tokens)
- **System prompt trimming** — abbreviation and section removal under pressure
- **Conciseness injection** — brevity directive appended to system prompt at >70% pressure
- **Sampling parameters** — configurable `repeat_penalty` and `presence_penalty` forwarded to Ollama
- **Stream repetition detection** — terminates degenerate output loops early
- **Impact-ordered pipeline** — estimates savings per strategy, applies highest-impact first
- **Hardware auto-detection** — auto-detects Ollama context window and hardware tier at startup
- **HuggingFace tokenizer** — real BPE tokenisation with graceful heuristic fallback
- **Runtime prompt overrides** — swap compiled-in templates at runtime from a directory
- **Prometheus metrics** — `tokenopt_tokens_saved_total`, strategy usage counters, reduction ratio
- **Tracing spans** — all major operations instrumented with `#[instrument]` for Jaeger/OTLP
## Quick Start
```toml
[dependencies]
ai_tokenopt = "0.5"
```
### Basic usage
```rust
use ai_tokenopt::{TokenOptimizer, TokenOptimizationConfig};
use ai_tokenopt::types::Conversation;
let optimizer = TokenOptimizer::new(TokenOptimizationConfig::default());
let mut conv = Conversation::with_system_prompt("You are a helpful assistant.");
conv.add_user_message("What's the weather like?");
conv.add_assistant_message("It's sunny and 22°C today.");
conv.add_user_message("Should I bring an umbrella tomorrow?");
// Optimise (tiers 1–2 only; no LLM required)
let result = optimizer.optimize_conversation(&mut conv, None).await?;
println!("Tokens before: {}", result.estimate_before.total);
println!("Tokens after: {}", result.estimate_after.total);
// Pass recommended_max_tokens directly to Ollama as `num_predict`
if let Some(max_toks) = result.recommended_max_tokens {
println!("Recommended max_tokens: {max_toks}");
}
println!("Optimization plan: {} steps, ~{} tokens estimated savings",
result.plan.steps.len(), result.plan.total_estimated_savings());
```
### With LLM-based summarisation (tier 3)
```rust
use ai_tokenopt::ports::SummarizationPort;
use ai_tokenopt::TokenOptError;
use async_trait::async_trait;
struct MyLlmBackend;
#[async_trait]
impl SummarizationPort for MyLlmBackend {
async fn summarize(&self, system_prompt: &str, text: &str) -> Result<String, TokenOptError> {
// Call your LLM here (e.g., OpenAI, Ollama, llama.cpp)
Ok(my_llm_call(system_prompt, text).await?)
}
}
// Pass to the optimizer for tier-3 compaction
let llm = MyLlmBackend { /* ... */ };
let result = optimizer.optimize_conversation(&mut conv, Some(&llm)).await?;
```
### Tool Optimization
```rust
use ai_tokenopt::types::{ToolDefinition, ToolParameters, ParameterProperty};
use std::collections::HashMap;
let tools: Vec<ToolDefinition> = vec![/* your tool definitions */];
// Select and compress the most relevant tools for a query
let optimized_tools = optimizer.optimize_tools("What's the weather?", &tools);
// Returns only the tools relevant to the query, with compressed descriptions
```
### Stream Repetition Detection
```rust
use ai_tokenopt::stream::repetition::RepetitionDetector;
let mut detector = RepetitionDetector::new(
3, // n-gram size
0.3, // threshold: 30% repetition triggers detection
);
// Feed output chunks as they arrive
for chunk in stream {
match detector.feed(&chunk) {
RepetitionState::Normal => { /* continue */ },
RepetitionState::Warning(ratio) => { /* elevated repetition */ },
RepetitionState::Degenerate => { /* abort stream */ break; },
}
}
```
## Architecture
```text
┌─────────────────────────────────────────────────┐
│ TokenOptimizer │
│ (orchestrates all optimization components) │
├──────────┬──────────┬───────────┬───────────────┤
│ Budget │ History │ Prompt │ Tools │
│ Planner │Compactor │ Optimizer │ Optimizer │
│ │ │ │ │
│ allocate │ lossless │ trim to │ select top-N │
│ budget │ extract │ budget │ compress │
│ per- │ LLM │ preserve │ schemas │
│ component│ fallback │ identity │ │
└──────────┴──────────┴───────────┴───────────────┘
│
┌─────────┴─────────┐
│ Token Estimator │
│ (chars ÷ 4 │
│ heuristic) │
└───────────────────┘
```
### Components
| `TokenOptimizer` | `optimizer` | Central orchestrator; impact-ordered pipeline |
| `TokenBudget` | `budget` | Allocates context window; adaptive rebalancing |
| `TokenEstimator` | `estimator` | Heuristic char÷4 counting; HF tokenizer backend |
| `HistoryCompactor` | `history::compactor` | Lossless → extractive → LLM three-tier compaction |
| `deduplicate_rag_across_turns` | `prompt::rag_cross_turn_dedup` | Decay-based cross-turn RAG dedup |
| `optimize_system_prompt` | `prompt::system_prompt` | Pressure-triggered trim + conciseness inject |
| `compress_old_tool_results` | `tools::result_truncator` | Extractive truncation of historical tool outputs |
| `ToolUsageTracker` | `tools::progressive` | Progressive schema stripping for seen tools |
| `RepetitionDetector` | `stream::repetition` | N-gram degenerate output detection |
| `OptimizationMetrics` | `metrics` | Prometheus counters and gauges |
| `HfTokenEstimator` | `estimator_hf` | HuggingFace `tokenizers` crate backend |
| `TemplateLoader` | `prompt::template_loader` | Runtime prompt template loading with fallback |
## Configuration
All fields are optional with sensible defaults. Deserialise from TOML/JSON via `serde`:
```toml
[token_optimization]
enabled = true # Master switch (default: true)
context_window_tokens = 8192 # Match your model's num_ctx (auto-detected at startup)
response_headroom_ratio = 0.25 # Fraction reserved for LLM output (default: 0.25)
compaction_trigger_ratio = 0.70 # Compact at this usage ratio (default: 0.70)
max_summary_tokens = 256 # Rolling summary token budget (default: 256)
system_prompt_budget_ratio = 0.15 # Fraction for system prompt (default: 0.15)
rag_budget_ratio = 0.15 # Fraction for RAG context (default: 0.15)
repetition_detection_enabled = true # Monitor output streams (default: true)
repetition_ngram_size = 3 # N-gram size for detection (default: 3)
repetition_threshold = 0.3 # Degenerate threshold (default: 0.3)
max_tools_per_request = 8 # Max tools per LLM call (default: 8)
# v2 features (all enabled by default)
output_max_tokens = 512 # Hard cap on recommended max_tokens (default: none)
frequency_penalty = 1.1 # Ollama repeat_penalty (default: none)
presence_penalty = 0.6 # Ollama presence_penalty (default: none)
progressive_tool_compression = true # Strip seen tool schemas on repeats (default: true)
conciseness_pressure_threshold = 0.7 # Brevity directive trigger (default: 0.7)
tool_result_max_tokens = 100 # Max tokens for historical tool results (default: 100)
max_history_tokens = 4096 # Token-budget window for history (default: auto)
max_profile_prompt_tokens = 300 # Agent profile section budget (default: 300)
prompt_template_dir = "/etc/pisovereign/prompts" # Runtime template overrides (default: none)
# HuggingFace tokenizer (requires `hf-tokenizer` feature, on by default)
tokenizer_model = "meta-llama/Llama-3.2-3B" # Model ID or local path (default: none)
```
## Types
The crate provides its own conversation types that work without any external dependencies:
```rust
use ai_tokenopt::types::{
Conversation, // Conversation with messages and optional context
ChatMessage, // A single message (role + content)
MessageRole, // User, Assistant, System, Tool
ToolDefinition, // Tool name, description, and parameter schema
ToolParameters, // JSON Schema parameters
ParameterProperty, // Individual parameter definition
};
```
These types are minimal and focused — they expose only the fields relevant to token optimization.
## Feature Flags
| `pisovereign` | off | Zero-cost integration with PiSovereign's `domain` and `application` crates. Re-exports domain types directly and provides the `TokenOptimizedInferencePort` decorator. |
| `hf-tokenizer` | on | HuggingFace `tokenizers` crate backend for high-accuracy token counting. Disabling reduces compile time and binary size. |
| `ollama` | off | Enables HTTP-based `OllamaSummarizationAdapter` for LLM-assisted compaction. Requires `reqwest`. |
## Impact-Ordered Pipeline
`TokenOptimizer::optimize_conversation()` executes strategies in descending impact order. High-gain, zero-latency steps run first; expensive LLM operations run only if still necessary:
```text
1. Cross-turn RAG deduplication ← removes verbatim repeated context blocks
2. Conciseness pressure injection ← adds brief brevity directive to system prompt
3. Progressive tool compression ← strips schemas for tools seen in recent turns
4. Historical tool result truncation ← caps tool output tokens for old messages
5. System prompt trim ← abbreviates/removes low-priority sections
6. Extractive history compaction ← sentence-scored summary + oldest-first prune
7. LLM summarisation fallback ← async; only if all else insufficient
```
`optimize_conversation_with_tools()` additionally:
```text
─ Tool relevance scoring & selection ← keyword-overlap ranking
─ Schema progressive compression ← seen tools lose verbose descriptions
```
The resulting `OptimizationResult` includes an `OptimizationPlan` listing which steps
actually fired and their estimated savings:
```rust
let result = optimizer.optimize_conversation(&conv, None).await?;
for step in &result.plan.steps {
println!("{}: ~{} tokens saved", step.name, step.estimated_savings);
}
println!("Total savings: {}", result.plan.total_estimated_savings());
```
## How It Works
### Token Estimation
Uses the `chars ÷ 4` heuristic (~85% accurate for BPE tokenizers on English text). Non-ASCII-heavy text (>30% non-ASCII bytes) uses a more conservative `chars ÷ 2.5` ratio. When the `hf-tokenizer` feature is enabled and a `tokenizer_model` is configured, the `HfTokenEstimator` is used for precise per-token counts. Each message adds 4 overhead tokens for role markers.
### Budget Allocation
The context window is split across components:
```text
┌─────────────────────────────────────────┐
│ Context Window (e.g., 8192 tokens) │
├──────────────┬──────────────────────────┤
│ Response │ Available Budget │
│ Headroom ├─────┬──────┬──────┬──────┤
│ (25%) │ Sys │ RAG │ Hist │Tools │
│ │ 15% │ 15% │ rest │ est. │
└──────────────┴─────┴──────┴──────┴──────┘
```
Adaptive rebalancing: if the system prompt fits in less than its allocated ratio, the
surplus is redistributed to history before any compaction fires.
### Compaction Strategy
When history exceeds its budget allocation:
1. **Lossless**: Collapse redundant whitespace in all messages
2. **Extractive**: Remove oldest non-system messages (preserving recent turns), extract key sentences scored by role, position, recency, and information density
3. **LLM Fallback**: If a `SummarizationPort` is provided, generate a concise summary of pruned messages
The rolling summary is stored in `conversation.summary` and prepended to the remaining history.
### Tool Selection
Tools are scored by keyword overlap between the user's query and each tool's name, description, and parameter names. Tools explicitly mentioned by name always rank highest. The top-N tools (configurable via `max_tools_per_request`) are returned with compressed descriptions.
Progressive compression reduces schema verbosity for tools that appeared in recent turns:
full schema on first use, parameters only on second, name only thereafter.
## Runtime Prompt Template Overrides
The `TemplateLoader` resolves prompt templates at runtime from a configurable directory,
falling back to the compiled-in defaults from `YAML_PROMPTS`:
```rust
use ai_tokenopt::TemplateLoader;
// Filesystem-first loader — set in config or pass None for compiled defaults
let loader = TemplateLoader::new(Some("/etc/pisovereign/prompts"));
// Returns Some(String) from disk, or the built-in template as fallback
if let Some(template) = loader.load("summarize") {
// use custom template
}
```
Template files must be named `<name>.prompt.txt`. Place them in the directory configured
via `prompt_template_dir` to override built-in prompts without recompiling.
## Performance
- **Zero allocations** for conversations within budget (fast-path bypass)
- **O(n)** token estimation (single pass over message bytes)
- **Optional tokenizer** — the heuristic estimator works with any model out of the box
- **Async-ready** — all optimization methods are `async` for LLM fallback compatibility
- **Instrumented** — all optimizer methods emit `tracing` spans for observability
## Benchmarks
Run the Criterion benchmark suite:
```bash
cargo bench -p ai_tokenopt
```
Five benchmark groups:
| `token_estimation` | Heuristic estimator throughput across text types and lengths |
| `budget_allocation` | Budget split across message counts (5–200 messages) |
| `tool_compression` | Semantic tool selection; schema compression (5–50 tools) |
| `history_compaction` | Full pipeline with forced compaction (10–200 messages) |
| `full_pipeline` | End-to-end `optimize_conversation` and `optimize_with_tools` |
Results are written to `target/criterion/` with HTML reports.
## License
MIT — see [LICENSE](../../LICENSE) for details.
## Contributing
This crate is part of the [PiSovereign](https://github.com/andreasreichel/PiSovereign) project. Contributions welcome via pull requests.