ai_tokenopt 0.5.7


ai_tokenopt

Full-spectrum, adaptive token optimization engine for LLM inference pipelines.

Compresses prompts, conversation history, RAG context, tool schemas, tool results, and output streams to minimise token usage while preserving response quality. Delivers 40–60% reduction in both input and output tokens across typical multi-turn conversation flows.



Overview

Large Language Models have finite context windows. As conversations grow, you face a choice: truncate history (losing context) or exceed the budget (degrading quality). ai_tokenopt solves this with a multi-strategy, impact-ordered pipeline that adapts to token pressure in real time.

Compression tiers

| Tier | Strategy | Information Loss |
|------|----------|------------------|
| 1 | Lossless — whitespace normalisation, cross-turn RAG dedup | None |
| 2 | Extractive — key-sentence extraction from pruned messages | Minimal |
| 3 | LLM Fallback — rolling summary via your LLM backend | Low (semantic) |

Additional strategies (all active by default)

  • Output token budgeting — dynamic max_tokens/num_predict from remaining budget
  • Tool schema compression — shortens descriptions; progressive stripping for already-seen tools
  • Dynamic tool selection — keyword-overlap relevance scoring, picks top-N tools per query
  • Tool result compression — extractive truncation of historical tool outputs (~100 tokens)
  • System prompt trimming — abbreviation and section removal under pressure
  • Conciseness injection — brevity directive appended to system prompt at >70% pressure
  • Sampling parameters — configurable repeat_penalty and presence_penalty forwarded to Ollama
  • Stream repetition detection — terminates degenerate output loops early
  • Impact-ordered pipeline — estimates savings per strategy, applies highest-impact first
  • Hardware auto-detection — auto-detects Ollama context window and hardware tier at startup
  • HuggingFace tokenizer — real BPE tokenisation with graceful heuristic fallback
  • Runtime prompt overrides — swap compiled-in templates at runtime from a directory
  • Prometheus metrics — tokenopt_tokens_saved_total, strategy usage counters, reduction ratio
  • Tracing spans — all major operations instrumented with #[instrument] for Jaeger/OTLP

Quick Start

[dependencies]
ai_tokenopt = "0.5"

Basic usage

use ai_tokenopt::{TokenOptimizer, TokenOptimizationConfig};
use ai_tokenopt::types::Conversation;

let optimizer = TokenOptimizer::new(TokenOptimizationConfig::default());

let mut conv = Conversation::with_system_prompt("You are a helpful assistant.");
conv.add_user_message("What's the weather like?");
conv.add_assistant_message("It's sunny and 22°C today.");
conv.add_user_message("Should I bring an umbrella tomorrow?");

// Optimise (tiers 1–2 only; no LLM required)
let result = optimizer.optimize_conversation(&mut conv, None).await?;
println!("Tokens before: {}", result.estimate_before.total);
println!("Tokens after:  {}", result.estimate_after.total);
// Pass recommended_max_tokens directly to Ollama as `num_predict`
if let Some(max_toks) = result.recommended_max_tokens {
    println!("Recommended max_tokens: {max_toks}");
}
println!("Optimization plan: {} steps, ~{} tokens estimated savings",
    result.plan.steps.len(), result.plan.total_estimated_savings());
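
To make the idea behind recommended_max_tokens concrete, here is a minimal sketch of deriving an output limit from the remaining budget. It is not the crate's implementation: the function name, its parameters, and the exact formula (context window minus prompt tokens, clamped by an optional cap such as output_max_tokens) are assumptions for illustration.

```rust
/// Hypothetical helper (not part of ai_tokenopt): derives an output token
/// limit from what is left of the context window after the prompt.
///
/// Assumption: remaining = context_window - prompt_tokens, optionally
/// clamped by a hard cap such as `output_max_tokens`.
fn recommended_max_tokens(
    context_window: usize,
    prompt_tokens: usize,
    output_cap: Option<usize>,
) -> Option<usize> {
    // `checked_sub` yields None when the prompt already overflows the window.
    let remaining = context_window.checked_sub(prompt_tokens)?;
    if remaining == 0 {
        return None;
    }
    Some(match output_cap {
        Some(cap) => remaining.min(cap),
        None => remaining,
    })
}
```

With an 8192-token window, a 6000-token prompt, and a cap of 512, this sketch returns Some(512); once the prompt fills or exceeds the window it returns None, which a caller could treat as "compact further before generating".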

With LLM-based summarisation (tier 3)

use ai_tokenopt::ports::SummarizationPort;
use ai_tokenopt::TokenOptError;
use async_trait::async_trait;

struct MyLlmBackend;

#[async_trait]
impl SummarizationPort for MyLlmBackend {
    async fn summarize(&self, system_prompt: &str, text: &str) -> Result<String, TokenOptError> {
        // Call your LLM here (e.g., OpenAI, Ollama, llama.cpp)
        Ok(my_llm_call(system_prompt, text).await?)
    }
}

// Pass to the optimizer for tier-3 compaction
let llm = MyLlmBackend { /* ... */ };
let result = optimizer.optimize_conversation(&mut conv, Some(&llm)).await?;

Tool Optimization

use ai_tokenopt::types::{ToolDefinition, ToolParameters, ParameterProperty};
use std::collections::HashMap;

let tools: Vec<ToolDefinition> = vec![/* your tool definitions */];

// Select and compress the most relevant tools for a query
let optimized_tools = optimizer.optimize_tools("What's the weather?", &tools);
// Returns only the tools relevant to the query, with compressed descriptions

Stream Repetition Detection

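// RepetitionState is assumed to be exported from the same module as RepetitionDetector
use ai_tokenopt::stream::repetition::{RepetitionDetector, RepetitionState};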

let mut detector = RepetitionDetector::new(
    3,    // n-gram size
    0.3,  // threshold: 30% repetition triggers detection
);

// Feed output chunks as they arrive
for chunk in stream {
    match detector.feed(&chunk) {
        RepetitionState::Normal => { /* continue */ },
        RepetitionState::Warning(ratio) => { /* elevated repetition */ },
        RepetitionState::Degenerate => { /* abort stream */ break; },
    }
}
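
For intuition about what an n-gram detector measures, the following sketch computes the share of word n-grams that repeat an earlier n-gram. It is an illustration under assumptions, not the crate's algorithm: repetition_ratio is a hypothetical helper, and the real detector may tokenise differently or use a sliding window.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of ai_tokenopt): fraction of word n-grams
/// in `text` that have already appeared earlier in the same text.
fn repetition_ratio(text: &str, n: usize) -> f64 {
    let words: Vec<&str> = text.split_whitespace().collect();
    // Too short to form a single n-gram: nothing can repeat.
    if n == 0 || words.len() < n {
        return 0.0;
    }
    let mut seen: HashSet<&[&str]> = HashSet::new();
    let mut repeated = 0usize;
    let mut total = 0usize;
    for gram in words.windows(n) {
        total += 1;
        // `insert` returns false when the n-gram was already present.
        if !seen.insert(gram) {
            repeated += 1;
        }
    }
    repeated as f64 / total as f64
}
```

A looping output such as "a b c a b c a b c" scores 4/7 with n = 3, well above a 0.3 threshold, while ordinary prose with no repeated trigrams scores 0.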

Architecture

┌─────────────────────────────────────────────────┐
│                 TokenOptimizer                   │
│  (orchestrates all optimization components)      │
├──────────┬──────────┬───────────┬───────────────┤
│  Budget  │ History  │  Prompt   │    Tools      │
│ Planner  │Compactor │ Optimizer │  Optimizer    │
│          │          │           │               │
│ allocate │ lossless │ trim to   │ select top-N  │
│ budget   │ extract  │ budget    │ compress      │
│ per-     │ LLM      │ preserve  │ schemas       │
│ component│ fallback │ identity  │               │
└──────────┴──────────┴───────────┴───────────────┘
                        │
              ┌─────────┴─────────┐
              │  Token Estimator  │
              │  (chars ÷ 4       │
              │   heuristic)      │
              └───────────────────┘

Components

| Component | Module | Purpose |
|-----------|--------|---------|
| TokenOptimizer | optimizer | Central orchestrator; impact-ordered pipeline |
| TokenBudget | budget | Allocates context window; adaptive rebalancing |
| TokenEstimator | estimator | Heuristic char÷4 counting; HF tokenizer backend |
| HistoryCompactor | history::compactor | Lossless → extractive → LLM three-tier compaction |
| deduplicate_rag_across_turns | prompt::rag_cross_turn_dedup | Decay-based cross-turn RAG dedup |
| optimize_system_prompt | prompt::system_prompt | Pressure-triggered trim + conciseness inject |
| compress_old_tool_results | tools::result_truncator | Extractive truncation of historical tool outputs |
| ToolUsageTracker | tools::progressive | Progressive schema stripping for seen tools |
| RepetitionDetector | stream::repetition | N-gram degenerate output detection |
| OptimizationMetrics | metrics | Prometheus counters and gauges |
| HfTokenEstimator | estimator_hf | HuggingFace tokenizers crate backend |
| TemplateLoader | prompt::template_loader | Runtime prompt template loading with fallback |

Configuration

All fields are optional with sensible defaults. Deserialise from TOML/JSON via serde:

[token_optimization]
enabled = true                          # Master switch (default: true)
context_window_tokens = 8192            # Match your model's num_ctx (auto-detected at startup)
response_headroom_ratio = 0.25          # Fraction reserved for LLM output (default: 0.25)
compaction_trigger_ratio = 0.70         # Compact at this usage ratio (default: 0.70)
max_summary_tokens = 256                # Rolling summary token budget (default: 256)
system_prompt_budget_ratio = 0.15       # Fraction for system prompt (default: 0.15)
rag_budget_ratio = 0.15                 # Fraction for RAG context (default: 0.15)
repetition_detection_enabled = true     # Monitor output streams (default: true)
repetition_ngram_size = 3               # N-gram size for detection (default: 3)
repetition_threshold = 0.3              # Degenerate threshold (default: 0.3)
max_tools_per_request = 8               # Max tools per LLM call (default: 8)

# v2 features (all enabled by default)
output_max_tokens = 512                 # Hard cap on recommended max_tokens (default: none)
frequency_penalty = 1.1                 # Ollama repeat_penalty (default: none)
presence_penalty = 0.6                  # Ollama presence_penalty (default: none)
progressive_tool_compression = true     # Strip seen tool schemas on repeats (default: true)
conciseness_pressure_threshold = 0.7    # Brevity directive trigger (default: 0.7)
tool_result_max_tokens = 100            # Max tokens for historical tool results (default: 100)
max_history_tokens = 4096               # Token-budget window for history (default: auto)
max_profile_prompt_tokens = 300         # Agent profile section budget (default: 300)
prompt_template_dir = "/etc/pisovereign/prompts"  # Runtime template overrides (default: none)

# HuggingFace tokenizer (requires `hf-tokenizer` feature, on by default)
tokenizer_model = "meta-llama/Llama-3.2-3B"  # Model ID or local path (default: none)

Types

The crate provides its own conversation types that work without any external dependencies:

use ai_tokenopt::types::{
    Conversation,       // Conversation with messages and optional context
    ChatMessage,        // A single message (role + content)
    MessageRole,        // User, Assistant, System, Tool
    ToolDefinition,     // Tool name, description, and parameter schema
    ToolParameters,     // JSON Schema parameters
    ParameterProperty,  // Individual parameter definition
};

These types are minimal and focused — they expose only the fields relevant to token optimization.

Feature Flags

| Feature | Default | Description |
|---------|---------|-------------|
| pisovereign | off | Zero-cost integration with PiSovereign's domain and application crates. Re-exports domain types directly and provides the TokenOptimizedInferencePort decorator. |
| hf-tokenizer | on | HuggingFace tokenizers crate backend for high-accuracy token counting. Disabling reduces compile time and binary size. |
| ollama | off | Enables HTTP-based OllamaSummarizationAdapter for LLM-assisted compaction. Requires reqwest. |

Impact-Ordered Pipeline

TokenOptimizer::optimize_conversation() executes strategies in descending impact order. High-gain, zero-latency steps run first; expensive LLM operations run only if still necessary:

1. Cross-turn RAG deduplication      ← removes verbatim repeated context blocks
2. Conciseness pressure injection     ← adds brief brevity directive to system prompt
3. Progressive tool compression       ← strips schemas for tools seen in recent turns
4. Historical tool result truncation  ← caps tool output tokens for old messages
5. System prompt trim                 ← abbreviates/removes low-priority sections
6. Extractive history compaction      ← sentence-scored summary + oldest-first prune
7. LLM summarisation fallback         ← async; only if all else insufficient

optimize_conversation_with_tools() additionally:

  ─ Tool relevance scoring & selection ← keyword-overlap ranking
  ─ Schema progressive compression     ← seen tools lose verbose descriptions

The resulting OptimizationResult includes an OptimizationPlan listing which steps actually fired and their estimated savings:

let result = optimizer.optimize_conversation(&mut conv, None).await?;
for step in &result.plan.steps {
    println!("{}: ~{} tokens saved", step.name, step.estimated_savings);
}
println!("Total savings: {}", result.plan.total_estimated_savings());
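
The ordering idea can be sketched as follows: sort candidate steps by estimated savings and apply them until the projected size fits the budget. This is a simplified illustration, not the crate's scheduler. The Step type and plan_steps function are hypothetical, and the sketch ignores latency, which the real pipeline also weighs when it defers the LLM fallback.

```rust
/// Hypothetical candidate step with its estimated token savings.
#[derive(Debug, Clone, PartialEq)]
struct Step {
    name: &'static str,
    estimated_savings: usize,
}

/// Sorts candidates by estimated savings (largest first) and keeps adding
/// steps until the projected token count fits within `budget`.
fn plan_steps(mut candidates: Vec<Step>, current_tokens: usize, budget: usize) -> Vec<Step> {
    candidates.sort_by(|a, b| b.estimated_savings.cmp(&a.estimated_savings));
    let mut remaining = current_tokens;
    let mut plan = Vec::new();
    for step in candidates {
        // Stop as soon as the conversation is projected to fit.
        if remaining <= budget {
            break;
        }
        // A step that saves nothing is not worth running.
        if step.estimated_savings == 0 {
            continue;
        }
        remaining = remaining.saturating_sub(step.estimated_savings);
        plan.push(step);
    }
    plan
}
```

Stopping early is what keeps cheap conversations cheap: when the first one or two steps already bring the total under budget, later and costlier steps never run.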

How It Works

Token Estimation

Uses the chars ÷ 4 heuristic (~85% accurate for BPE tokenizers on English text). Non-ASCII-heavy text (>30% non-ASCII bytes) uses a more conservative chars ÷ 2.5 ratio. When the hf-tokenizer feature is enabled and a tokenizer_model is configured, the HfTokenEstimator is used for precise per-token counts. Each message adds 4 overhead tokens for role markers.
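
The heuristic described above can be sketched in a few lines. The function names and the choice to round up are assumptions for illustration; the crate's own estimator may round differently.

```rust
/// Per-message overhead for role markers, as stated in the text above.
const MESSAGE_OVERHEAD_TOKENS: usize = 4;

/// Hypothetical sketch of the heuristic: chars / 4 for mostly-ASCII text,
/// chars / 2.5 when more than 30% of the bytes are non-ASCII.
/// Rounding up is an assumption, chosen so estimates err on the high side.
fn estimate_tokens(text: &str) -> usize {
    if text.is_empty() {
        return 0;
    }
    let total_bytes = text.len();
    let non_ascii_bytes = text.bytes().filter(|b| !b.is_ascii()).count();
    let chars = text.chars().count() as f64;
    let divisor = if non_ascii_bytes as f64 / total_bytes as f64 > 0.30 {
        2.5
    } else {
        4.0
    };
    (chars / divisor).ceil() as usize
}

/// Estimate for a whole message: content plus role-marker overhead.
fn estimate_message(text: &str) -> usize {
    estimate_tokens(text) + MESSAGE_OVERHEAD_TOKENS
}
```

Under these assumptions an 8-character ASCII string costs 2 tokens (6 with message overhead), while an 8-character Japanese string costs 4, reflecting the more conservative ratio.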

Budget Allocation

The context window is split across components:

┌─────────────────────────────────────────┐
│ Context Window (e.g., 8192 tokens)      │
├──────────────┬──────────────────────────┤
│ Response     │ Available Budget          │
│ Headroom     ├─────┬──────┬──────┬──────┤
│ (25%)        │ Sys │ RAG  │ Hist │Tools │
│              │ 15% │ 15%  │ rest │ est. │
└──────────────┴─────┴──────┴──────┴──────┘

Adaptive rebalancing: if the system prompt fits in less than its allocated ratio, the surplus is redistributed to history before any compaction fires.
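
A worked example of the split, using the default ratios. This sketch assumes the ratios apply to the full context window (the source does not say whether they apply to the window or to the available budget) and that the tool estimate is supplied by the caller; BudgetSplit and split_budget are hypothetical names.

```rust
/// Hypothetical result of splitting a context window (all values in tokens).
#[derive(Debug, PartialEq)]
struct BudgetSplit {
    response: usize,
    system: usize,
    rag: usize,
    tools: usize,
    history: usize,
}

/// Splits the window using the default ratios (25% response headroom,
/// 15% system prompt, 15% RAG). History receives whatever is left, so any
/// system-prompt surplus flows to history automatically.
fn split_budget(context_window: usize, system_prompt_tokens: usize, tool_tokens: usize) -> BudgetSplit {
    let response = context_window * 25 / 100;
    let system_cap = context_window * 15 / 100;
    let rag = context_window * 15 / 100;
    // Charge the system prompt only for what it uses, up to its cap.
    let system = system_prompt_tokens.min(system_cap);
    let history = context_window.saturating_sub(response + system + rag + tool_tokens);
    BudgetSplit { response, system, rag, tools: tool_tokens, history }
}
```

For an 8192-token window with a 200-token system prompt and 500 tokens of tools, history gets 4216 tokens. If the system prompt instead fills its 1228-token cap, history drops to 3188, so the 1028-token surplus is exactly what rebalancing hands back.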

Compaction Strategy

When history exceeds its budget allocation:

  1. Lossless: Collapse redundant whitespace in all messages
  2. Extractive: Remove oldest non-system messages (preserving recent turns), extract key sentences scored by role, position, recency, and information density
  3. LLM Fallback: If a SummarizationPort is provided, generate a concise summary of pruned messages

The rolling summary is stored in conversation.summary and prepended to the remaining history.
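
The first two tiers can be illustrated with a small sketch: whitespace collapsing, then oldest-first pruning that spares system messages and the most recent turns. The Role and Msg types and both functions are hypothetical stand-ins, not the crate's types, and sentence scoring of the pruned messages is omitted.

```rust
/// Hypothetical stand-ins for the crate's message types.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
struct Msg {
    role: Role,
    content: String,
}

/// Tier 1 sketch: collapse every run of whitespace into a single space.
/// Note this also flattens newlines, which a real implementation might keep.
fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Tier 2 sketch: split history into (pruned, kept). System messages and the
/// last `keep_recent` non-system messages are kept; older ones are pruned
/// and would then be fed to key-sentence extraction or summarisation.
fn prune_oldest(messages: Vec<Msg>, keep_recent: usize) -> (Vec<Msg>, Vec<Msg>) {
    let non_system = messages.iter().filter(|m| m.role != Role::System).count();
    let mut to_prune = non_system.saturating_sub(keep_recent);
    let mut pruned = Vec::new();
    let mut kept = Vec::new();
    for msg in messages {
        if msg.role != Role::System && to_prune > 0 {
            to_prune -= 1;
            pruned.push(msg);
        } else {
            kept.push(msg);
        }
    }
    (pruned, kept)
}
```

Returning the pruned messages rather than dropping them keeps the later tiers possible: the extractive or LLM step needs that text to build the rolling summary.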

Tool Selection

Tools are scored by keyword overlap between the user's query and each tool's name, description, and parameter names. Tools explicitly mentioned by name always rank highest. The top-N tools (configurable via max_tools_per_request) are returned with compressed descriptions.
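
A rough sketch of keyword-overlap scoring follows. The Tool struct, the tokenisation rule, and the size of the explicit-mention bonus are all assumptions made for illustration; the crate's scoring weights are not documented here.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified tool description.
struct Tool {
    name: &'static str,
    description: &'static str,
    params: &'static [&'static str],
}

/// Lowercased alphanumeric words; `get_weather` becomes {"get", "weather"}.
fn tokenize(s: &str) -> HashSet<String> {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Overlap between query words and the tool's name, description, and
/// parameter names, plus a large bonus when the query names the tool.
fn score(query: &str, tool: &Tool) -> usize {
    let query_words = tokenize(query);
    let mut vocab = tokenize(tool.name);
    vocab.extend(tokenize(tool.description));
    for param in tool.params {
        vocab.extend(tokenize(param));
    }
    let overlap = query_words.intersection(&vocab).count();
    let mentioned = query.to_lowercase().contains(&tool.name.to_lowercase());
    overlap + if mentioned { 1000 } else { 0 }
}

/// Returns the `n` highest-scoring tools (stable for equal scores).
fn select_top_n<'a>(query: &str, tools: &'a [Tool], n: usize) -> Vec<&'a Tool> {
    let mut scored: Vec<(usize, &'a Tool)> =
        tools.iter().map(|t| (score(query, t), t)).collect();
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.into_iter().take(n).map(|(_, t)| t).collect()
}
```

Plain word overlap is sensitive to common words such as "the", so a production scorer would likely filter stop words or weight rarer terms more heavily.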

Progressive compression reduces schema verbosity for tools that appeared in recent turns: full schema on first use, parameters only on second, name only thereafter.
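
The three-stage rule can be modelled as a per-tool usage counter. This is a sketch under assumptions: SchemaDetail and UsageTracker are hypothetical names (the crate's own type is ToolUsageTracker, whose API is not shown here), and the sketch never expires old entries, whereas the crate only considers recent turns.

```rust
use std::collections::HashMap;

/// How much of a tool's schema to send on a given appearance.
#[derive(Debug, PartialEq)]
enum SchemaDetail {
    Full,
    ParametersOnly,
    NameOnly,
}

/// Hypothetical tracker counting how often each tool has been sent.
#[derive(Default)]
struct UsageTracker {
    seen: HashMap<String, u32>,
}

impl UsageTracker {
    /// Records one more appearance of `tool` and returns the detail level
    /// for it: full schema first, parameters only second, name only after.
    fn next_detail(&mut self, tool: &str) -> SchemaDetail {
        let count = self.seen.entry(tool.to_string()).or_insert(0);
        *count += 1;
        match *count {
            1 => SchemaDetail::Full,
            2 => SchemaDetail::ParametersOnly,
            _ => SchemaDetail::NameOnly,
        }
    }
}
```

Each tool is tracked independently, so introducing a new tool mid-conversation still sends its full schema once.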

Runtime Prompt Template Overrides

The TemplateLoader resolves prompt templates at runtime from a configurable directory, falling back to the compiled-in defaults from YAML_PROMPTS:

use ai_tokenopt::TemplateLoader;

// Filesystem-first loader — set in config or pass None for compiled defaults
let loader = TemplateLoader::new(Some("/etc/pisovereign/prompts"));

// Returns Some(String) from disk, or the built-in template as fallback
if let Some(template) = loader.load("summarize") {
    // use the resolved template (disk override or built-in fallback)
}

Template files must be named <name>.prompt.txt. Place them in the directory configured via prompt_template_dir to override built-in prompts without recompiling.
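
The filesystem-first lookup with fallback might look roughly like this. It is a sketch of the pattern, not TemplateLoader's code: load_template and builtin_template are hypothetical, and the built-in text shown is a placeholder rather than the crate's real prompt.

```rust
use std::fs;
use std::path::PathBuf;

/// Placeholder for compiled-in defaults (the text here is invented).
fn builtin_template(name: &str) -> Option<&'static str> {
    match name {
        "summarize" => Some("Summarize the following conversation concisely."),
        _ => None,
    }
}

/// Looks for `<dir>/<name>.prompt.txt` first; on any read failure, or when
/// no directory is configured, falls back to the compiled-in default.
fn load_template(dir: Option<&str>, name: &str) -> Option<String> {
    if let Some(dir) = dir {
        let path = PathBuf::from(dir).join(format!("{name}.prompt.txt"));
        if let Ok(text) = fs::read_to_string(&path) {
            return Some(text);
        }
    }
    builtin_template(name).map(str::to_owned)
}
```

Treating a missing or unreadable file as "no override" means a misconfigured directory degrades to the built-in prompts instead of failing at runtime.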

Performance

  • Zero allocations for conversations within budget (fast-path bypass)
  • O(n) token estimation (single pass over message bytes)
  • Optional tokenizer — the heuristic estimator works with any model out of the box
  • Async-ready — all optimization methods are async for LLM fallback compatibility
  • Instrumented — all optimizer methods emit tracing spans for observability

Benchmarks

Run the Criterion benchmark suite:

cargo bench -p ai_tokenopt

Five benchmark groups:

| Group | Measures |
|-------|----------|
| token_estimation | Heuristic estimator throughput across text types and lengths |
| budget_allocation | Budget split across message counts (5–200 messages) |
| tool_compression | Semantic tool selection; schema compression (5–50 tools) |
| history_compaction | Full pipeline with forced compaction (10–200 messages) |
| full_pipeline | End-to-end optimize_conversation and optimize_with_tools |

Results are written to target/criterion/ with HTML reports.

License

MIT — see LICENSE for details.

Contributing

This crate is part of the PiSovereign project. Contributions welcome via pull requests.