ai_tokenopt 0.5.10

Adaptive token optimization engine for LLM inference pipelines — compresses prompts, conversation history, tool schemas, and output streams to minimize token usage while preserving response quality.
# ai_tokenopt

**Full-spectrum, adaptive token optimization engine for LLM inference pipelines.**

Compresses prompts, conversation history, RAG context, tool schemas, tool results, and output
streams to minimise token usage while preserving response quality. Delivers **40–60% reduction**
in both input and output tokens across typical multi-turn conversation flows.

[![Crates.io](https://img.shields.io/crates/v/ai_tokenopt)](https://crates.io/crates/ai_tokenopt)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

---

## Overview

Large Language Models have finite context windows. As conversations grow, you face a choice:
truncate history (losing context) or exceed the budget (degrading quality). `ai_tokenopt` solves
this with a multi-strategy, impact-ordered pipeline that adapts to token pressure in real time.

### Compression tiers

| Tier | Strategy | Information Loss |
|------|----------|-----------------|
| 1 | **Lossless** — whitespace normalisation, cross-turn RAG dedup | None |
| 2 | **Extractive** — key-sentence extraction from pruned messages | Minimal |
| 3 | **LLM Fallback** — rolling summary via your LLM backend | Low (semantic) |

### Additional strategies (all active by default)

- **Output token budgeting** — dynamic `max_tokens`/`num_predict` from remaining budget
- **Tool schema compression** — shortens descriptions; progressive stripping for already-seen tools
- **Dynamic tool selection** — keyword-overlap relevance scoring, picks top-N tools per query
- **Tool result compression** — extractive truncation of historical tool outputs (~100 tokens)
- **System prompt trimming** — abbreviation and section removal under pressure
- **Conciseness injection** — brevity directive appended to system prompt at >70% pressure
- **Sampling parameters** — configurable `frequency_penalty` (sent to Ollama as `repeat_penalty`) and `presence_penalty`
- **Stream repetition detection** — terminates degenerate output loops early
- **Impact-ordered pipeline** — estimates savings per strategy, applies highest-impact first
- **Hardware auto-detection** — auto-detects Ollama context window and hardware tier at startup
- **HuggingFace tokenizer** — real BPE tokenisation with graceful heuristic fallback
- **Runtime prompt overrides** — swap compiled-in templates at runtime from a directory
- **Prometheus metrics** — `tokenopt_tokens_saved_total`, strategy usage counters, reduction ratio
- **Tracing spans** — all major operations instrumented with `#[instrument]` for Jaeger/OTLP

## Quick Start

```toml
[dependencies]
ai_tokenopt = "0.5"
```

### Basic usage

```rust
use ai_tokenopt::{TokenOptimizer, TokenOptimizationConfig};
use ai_tokenopt::types::Conversation;

let optimizer = TokenOptimizer::new(TokenOptimizationConfig::default());

let mut conv = Conversation::with_system_prompt("You are a helpful assistant.");
conv.add_user_message("What's the weather like?");
conv.add_assistant_message("It's sunny and 22°C today.");
conv.add_user_message("Should I bring an umbrella tomorrow?");

// Optimise (tiers 1–2 only; no LLM required)
let result = optimizer.optimize_conversation(&mut conv, None).await?;
println!("Tokens before: {}", result.estimate_before.total);
println!("Tokens after:  {}", result.estimate_after.total);
// Pass recommended_max_tokens directly to Ollama as `num_predict`
if let Some(max_toks) = result.recommended_max_tokens {
    println!("Recommended max_tokens: {max_toks}");
}
println!("Optimization plan: {} steps, ~{} tokens estimated savings",
    result.plan.steps.len(), result.plan.total_estimated_savings());
```

### With LLM-based summarisation (tier 3)

```rust
use ai_tokenopt::ports::SummarizationPort;
use ai_tokenopt::TokenOptError;
use async_trait::async_trait;

struct MyLlmBackend;

#[async_trait]
impl SummarizationPort for MyLlmBackend {
    async fn summarize(&self, system_prompt: &str, text: &str) -> Result<String, TokenOptError> {
        // Call your LLM here (e.g., OpenAI, Ollama, llama.cpp)
        Ok(my_llm_call(system_prompt, text).await?)
    }
}

// Pass to the optimizer for tier-3 compaction
let llm = MyLlmBackend { /* ... */ };
let result = optimizer.optimize_conversation(&mut conv, Some(&llm)).await?;
```

### Tool Optimization

```rust
use ai_tokenopt::types::{ToolDefinition, ToolParameters, ParameterProperty};
use std::collections::HashMap;

let tools: Vec<ToolDefinition> = vec![/* your tool definitions */];

// Select and compress the most relevant tools for a query
let optimized_tools = optimizer.optimize_tools("What's the weather?", &tools);
// Returns only the tools relevant to the query, with compressed descriptions
```

### Stream Repetition Detection

```rust
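// `RepetitionState` is assumed to be exported next to `RepetitionDetector`; adjust the path if it lives elsewhere
use ai_tokenopt::stream::repetition::{RepetitionDetector, RepetitionState};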

let mut detector = RepetitionDetector::new(
    3,    // n-gram size
    0.3,  // threshold: 30% repetition triggers detection
);

// Feed output chunks as they arrive
for chunk in stream {
    match detector.feed(&chunk) {
        RepetitionState::Normal => { /* continue */ },
        RepetitionState::Warning(ratio) => { /* elevated repetition */ },
        RepetitionState::Degenerate => { /* abort stream */ break; },
    }
}
```
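
To make the threshold concrete, here is a minimal, non-streaming sketch of how an n-gram repetition ratio could be computed. It is not the crate's implementation: the function name `repetition_ratio`, the whitespace word splitting, and the definition of the ratio (repeated n-grams divided by all n-grams) are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical helper: fraction of word n-grams that repeat an n-gram
/// already seen earlier in the text. Returns 0.0 for texts shorter than `n`.
fn repetition_ratio(text: &str, n: usize) -> f64 {
    let words: Vec<&str> = text.split_whitespace().collect();
    if n == 0 || words.len() < n {
        return 0.0;
    }
    let total = words.len() - n + 1;
    let mut seen: HashSet<&[&str]> = HashSet::new();
    let mut repeats = 0usize;
    for window in words.windows(n) {
        // `insert` returns false when the n-gram was already present.
        if !seen.insert(window) {
            repeats += 1;
        }
    }
    repeats as f64 / total as f64
}
```

With `n = 3`, the looping text `"a b c a b c a b c"` has 7 trigrams of which 4 are repeats, so it would cross a 0.3 threshold, while ordinary prose usually stays near zero.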

## Architecture

```text
┌─────────────────────────────────────────────────┐
│                 TokenOptimizer                   │
│  (orchestrates all optimization components)      │
├──────────┬──────────┬───────────┬───────────────┤
│  Budget  │ History  │  Prompt   │    Tools      │
│ Planner  │Compactor │ Optimizer │  Optimizer    │
│          │          │           │               │
│ allocate │ lossless │ trim to   │ select top-N  │
│ budget   │ extract  │ budget    │ compress      │
│ per-     │ LLM      │ preserve  │ schemas       │
│ component│ fallback │ identity  │               │
└──────────┴──────────┴───────────┴───────────────┘
              ┌─────────┴─────────┐
              │  Token Estimator  │
              │  (chars ÷ 4       │
              │   heuristic)      │
              └───────────────────┘
```

### Components

| Component | Module | Purpose |
|-----------|--------|---------|
| `TokenOptimizer` | `optimizer` | Central orchestrator; impact-ordered pipeline |
| `TokenBudget` | `budget` | Allocates context window; adaptive rebalancing |
| `TokenEstimator` | `estimator` | Heuristic char÷4 counting; HF tokenizer backend |
| `HistoryCompactor` | `history::compactor` | Lossless → extractive → LLM three-tier compaction |
| `deduplicate_rag_across_turns` | `prompt::rag_cross_turn_dedup` | Decay-based cross-turn RAG dedup |
| `optimize_system_prompt` | `prompt::system_prompt` | Pressure-triggered trim + conciseness inject |
| `compress_old_tool_results` | `tools::result_truncator` | Extractive truncation of historical tool outputs |
| `ToolUsageTracker` | `tools::progressive` | Progressive schema stripping for seen tools |
| `RepetitionDetector` | `stream::repetition` | N-gram degenerate output detection |
| `OptimizationMetrics` | `metrics` | Prometheus counters and gauges |
| `HfTokenEstimator` | `estimator_hf` | HuggingFace `tokenizers` crate backend |
| `TemplateLoader` | `prompt::template_loader` | Runtime prompt template loading with fallback |

## Configuration

All fields are optional with sensible defaults. Deserialise from TOML/JSON via `serde`:

```toml
[token_optimization]
enabled = true                          # Master switch (default: true)
context_window_tokens = 8192            # Match your model's num_ctx (auto-detected at startup)
response_headroom_ratio = 0.25          # Fraction reserved for LLM output (default: 0.25)
compaction_trigger_ratio = 0.70         # Compact at this usage ratio (default: 0.70)
max_summary_tokens = 256                # Rolling summary token budget (default: 256)
system_prompt_budget_ratio = 0.15       # Fraction for system prompt (default: 0.15)
rag_budget_ratio = 0.15                 # Fraction for RAG context (default: 0.15)
repetition_detection_enabled = true     # Monitor output streams (default: true)
repetition_ngram_size = 3               # N-gram size for detection (default: 3)
repetition_threshold = 0.3              # Degenerate threshold (default: 0.3)
max_tools_per_request = 8               # Max tools per LLM call (default: 8)

# v2 features (all enabled by default)
output_max_tokens = 512                 # Hard cap on recommended max_tokens (default: none)
frequency_penalty = 1.1                 # Ollama repeat_penalty (default: none)
presence_penalty = 0.6                  # Ollama presence_penalty (default: none)
progressive_tool_compression = true     # Strip seen tool schemas on repeats (default: true)
conciseness_pressure_threshold = 0.7    # Brevity directive trigger (default: 0.7)
tool_result_max_tokens = 100            # Max tokens for historical tool results (default: 100)
max_history_tokens = 4096               # Token-budget window for history (default: auto)
max_profile_prompt_tokens = 300         # Agent profile section budget (default: 300)
prompt_template_dir = "/etc/pisovereign/prompts"  # Runtime template overrides (default: none)

# HuggingFace tokenizer (requires `hf-tokenizer` feature, on by default)
tokenizer_model = "meta-llama/Llama-3.2-3B"  # Model ID or local path (default: none)
```

## Types

The crate provides its own conversation types that work without any external dependencies:

```rust
use ai_tokenopt::types::{
    Conversation,       // Conversation with messages and optional context
    ChatMessage,        // A single message (role + content)
    MessageRole,        // User, Assistant, System, Tool
    ToolDefinition,     // Tool name, description, and parameter schema
    ToolParameters,     // JSON Schema parameters
    ParameterProperty,  // Individual parameter definition
};
```

These types are minimal and focused — they expose only the fields relevant to token optimization.

## Feature Flags

| Feature | Default | Description |
|---------|---------|-------------|
| `pisovereign` | off | Zero-cost integration with PiSovereign's `domain` and `application` crates. Re-exports domain types directly and provides the `TokenOptimizedInferencePort` decorator. |
| `hf-tokenizer` | on | HuggingFace `tokenizers` crate backend for high-accuracy token counting. Disabling reduces compile time and binary size. |
| `ollama` | off | Enables HTTP-based `OllamaSummarizationAdapter` for LLM-assisted compaction. Requires `reqwest`. |

## Impact-Ordered Pipeline

`TokenOptimizer::optimize_conversation()` executes strategies in descending impact order. High-gain, zero-latency steps run first; expensive LLM operations run only if still necessary:

```text
1. Cross-turn RAG deduplication      ← removes verbatim repeated context blocks
2. Conciseness pressure injection     ← adds brief brevity directive to system prompt
3. Progressive tool compression       ← strips schemas for tools seen in recent turns
4. Historical tool result truncation  ← caps tool output tokens for old messages
5. System prompt trim                 ← abbreviates/removes low-priority sections
6. Extractive history compaction      ← sentence-scored summary + oldest-first prune
7. LLM summarisation fallback         ← async; only if all else insufficient
```

`optimize_conversation_with_tools()` additionally:

```text
  ─ Tool relevance scoring & selection ← keyword-overlap ranking
  ─ Schema progressive compression     ← seen tools lose verbose descriptions
```

The resulting `OptimizationResult` includes an `OptimizationPlan` listing which steps
actually fired and their estimated savings:

```rust
let result = optimizer.optimize_conversation(&mut conv, None).await?;
for step in &result.plan.steps {
    println!("{}: ~{} tokens saved", step.name, step.estimated_savings);
}
println!("Total savings: {}", result.plan.total_estimated_savings());
```
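
The ordering idea can be sketched without the crate. The snippet below is an illustration only: `PlannedStep` and `plan_steps` are hypothetical names, and the assumption is that steps are sorted by estimated savings and applied until the projected token count fits the budget.

```rust
/// Hypothetical stand-in for one entry of an optimization plan.
#[derive(Debug, Clone, PartialEq)]
struct PlannedStep {
    name: &'static str,
    estimated_savings: usize,
}

/// Selects steps highest-estimated-savings first and stops as soon as the
/// projected token count fits within `budget_tokens`.
fn plan_steps(
    mut candidates: Vec<PlannedStep>,
    current_tokens: usize,
    budget_tokens: usize,
) -> Vec<PlannedStep> {
    // Stable sort: steps with equal savings keep their original order.
    candidates.sort_by(|a, b| b.estimated_savings.cmp(&a.estimated_savings));
    let mut projected = current_tokens;
    let mut plan = Vec::new();
    for step in candidates {
        if projected <= budget_tokens {
            break; // already within budget, skip the remaining steps
        }
        if step.estimated_savings == 0 {
            continue; // a step that saves nothing is not worth running
        }
        projected = projected.saturating_sub(step.estimated_savings);
        plan.push(step);
    }
    plan
}
```

For a 1700-token conversation with a 1000-token budget and candidate savings of 300, 500, and 50 tokens, this sketch picks the 500-token and 300-token steps and skips the third.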

## How It Works

### Token Estimation

The default estimator uses the `chars ÷ 4` heuristic (~85% accurate for BPE tokenizers on English text). Text that is heavy in non-ASCII characters (>30% non-ASCII bytes) uses a more conservative `chars ÷ 2.5` ratio. When the `hf-tokenizer` feature is enabled and a `tokenizer_model` is configured, `HfTokenEstimator` counts tokens with the real tokenizer instead. Each message adds 4 overhead tokens for role markers.
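
As a rough illustration of the heuristic, the following standalone sketch applies the ratios described above. The function names and the choice to round up are assumptions; the crate's own estimator may differ in detail.

```rust
/// Overhead tokens added per message for role markers (value from the text above).
const ROLE_OVERHEAD_TOKENS: usize = 4;

/// Hypothetical heuristic estimate for a piece of text.
/// Assumption: the result is rounded up to the next whole token.
fn estimate_text_tokens(text: &str) -> usize {
    if text.is_empty() {
        return 0;
    }
    // Share of UTF-8 bytes that are outside the ASCII range.
    let non_ascii_bytes = text.bytes().filter(|b| !b.is_ascii()).count();
    let non_ascii_ratio = non_ascii_bytes as f64 / text.len() as f64;
    // More conservative divisor for non-ASCII-heavy text.
    let chars_per_token = if non_ascii_ratio > 0.30 { 2.5 } else { 4.0 };
    let chars = text.chars().count() as f64;
    (chars / chars_per_token).ceil() as usize
}

/// Hypothetical per-message estimate: content tokens plus role overhead.
fn estimate_message_tokens(content: &str) -> usize {
    estimate_text_tokens(content) + ROLE_OVERHEAD_TOKENS
}
```

Under these assumptions an 11-character English string such as `"hello world"` is estimated at 3 tokens, or 7 tokens once the per-message overhead is included.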

### Budget Allocation

The context window is split across components:

```text
┌─────────────────────────────────────────┐
│ Context Window (e.g., 8192 tokens)      │
├──────────────┬──────────────────────────┤
│ Response     │ Available Budget          │
│ Headroom     ├─────┬──────┬──────┬──────┤
│ (25%)        │ Sys │ RAG  │ Hist │Tools │
│              │ 15% │ 15%  │ rest │ est. │
└──────────────┴─────┴──────┴──────┴──────┘
```

Adaptive rebalancing: if the system prompt fits in less than its allocated ratio, the
surplus is redistributed to history before any compaction fires.
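
The arithmetic can be sketched as follows. This is an illustration under stated assumptions: the system prompt and RAG ratios are taken relative to the budget left after response headroom (the diagram suggests this but does not state it), the tools share is ignored, and `BudgetSplit` and `split_budget` are hypothetical names.

```rust
/// Hypothetical result of splitting a context window.
#[derive(Debug, PartialEq)]
struct BudgetSplit {
    response_headroom: usize,
    system_prompt: usize,
    rag: usize,
    history: usize,
}

/// Splits `context_window` tokens. Any part of the system prompt allocation
/// that `actual_system_tokens` does not use goes to history.
fn split_budget(
    context_window: usize,
    headroom_ratio: f64,
    system_ratio: f64,
    rag_ratio: f64,
    actual_system_tokens: usize,
) -> BudgetSplit {
    let response_headroom = (context_window as f64 * headroom_ratio) as usize;
    let available = context_window.saturating_sub(response_headroom);
    let system_alloc = (available as f64 * system_ratio) as usize;
    let rag = (available as f64 * rag_ratio) as usize;
    // Adaptive rebalancing: the system prompt only keeps what it needs.
    let system_prompt = actual_system_tokens.min(system_alloc);
    let history = available.saturating_sub(system_prompt + rag);
    BudgetSplit { response_headroom, system_prompt, rag, history }
}
```

With the defaults and an 8192-token window, headroom is 2048 tokens and 6144 remain. A 200-token system prompt then leaves 921 tokens for RAG and 5023 for history, compared with 4302 for history if the system prompt used its full 921-token allocation.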

### Compaction Strategy

When history exceeds its budget allocation:

1. **Lossless**: Collapse redundant whitespace in all messages
2. **Extractive**: Remove oldest non-system messages (preserving recent turns), extract key sentences scored by role, position, recency, and information density
3. **LLM Fallback**: If a `SummarizationPort` is provided, generate a concise summary of pruned messages

The rolling summary is stored in `conversation.summary` and prepended to the remaining history.
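
The first two tiers can be approximated in a few lines. This sketch is not the crate's code: messages are plain strings, token counts use a simple rounded-up `chars ÷ 4` estimate, sentence scoring is omitted, and `collapse_whitespace` and `prune_oldest` are hypothetical helpers.

```rust
/// Lossless tier (simplified): collapse runs of whitespace to single spaces.
/// Note this also folds newlines, which a real implementation may preserve.
fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Rough token count for a list of messages (chars ÷ 4, rounded up).
fn total_tokens(messages: &[String]) -> usize {
    messages.iter().map(|m| (m.chars().count() + 3) / 4).sum()
}

/// Extractive tier (simplified): remove the oldest messages until the history
/// fits `budget_tokens`, always keeping the `keep_recent` newest messages.
/// Returns the pruned messages so they can be summarised afterwards.
fn prune_oldest(
    messages: &mut Vec<String>,
    budget_tokens: usize,
    keep_recent: usize,
) -> Vec<String> {
    let mut pruned = Vec::new();
    while messages.len() > keep_recent && total_tokens(messages.as_slice()) > budget_tokens {
        pruned.push(messages.remove(0));
    }
    pruned
}
```

The pruned messages returned here are what a key-sentence extractor or a `SummarizationPort` implementation would condense into the rolling summary.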

### Tool Selection

Tools are scored by keyword overlap between the user's query and each tool's name, description, and parameter names. Tools explicitly mentioned by name always rank highest. The top-N tools (configurable via `max_tools_per_request`) are returned with compressed descriptions.

Progressive compression reduces schema verbosity for tools that appeared in recent turns:
full schema on first use, parameters only on second, name only thereafter.
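
A simplified version of this scoring might look like the sketch below. The `Tool` struct, the three-character minimum word length, and the fixed bonus of 100 for an explicit name mention are all assumptions for illustration; parameter names are left out for brevity.

```rust
use std::collections::HashSet;

/// Hypothetical minimal tool description.
struct Tool {
    name: &'static str,
    description: &'static str,
}

/// Lowercased words of at least three bytes, split on non-alphanumerics.
fn keywords(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| w.len() > 2)
        .map(|w| w.to_lowercase())
        .collect()
}

/// Keyword overlap plus a large bonus when the query names the tool.
fn score(query: &str, tool: &Tool) -> usize {
    let query_words = keywords(query);
    let tool_words = keywords(&format!("{} {}", tool.name, tool.description));
    let overlap = query_words.intersection(&tool_words).count();
    let named = query.to_lowercase().contains(&tool.name.to_lowercase());
    overlap + if named { 100 } else { 0 }
}

/// Returns up to `n` tools with a non-zero score, best first.
fn select_top_n<'a>(query: &str, tools: &'a [Tool], n: usize) -> Vec<&'a Tool> {
    let mut scored: Vec<(usize, &'a Tool)> = tools
        .iter()
        .map(|t| (score(query, t), t))
        .filter(|(s, _)| *s > 0)
        .collect();
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.into_iter().take(n).map(|(_, t)| t).collect()
}

/// Schema detail level by number of earlier uses, as described above.
#[derive(Debug, PartialEq)]
enum SchemaDetail {
    Full,
    ParametersOnly,
    NameOnly,
}

fn schema_detail(prior_uses: usize) -> SchemaDetail {
    match prior_uses {
        0 => SchemaDetail::Full,
        1 => SchemaDetail::ParametersOnly,
        _ => SchemaDetail::NameOnly,
    }
}
```

Plain keyword overlap also counts common words such as "the" and "for", so a production scorer would likely filter stop words or weight rarer terms more heavily.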

## Runtime Prompt Template Overrides

The `TemplateLoader` resolves prompt templates at runtime from a configurable directory,
falling back to the compiled-in defaults from `YAML_PROMPTS`:

```rust
use ai_tokenopt::TemplateLoader;

// Filesystem-first loader — set in config or pass None for compiled defaults
let loader = TemplateLoader::new(Some("/etc/pisovereign/prompts"));

// Returns Some(String): the file from disk if present, otherwise the built-in template
if let Some(template) = loader.load("summarize") {
    // use the resolved template (custom override or built-in default)
}
```

Template files must be named `<name>.prompt.txt`. Place them in the directory configured
via `prompt_template_dir` to override built-in prompts without recompiling.
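
The filesystem-first lookup can be sketched with the standard library alone. This is an approximation, not the crate's `TemplateLoader`: `load_template` and `builtin_template` are hypothetical, and the built-in text shown is a placeholder rather than the real compiled-in prompt.

```rust
use std::fs;
use std::path::Path;

/// Placeholder for compiled-in defaults (the real ones come from `YAML_PROMPTS`).
fn builtin_template(name: &str) -> Option<&'static str> {
    match name {
        "summarize" => Some("Summarize the following conversation concisely."),
        _ => None,
    }
}

/// Looks for `<dir>/<name>.prompt.txt` first and falls back to the built-in
/// template when no directory is configured or the file cannot be read.
fn load_template(dir: Option<&Path>, name: &str) -> Option<String> {
    if let Some(dir) = dir {
        let path = dir.join(format!("{name}.prompt.txt"));
        if let Ok(contents) = fs::read_to_string(&path) {
            return Some(contents);
        }
    }
    builtin_template(name).map(str::to_owned)
}
```

In this sketch a missing or unreadable file is treated the same as no override, so a typo in the directory path silently falls back to the defaults; logging the fallback would make such mistakes easier to spot.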

## Performance

- **Zero allocations** for conversations within budget (fast-path bypass)
- **O(n)** token estimation (single pass over message bytes)
- **Optional tokenizer** — the heuristic estimator works with any model out of the box
- **Async-ready** — conversation optimization methods are `async` for LLM fallback compatibility
- **Instrumented** — all optimizer methods emit `tracing` spans for observability

## Benchmarks

Run the Criterion benchmark suite:

```bash
cargo bench -p ai_tokenopt
```

Five benchmark groups:

| Group | Measures |
|-------|----------|
| `token_estimation` | Heuristic estimator throughput across text types and lengths |
| `budget_allocation` | Budget split across message counts (5–200 messages) |
| `tool_compression` | Semantic tool selection; schema compression (5–50 tools) |
| `history_compaction` | Full pipeline with forced compaction (10–200 messages) |
| `full_pipeline` | End-to-end `optimize_conversation` and `optimize_conversation_with_tools` |

Results are written to `target/criterion/` with HTML reports.

## License

MIT — see [LICENSE](../../LICENSE) for details.

## Contributing

This crate is part of the [PiSovereign](https://github.com/andreasreichel/PiSovereign) project. Contributions welcome via pull requests.