# every-other-token

A real-time LLM token-stream interceptor for interpretability research. It sits between your application and the model, mutates tokens mid-generation, captures per-token confidence and perplexity signals, and renders results in a zero-dependency terminal or web UI.
## What it does
Aggregate benchmarks measure final outputs. every-other-token measures what happens during generation — token by token, position by position — with confidence scores, perplexity signals, and cross-provider structural comparison running simultaneously.
It directly enables four research directions:
- **Semantic fragility** — At what perturbation rate does coherent reasoning collapse?
- **Cross-provider divergence** — Do OpenAI and Anthropic produce structurally different token sequences for identical prompts?
- **System prompt sensitivity** — How much does framing shift per-token confidence distributions?
- **Chaos resilience** — Do models self-correct when every other token is randomly mutated?
## How it works
The tool intercepts the SSE (server-sent events) stream produced by the provider API. Each chunk is parsed into individual tokens. For every token a decision is made — based on a Bresenham-spread rate schedule — whether to apply the active transform. The enriched event is then routed to the terminal renderer, web UI, WebSocket collaboration room, JSON-stream output, or replay recorder simultaneously.
### Pipeline overview

```
CLI (clap) ──► main.rs
                  │
                  ▼
        TokenInterceptor (lib.rs)
             │          │
             ▼          ▼
      OpenAiPlugin   AnthropicPlugin
    (SSE + logprobs)     (SSE)
             │
             ▼  per token
    process_content_logprob()
      ├── confidence   = exp(logprob)
      ├── perplexity   = exp(-logprob)
      ├── alternatives = top_logprobs[0..N]
      └── transform    = Transform::apply(token)
             │
             ├──► web_tx   ──► SSE ──► browser (web.rs)
             ├──► collab   ──► WebSocket ──► room participants (collab.rs)
             ├──► stdout   ──► terminal renderer (render.rs)
             ├──► Recorder ──► JSON replay file (replay.rs)
             └──► HeatmapExporter ──► CSV (heatmap.rs)
```

**Research mode** (`research.rs`):

```
run_research() × N    ──► ResearchOutput (JSON)
run_research_suite()  ──► batch over prompt file
```

**Self-tune** (feature = `self-tune`):

```
TelemetryBus ──► AnomalyDetector ──► TuningController ──► SnapshotStore
```

**Self-modify** (feature = `self-modify`):

```
TaskGen ──► ValidationGate ──► Deploy ──► Memory
```
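The enrichment formulas in the pipeline diagram above are direct exponentials of the provider's log-probability. A standalone sketch (function names are illustrative, not the crate's API):

```rust
// Per-token signals derived from the provider's log-probability,
// matching the formulas shown in the pipeline diagram.
fn confidence(logprob: f64) -> f64 {
    logprob.exp() // probability the model assigned to the emitted token
}

fn perplexity(logprob: f64) -> f64 {
    (-logprob).exp() // 1 / confidence; higher means more "surprised"
}

fn main() {
    let lp = 0.9f64.ln(); // log-probability of a 90%-confident token
    assert!((confidence(lp) - 0.9).abs() < 1e-12);
    assert!((perplexity(lp) - 1.0 / 0.9).abs() < 1e-12);
    println!("ok");
}
```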
## Key modules

| Module | Responsibility |
|---|---|
| `lib.rs` | `TokenInterceptor`, `TokenEvent`, stream parsing, retry logic |
| `transforms.rs` | All transform strategies (Reverse, Noise, Chaos, Chain, …) |
| `providers.rs` | `ProviderPlugin` trait, OpenAI and Anthropic SSE wire types, MCP types |
| `web.rs` | Embedded HTTP/1.1 server, SSE fan-out, WebSocket upgrade |
| `collab.rs` | Room store, participant management, surgery edits, chat, recording |
| `research.rs` | Headless research loop, aggregate statistics, A/B mode, heatmap export |
| `store.rs` | SQLite-backed experiment persistence, cross-session dedup cache |
| `heatmap.rs` | Per-position confidence matrix → CSV |
| `replay.rs` | JSON recording and deterministic replay |
| `render.rs` | Terminal ANSI colouring, confidence indicators, visual-mode formatting |
| `config.rs` | `~/.eot.toml` / `./.eot.toml` config file with merge semantics |
| `cli.rs` | Clap argument definitions and helper functions |
| `error.rs` | `EotError` enum — one variant per failure domain |
## Feature flags

| Flag | Default | Description |
|---|---|---|
| (none) | ✓ | Terminal and web UI streaming, all transforms, research mode, collab rooms |
| `sqlite-log` | | Persist experiment runs to a local SQLite database via `store::ExperimentStore` |
| `self-tune` | | Background PID-based parameter tuning loop + telemetry bus |
| `self-modify` | | Agent loop for automated pipeline improvement (requires `self-tune`) |
| `intelligence` | | Reserved namespace for future interpretability features |
| `evolution` | | Reserved namespace for evolutionary optimisation |
| `helix-bridge` | | HTTP bridge that polls HelixRouter `/api/stats` and pushes config patches |
| `redis-backing` | | Write-through Redis persistence for agent memory and snapshots |
| `wasm` | | WASM target bindings via wasm-bindgen |
## Quickstart

### Prerequisites

- Rust 1.75 or later
- An OpenAI API key (`OPENAI_API_KEY`) and/or an Anthropic API key (`ANTHROPIC_API_KEY`)
### Basic usage

- Terminal output with per-token confidence color bands
- Web UI on http://localhost:8888 — opens browser automatically
- Headless research: 20 runs, JSON aggregate stats
- Side-by-side OpenAI vs Anthropic diff in the terminal
- A/B system-prompt experiment with significance testing
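Invocations covering these cases can be assembled from the flags documented in the CLI reference below; the prompt text and file names here are placeholders:

```sh
# Terminal output with confidence colors
every-other-token "Explain how rainbows form" --visual

# Web UI
every-other-token "Explain how rainbows form" --web

# Headless research, 20 runs
every-other-token "Explain how rainbows form" --research --runs 20 --output stats.json

# Terminal provider diff
every-other-token "Explain how rainbows form" --diff-terminal

# A/B experiment with significance testing
every-other-token "Explain how rainbows form" --research \
  --system-a "Be brief." --system-b "Think step by step." --significance
```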
### Shell completions
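For example (the shell name and output path are placeholders):

```sh
every-other-token --completions zsh > ~/.zsh/completions/_every-other-token
```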
### Dry run (no API key required)
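For example (the prompt is a placeholder); `--dry-run` validates the transform without calling any API:

```sh
every-other-token "hello world" chaos --dry-run --seed 42
```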
## Transforms

| Name | Behavior | Deterministic? |
|---|---|---|
| `reverse` | Reverses token characters: `"hello"` → `"olleh"` | Yes |
| `uppercase` | Converts to uppercase: `"hello"` → `"HELLO"` | Yes |
| `mock` | Alternating lower/upper per char: `"hello"` → `"hElLo"` | Yes |
| `noise` | Appends a random symbol from `* + ~ @ # $ %` | No (seeded with `--seed`) |
| `chaos` | Randomly selects one of the above per token | No (seeded with `--seed`) |
| `scramble` | Fisher-Yates shuffles token characters | No (seeded with `--seed`) |
| `delete` | Replaces the token with the empty string | Yes |
| `synonym` | Substitutes from a 200-entry static synonym table | Yes |
| `delay:N` | Passes through after N ms pause | Yes |
| `A,B,...` | Chain: applies A, then B, then … in sequence | Depends on chain |
**Rate control** — `--rate 0.5` (default): every other token is transformed. Uses a Bresenham spread for deterministic, uniform distribution at any rate. Combine with `--seed` for fully reproducible runs.

**Stochastic rate** — `--rate-range 0.3-0.7` picks a random rate in [min, max] per run.

**Confidence gating** — `--min-confidence 0.8` only transforms tokens whose API confidence is below 0.8. High-confidence tokens pass through unchanged.
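The Bresenham spread mentioned above can be sketched as an error accumulator (an illustration of the technique, not the crate's actual implementation): at rate r, the accumulator gains r per token and fires a transform each time it crosses 1, which spaces transformed positions as evenly as possible.

```rust
/// Decide, per token position, whether to transform, spreading
/// transformed positions evenly at any rate (Bresenham-style).
fn spread_mask(rate: f64, n_tokens: usize) -> Vec<bool> {
    let mut acc = 0.0;
    let mut mask = Vec::with_capacity(n_tokens);
    for _ in 0..n_tokens {
        acc += rate;
        if acc >= 1.0 {
            acc -= 1.0;
            mask.push(true); // transform this token
        } else {
            mask.push(false); // pass through unchanged
        }
    }
    mask
}

fn main() {
    // rate 0.5: every other token is transformed
    assert_eq!(spread_mask(0.5, 6), vec![false, true, false, true, false, true]);
    // rate 0.25: exactly one token in four fires
    let fired = spread_mask(0.25, 100).iter().filter(|&&b| b).count();
    assert_eq!(fired, 25);
    println!("ok");
}
```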
## Web UI modes

Launch with `--web` to open the single-page application:
| Mode | Description |
|---|---|
| Single | Live token stream with per-token confidence bars and perplexity pulse |
| Split | Original vs transformed side by side |
| Quad | All four transforms applied simultaneously in a 2×2 grid |
| Diff | OpenAI and Anthropic streaming the same prompt in parallel, diverging positions highlighted |
| Experiment | A/B mode: two system prompts, same user prompt, live divergence map |
| Research | Aggregate stats dashboard: perplexity histogram, confidence distribution, vocab diversity |
## Configuration file

Create `~/.eot.toml` (global) or `.eot.toml` in the working directory (local wins over global):
```toml
# Key names reconstructed from the corresponding CLI flags.
provider = "anthropic"
model = "claude-sonnet-4-6"
transform = "reverse"
rate = 0.5
port = 8888
top_logprobs = 5
system_prompt = "You are a concise assistant."
```
CLI flags override config file values. The `rate` field is clamped to [0.0, 1.0], with a warning if out of range.
## CLI reference

```
USAGE:
    every-other-token [OPTIONS] <PROMPT> [TRANSFORM] [MODEL]

ARGS:
    <PROMPT>       Input prompt (use "-" to read from stdin)
    [TRANSFORM]    Transform type [default: reverse]
    [MODEL]        Model name [default: gpt-3.5-turbo]

OPTIONS:
    --provider <PROVIDER>            openai | anthropic | mock [default: openai]
    --visual, -v                     Enable ANSI confidence-colored output
    --heatmap                        Enable token importance heatmap
    --orchestrator                   Route through MCP pipeline at localhost:3000
    --web                            Launch web UI instead of terminal
    --port <PORT>                    Web UI port [default: 8888]
    --research                       Headless research mode
    --runs <RUNS>                    Number of research runs [default: 10]
    --output <FILE>                  Research output JSON path [default: research_output.json]
    --system-a <PROMPT>              System prompt A (A/B mode)
    --system-b <PROMPT>              System prompt B (A/B mode)
    --top-logprobs <N>               Top alternative tokens per position (0–20) [default: 5]
    --db <FILE>                      SQLite database for experiment persistence
    --significance                   Compute Welch's t-test across A/B confidence distributions
    --heatmap-export <FILE>          Export per-position confidence heatmap to CSV
    --heatmap-min-confidence <F>     Minimum mean confidence for heatmap rows [default: 0.0]
    --heatmap-sort-by <FIELD>        Sort heatmap by "position" or "confidence" [default: position]
    --record <FILE>                  Record token events to JSON replay file
    --replay <FILE>                  Replay token events from file (no API call)
    --rate <F>                       Fraction of tokens to transform (0.0–1.0) [default: 0.5]
    --rate-range <MIN-MAX>           Stochastic rate from interval (e.g. "0.3-0.7")
    --seed <N>                       Fixed RNG seed for reproducible Noise/Chaos runs
    --baseline                       Compare against stored "none" transform runs in SQLite
    --prompt-file <FILE>             Batch research: one prompt per line
    --diff-terminal                  Parallel OpenAI + Anthropic streams side by side
    --json-stream                    One JSON line per token to stdout
    --dry-run                        Validate transform without calling any API
    --template <TPL>                 Prompt template with {input} placeholder
    --min-confidence <F>             Only transform tokens with confidence below this value
    --format <FMT>                   Research output format: "json" or "jsonl" [default: json]
    --collapse-window <N>            Confidence collapse detection window [default: 5]
    --orchestrator-url <URL>         MCP orchestrator base URL [default: http://localhost:3000]
    --max-retries <N>                API retry attempts on 429/5xx [default: 3]
    --completions <SHELL>            Generate shell completions (bash/zsh/fish/…)
    --log-db <FILE>                  SQLite experiment log (requires sqlite-log feature)
```
## API reference (library)

Add the crate to `Cargo.toml`:

```toml
[dependencies]
every-other-token = "4"
```
### Core types

```rust
use every_other_token::{TokenInterceptor, Provider, Transform};

// Constructor and method arguments were elided in the original snippet;
// paths reconstructed, so check the crate docs for exact signatures.
let mut interceptor = TokenInterceptor::new(/* provider, model, … */)?
    .with_rate(/* 0.0–1.0 */)
    .with_seed(/* u64 */);

interceptor.intercept_stream(/* prompt */).await?;
```
### Web UI / channel mode

```rust
use every_other_token::TokenEvent;
use tokio::sync::mpsc;

// Channel type reconstructed; the original snippet elided it.
let (tx, mut rx) = mpsc::unbounded_channel::<TokenEvent>();
interceptor.web_tx = Some(tx);
interceptor.intercept_stream(/* prompt */).await?;

while let Some(event) = rx.recv().await {
    // render or log each TokenEvent
}
```
### Experiment store

```rust
use every_other_token::store::ExperimentStore;

// Arguments were elided in the original snippet.
let store = ExperimentStore::open(/* db path */)?;
let id = store.insert_experiment(/* … */)?;
store.insert_run(/* … */)?;
```
## Performance

- Sub-millisecond per-token processing overhead (Bresenham spread, no per-token heap allocation)
- Zero-copy async streaming via Tokio with back-pressure on the broadcast channel
- ~4 MB release binary with LTO + `strip = true`
- Parallel provider streams via `tokio::select!` / `tokio::join!`
- Exponential back-off retry on 429/5xx responses (up to `--max-retries` attempts)
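An exponential back-off schedule of this shape might look like the following sketch; the base delay and cap are assumed values, not taken from the crate:

```rust
use std::time::Duration;

// Exponential back-off: base * 2^attempt, capped.
// base_ms and the 30 s cap are illustrative assumptions.
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms = 500u64;
    let scaled = base_ms.saturating_mul(1u64 << attempt.min(6));
    Duration::from_millis(scaled.min(30_000))
}

fn main() {
    assert_eq!(backoff_delay(0), Duration::from_millis(500));
    assert_eq!(backoff_delay(1), Duration::from_millis(1000));
    assert_eq!(backoff_delay(2), Duration::from_millis(2000));
    println!("ok");
}
```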
## Contributing

Contributions are welcome. Please follow these steps:

1. Fork the repository and create a feature branch from `main`.
2. Run `cargo fmt` and `cargo clippy -- -D warnings` before committing.
3. Add tests for any new public API surface. The CI gate requires all tests to pass on stable and the MSRV (1.75).
4. Open a pull request against `main` with a clear description of the change and why it is needed.
5. For significant changes, open an issue first to discuss the design.
## Development commands

```sh
# Build
cargo build

# Tests (all feature combinations)
cargo test --all-features

# Lint
cargo clippy -- -D warnings

# Format check
cargo fmt -- --check

# Docs (with warnings-as-errors)
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps

# Security audit
cargo audit

# Dependency policy check
cargo deny check

# Release build
cargo build --release
```
## Project layout

```
src/
├── lib.rs             # TokenInterceptor, TokenEvent, stream loop
├── main.rs            # CLI entry point
├── cli.rs             # Clap argument struct + helpers
├── config.rs          # .eot.toml config file support
├── error.rs           # EotError enum
├── providers.rs       # ProviderPlugin trait + wire types
├── transforms.rs      # Transform enum + all strategies
├── render.rs          # Terminal ANSI rendering helpers
├── web.rs             # Embedded HTTP/WebSocket server
├── collab.rs          # Multiplayer room management
├── research.rs        # Headless research loop + stats
├── store.rs           # SQLite experiment persistence
├── heatmap.rs         # Per-position confidence CSV export
├── replay.rs          # Token event recording + replay
├── self_tune/         # (feature: self-tune) PID tuning loop
├── self_modify/       # (feature: self-modify) Agent improvement loop
├── helix_bridge/      # (feature: helix-bridge) HelixRouter HTTP bridge
├── semantic_dedup.rs  # (feature: self-modify) In-session prompt dedup
└── experiment_log.rs  # (feature: sqlite-log) SQLite experiment logger

tests/
├── collab_tests.rs
├── providers_tests.rs
├── transforms_tests.rs
├── store_heatmap_replay_tests.rs
├── self_tune_integration.rs
└── web_integration.rs
```
## License

MIT — see LICENSE.
## Ecosystem
- tokio-prompt-orchestrator — the orchestration layer that uses Every-Other-Token telemetry to drive HelixRouter adaptation
- LLM-Hallucination-Detection-Script — companion tool for output-level reliability analysis
- Token-Visualizer — interactive tokenizer and prompt-engineering visualization tool
## Troubleshooting

**"Missing API key"** — Set the `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` environment variable before running: `export OPENAI_API_KEY=sk-...` (Linux/macOS) or `set OPENAI_API_KEY=sk-...` (Windows).

**"Model not found"** — Check the model name spelling; valid examples are `gpt-4o`, `gpt-3.5-turbo`, and `claude-sonnet-4-6`. Use `--dry-run` to test transform logic without making any API call.

**"Rate limited (429)"** — The built-in circuit breaker retries automatically after a backoff. If the circuit opens (5 consecutive failures), it stays open for 30 seconds and then resets. Wait 30 s and retry, or reduce request frequency.

**"Stream times out"** — Try a shorter prompt or a faster model. If the model is legitimately slow, raise the timeout with `--timeout 300` (seconds); the default is 120 s.

**"Web UI blank"** — Open the browser developer console (F12) and check for errors. Common causes: `--provider` does not match the API key that is set, the server is not running on the expected port, or a browser extension is blocking the SSE connection.

**"Port already in use"** — Use `--port 9000` (or any free port) to pick a different port. Find the occupying process with `lsof -i :<port>` (Linux/macOS) or `netstat -ano | findstr :<port>` (Windows).
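The breaker policy described under "Rate limited (429)" (open after 5 consecutive failures, stay open for 30 s, then reset) can be sketched as follows; this is an illustration of the policy, not the crate's code:

```rust
use std::time::{Duration, Instant};

// Minimal circuit breaker matching the described policy:
// 5 consecutive failures open the circuit; it stays open 30 s, then resets.
struct Breaker {
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl Breaker {
    fn new() -> Self {
        Breaker { consecutive_failures: 0, opened_at: None }
    }

    /// Returns true when a request may be attempted.
    fn allow(&mut self) -> bool {
        match self.opened_at {
            Some(t) if t.elapsed() < Duration::from_secs(30) => false,
            Some(_) => {
                // 30 s elapsed: reset and allow a trial request.
                self.opened_at = None;
                self.consecutive_failures = 0;
                true
            }
            None => true,
        }
    }

    /// Record the outcome of an attempt.
    fn record(&mut self, success: bool) {
        if success {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= 5 {
                self.opened_at = Some(Instant::now());
            }
        }
    }
}

fn main() {
    let mut b = Breaker::new();
    for _ in 0..5 {
        b.record(false);
    }
    assert!(!b.allow()); // circuit is open after 5 consecutive failures
    println!("ok");
}
```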