ogham-server 0.3.0

Embeddable HTTP server for the Ogham context engineering SDK

Ogham


Ogham — the early-Irish script that carved language into compact notches. This SDK does the same to your LLM context.

Ogham is a pure-Rust SDK for LLM context engineering: it compresses, prunes, and budgets conversation context before it reaches the model — entirely in-process. Agentic workloads routinely burn 60–90% of their tokens on stale tool output; Ogham reclaims them without losing the things that matter (errors, decisions, file paths), and everything it removes stays retrievable.

No subprocesses. No network calls. No background tasks. Every operation is a deterministic library call.

What it does

| Capability | In one line |
| --- | --- |
| Content compression | detects JSON / logs / code / diffs / prose and routes each to a purpose-built compressor |
| Reversible by default | originals live in a CCR store (in-memory, SQLite, fjall); compressed text carries <<ccr:HASH>> markers the model can dereference |
| Agent-aware rules | clears stale successful tool results to one-line stubs; never touches errors, system prompts, the latest user query, or pinned messages |
| Token budgets | a degradation cascade (compress → summarize → drop) that fits any history into a limit, or refuses with BudgetExceeded rather than overflow |
| Honest token counting | exact for OpenAI encodings (tiktoken feature); calibrated estimates with an explicit safety margin elsewhere. Claude tokenizers aren't public and Ogham doesn't pretend otherwise |
| Structured summaries | deterministic extraction of files / decisions / errors / next-steps, plus a Summarizer trait for LLM-backed memory |
| Prompt-cache support | byte-stable prefixes + provider breakpoint annotation for KV-cache reuse |
| Observable | every decision streams through Observer/Metrics traits: auditable, testable |
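
To make the "detect, then route" row of the capability list above concrete, here is a minimal sketch of heuristic content detection. It is not Ogham's detector: `ContentKind`, `detect`, and `route` are hypothetical names, the rules are deliberately crude, and the "code" kind is left out for brevity.

```rust
/// Content kinds for this sketch (hypothetical; "code" is omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContentKind {
    Json,
    Diff,
    Log,
    Prose,
}

/// Cheap, deterministic detection. Assumed heuristics, not Ogham's rules.
pub fn detect(text: &str) -> ContentKind {
    let trimmed = text.trim();

    // JSON: the whole payload is wrapped in matching brackets.
    let looks_json = (trimmed.starts_with('{') && trimmed.ends_with('}'))
        || (trimmed.starts_with('[') && trimmed.ends_with(']'));
    if looks_json {
        return ContentKind::Json;
    }

    // Diff: a git header or a unified-diff hunk marker starts some line.
    if trimmed
        .lines()
        .any(|l| l.starts_with("@@ ") || l.starts_with("diff --git "))
    {
        return ContentKind::Diff;
    }

    // Log: more than half of the lines mention a log level.
    let lines: Vec<&str> = trimmed.lines().collect();
    let levelled = lines
        .iter()
        .filter(|l| {
            ["ERROR", "WARN", "INFO", "DEBUG"]
                .iter()
                .any(|lvl| l.contains(*lvl))
        })
        .count();
    if !lines.is_empty() && levelled * 2 > lines.len() {
        return ContentKind::Log;
    }

    // Everything else falls through to the prose path.
    ContentKind::Prose
}

/// Route each kind to the name of a (hypothetical) compressor.
pub fn route(text: &str) -> &'static str {
    match detect(text) {
        ContentKind::Json => "json-compressor",
        ContentKind::Diff => "diff-compressor",
        ContentKind::Log => "log-compressor",
        ContentKind::Prose => "prose-compressor",
    }
}
```

The checks run in a fixed order, so the same input always takes the same route, which is the property the determinism guarantee relies on.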

Install

[dependencies]
ogham = "0.3"
# exact OpenAI token counting:
# ogham = { version = "0.3", features = ["tiktoken"] }

Prebuilt ogham-server binaries (Linux, macOS universal, Windows; sha256 checksums) are on the releases page.

Sixty seconds

Compress a tool result (content level):

use ogham::{compress_messages, CompressConfig, Message};

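// Runs inside an async fn that returns a Result; `huge_json_or_log_output`
// stands for the raw tool output you want compressed.
let out = compress_messages(
    vec![Message::new("tool", huge_json_or_log_output)],
    CompressConfig::default(),
).await?;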
println!("{} -> {} tokens ({:.0}% saved)",
    out.stats.original_tokens, out.stats.compressed_tokens,
    (1.0 - out.stats.ratio) * 100.0);

Fit a whole conversation into a budget (conversation level):

use std::sync::Arc;
use ogham::agent::{apply_agent_compression, AgentPolicy};
use ogham::budget::{enforce_budget, ContextBudget};
use ogham::ccr::in_memory::InMemoryCcrStore;
use ogham::pipeline::DefaultCompressionPipeline;
use ogham::counter_for_model;

let ccr = Arc::new(InMemoryCcrStore::new());
let pipeline = DefaultCompressionPipeline::with_ccr_store(ccr.clone());
let policy = AgentPolicy::default();

// 1. clear stale tool results to retrievable CCR stubs
apply_agent_compression(&mut messages, &policy, Some(ccr.clone())).await?;

// 2. guarantee the prompt fits — or get Err(BudgetExceeded), never an oversized send
enforce_budget(
    &mut messages,
    &ContextBudget { total_limit: 180_000, safety_margin: None },
    counter_for_model("claude-fable-5").as_ref(),
    &pipeline, &policy, Some(ccr),
).await?;

A cleared tool result looks like this — and the original is one ccr.retrieve(hash) away:

[tool:file_read] result cleared (512 tokens) — original retrievable via <<ccr:86a33abc...>>
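
The stub-and-retrieve idea can be sketched with the standard library alone. This is an illustration, not Ogham's CCR implementation: `SketchStore`, `clear_tool_result`, and `marker_key` are hypothetical names, `DefaultHasher` stands in for whatever hash the real store uses (a production store would want a cryptographic one), and a whitespace word count stands in for a real token counter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory store keyed by a hash of the original text.
#[derive(Default)]
pub struct SketchStore {
    originals: HashMap<String, String>,
}

impl SketchStore {
    /// Keep the original and return its content-derived key.
    pub fn put(&mut self, original: &str) -> String {
        let mut hasher = DefaultHasher::new();
        original.hash(&mut hasher);
        let key = format!("{:016x}", hasher.finish());
        self.originals.insert(key.clone(), original.to_string());
        key
    }

    /// Look the original up again by key.
    pub fn retrieve(&self, key: &str) -> Option<&str> {
        self.originals.get(key).map(String::as_str)
    }
}

/// Replace a tool result with a one-line stub that carries a marker.
pub fn clear_tool_result(store: &mut SketchStore, tool: &str, body: &str) -> String {
    let key = store.put(body);
    // Assumption: word count as a rough stand-in for tokens.
    let approx_tokens = body.split_whitespace().count();
    format!(
        "[tool:{}] result cleared ({} tokens), original retrievable via <<ccr:{}>>",
        tool, approx_tokens, key
    )
}

/// Extract the key from the first <<ccr:...>> marker, if there is one.
pub fn marker_key(text: &str) -> Option<&str> {
    let start = text.find("<<ccr:")? + "<<ccr:".len();
    let end = text[start..].find(">>")? + start;
    Some(&text[start..end])
}
```

Calling `clear_tool_result`, then `marker_key` on the stub, then `retrieve` on that key returns the original body, so nothing the stub replaces is lost.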

Guarantees

  1. Fail-closed. Any internal error returns the original content unchanged. A budget that can't be met is an error, not a truncated prompt.
  2. Deterministic. Same input + same config ⇒ byte-identical output. No clocks, no randomness.
  3. Protected content. Tool errors, system instructions, and the latest user query survive every pass verbatim — enforced by tests, not convention.
  4. Measured, not fabricated. Stats reflect events that actually fired; estimated token counts are labelled as estimates.
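
Guarantees 1 and 3 can be illustrated with a simplified two-stage cascade (shorten, then drop) that either fits the limit or returns an error, and never touches protected messages. Everything here is an assumption made for the sketch: `Msg`, `fit`, `estimate_tokens`, the chars/4 estimate, and this `BudgetExceeded` struct are hypothetical and unrelated to Ogham's actual types.

```rust
use std::fmt;

/// Error returned when even the last stage cannot meet the limit.
#[derive(Debug, PartialEq)]
pub struct BudgetExceeded {
    pub needed: usize,
    pub limit: usize,
}

impl fmt::Display for BudgetExceeded {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "need {} tokens, limit is {}", self.needed, self.limit)
    }
}

impl std::error::Error for BudgetExceeded {}

#[derive(Debug, Clone, PartialEq)]
pub struct Msg {
    pub protected: bool,
    pub text: String,
}

/// Rough estimate: ceil(chars / 4), inflated by a safety margin in percent.
pub fn estimate_tokens(text: &str, margin_percent: usize) -> usize {
    let base = (text.chars().count() + 3) / 4;
    (base * (100 + margin_percent) + 99) / 100
}

fn total(msgs: &[Msg], margin: usize) -> usize {
    msgs.iter().map(|m| estimate_tokens(&m.text, margin)).sum()
}

/// Works on a copy, so on Err the caller's messages are unchanged.
pub fn fit(msgs: &[Msg], limit: usize, margin: usize) -> Result<Vec<Msg>, BudgetExceeded> {
    let mut out = msgs.to_vec();

    // Stage 1: shorten unprotected messages to a stub, oldest first.
    for i in 0..out.len() {
        if total(&out, margin) <= limit {
            return Ok(out);
        }
        if !out[i].protected {
            out[i].text = "[cleared]".to_string();
        }
    }

    // Stage 2: drop unprotected messages, oldest first.
    while total(&out, margin) > limit {
        match out.iter().position(|m| !m.protected) {
            Some(i) => {
                out.remove(i);
            }
            // Only protected content is left: refuse instead of truncating.
            None => {
                return Err(BudgetExceeded {
                    needed: total(&out, margin),
                    limit,
                })
            }
        }
    }
    Ok(out)
}
```

Because `fit` borrows its input and returns a new vector, a failed call leaves the original history exactly as it was, which is the fail-closed behaviour described in guarantee 1.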

Documentation

| Doc | Contents |
| --- | --- |
| docs/architecture.md | crate layering, data flow, invariants, design decisions |
| docs/compression.md | detection, the six compressors, CCR stores, cache alignment |
| docs/agent-context.md | classification, tool clearing, budgets, summaries, cache strategy |
| docs/server.md | the optional HTTP server: env vars, endpoints, embedding |
| docs/integration.md | wiring Ogham into an agent runtime, end to end |
| DESIGN.md | the full engineering record: build plan, work packages, ADRs |
| CONTRIBUTING.md · RELEASING.md · CHANGELOG.md | process |

API reference: docs.rs/ogham, or locally via cargo doc --workspace --no-deps --open.

Crates

| Crate | What | When to depend on it |
| --- | --- | --- |
| ogham | the SDK: compressors, CCR, pipeline, agent rules, budgets, summaries | almost always |
| ogham-core | traits + types only (5 small deps, no runtime) | implementing custom compressors/observers |
| ogham-server | embeddable Axum HTTP server + standalone binary | non-Rust clients, standalone deployment |

Status

0.3.0: APIs may change before 1.0, and minor-version releases may contain breaking changes. The test suite covers fail-closed behavior, determinism, fuzzing, golden-file regression, and needle-in-haystack survival probes; see CONTRIBUTING.md for the gate every change must pass.

License & attribution

Apache-2.0 — see LICENSE.

Ogham's architecture is substantially derived from Headroom (Apache-2.0): the CCR reversible-storage concept, content-type routing, and the SmartCrusher / LogStripper compressor designs originate there. The agent-aware rules, token budgets, tiered summaries, and cache integration are informed by published research from Anthropic, Factory.ai, and Microsoft (LLMLingua). Full attribution in NOTICE.