harness

Agent = Model + Harness. This is the Harness — the scaffolding that turns an LLM into an autonomous agent. Any domain: research, ops, assistants, data work, coding.

A Rust framework for production agents, built on the harness engineering discipline (Böckeler/Thoughtworks, Lopopolo/OpenAI, 2026). Compile-time type-safe, deterministic-first, observable, governance built in. Full rationale in DESIGN.md.

What you get

Layer	What	Crate
Models	3 protocol families (OpenAI-compat · Anthropic · Gemini), one `ApiKind::build(url, model, key)`	`harness-models`
Tools	fs · shell (risk-gated) · web search/fetch	`harness-tools-*`
Loop	ReAct + tool dispatch + sensor feedback + auto-fix	`harness-loop`
Loop engineering	recurring loops: maturity levels L1/L2/L3, human gates, action executors, token budgets	`harness-loop-engine`
Orchestration	async Run = concurrent Job DAG + retry/backoff + dynamic replanning + resumable state	`harness-orchestrator`
Learning	record episodes (situation → tools used → outcome) + semantic recall · CortexDB-backed `Memory`	`harness-experience`, `harness-cortexdb`
Skills · Guides · Hooks · Sensors	proc-macro registered, agentskills.io-compliant	`harness-macros`, `harness-skills`
Memory · Recall	`Memory` trait + JSONL store · cross-session search (FTS5 / CJK)	`harness-core`, `harness-recall-sqlite`
Scheduler · MCP · Sandbox · CLI	recurring jobs · MCP server+client · OS-native sandbox (macOS Seatbelt) · git-worktree / Docker · `harness code` / `run` / `sched` / `new` / `mcp serve`	—

Quick start

use harness_loop::AgentLoop;
use harness_models::ApiKind;
use harness_tools_fs::{ListDir, ReadFile};
use harness_context::default_world;
use harness_core::Task;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), harness::HarnessError> {
    // One model API: protocol family + base_url + model + key. No hardcoded URLs.
    let model = ApiKind::OpenAI.build("https://api.deepseek.com", "deepseek-chat",
        std::env::var("DEEPSEEK_API_KEY").expect("DEEPSEEK_API_KEY must be set"));
    let mut world = default_world(".");
    let outcome = AgentLoop::new(model)
        .with_tool(Arc::new(ReadFile))
        .with_tool(Arc::new(ListDir))
        .run(Task { description: "What is the workspace name?".into(),
                    source: None, deadline: None }, &mut world)
        .await?;
    println!("{outcome:?}");
    Ok(())
}

Register tools/skills/guides/sensors/hooks with #[harness::tool] / #[skill] / #[guide] / #[sensor] / #[hook] — they auto-register via inventory. Scaffold a new project with harness new.

Composable layers

harness-loop runs one agent (ReAct: think → call tools → observe).
harness-loop-engine governs a recurring loop that earns autonomy in stages: L1 report → L2 assisted (a human gates every change) → L3 unattended (allowlisted actions only). The loop runs under a token budget, and an ActionExecutor performs the side effect only after a verified approval.
harness-orchestrator fans one goal across many concurrent, dependent Jobs (a DAG) with retry/backoff, a run budget, crash-resumable state, and dynamic replanning (a Planner mutates the DAG mid-run from results).
harness-experience makes an agent learn: it records each run as an episode (situation → tools used → outcome) and recalls similar past episodes to guide the next run. Pair with harness-cortexdb (a CortexDB-backed Memory) for semantic recall over a brain shared with Claude Code / Codex.

use harness_orchestrator::{Dag, Job, Orchestrator, Run, SubagentJobRunner};
use std::sync::Arc;

// Runs inside an async context; `model` is built as in Quick start.
// notion/airtable/coda run concurrently; `compare` waits for all three.
let dag = Dag::from_jobs([
    Job::new("notion", "what is Notion best at?"),
    Job::new("airtable", "what is Airtable best at?"),
    Job::new("coda", "what is Coda best at?"),
    Job::new("compare", "compare them").with_deps(["notion", "airtable", "coda"]),
]);
let report = Orchestrator::new(Arc::new(SubagentJobRunner::new(model, ".")))
    .run(Run::new("run-1", "compare tools", dag)).await;
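
To make the "concurrent, dependent Jobs" idea concrete, here is a minimal std-only sketch of grouping a dependency DAG into waves of jobs that could run at the same time. It does not use harness-orchestrator: `plan_waves` is a hypothetical helper written for illustration, and the crate's real scheduling, retry/backoff, and replanning logic may work differently.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Groups job ids into "waves": every job in a wave has all of its
/// dependencies satisfied by earlier waves, so a wave could run concurrently.
/// Returns Err with the ids left unscheduled when a cycle or a dependency on
/// an unknown job makes progress impossible.
/// NOTE: hypothetical helper, not part of harness-orchestrator.
pub fn plan_waves(jobs: &[(&str, Vec<&str>)]) -> Result<Vec<Vec<String>>, Vec<String>> {
    // Unmet dependencies per job; BTree collections keep the output order stable.
    let mut pending: BTreeMap<String, BTreeSet<String>> = jobs
        .iter()
        .map(|(id, deps)| (id.to_string(), deps.iter().map(|d| d.to_string()).collect()))
        .collect();

    let mut waves = Vec::new();
    while !pending.is_empty() {
        // Jobs with no unmet dependencies are ready now.
        let ready: Vec<String> = pending
            .iter()
            .filter(|(_, deps)| deps.is_empty())
            .map(|(id, _)| id.clone())
            .collect();
        if ready.is_empty() {
            // Nothing can start: a cycle or a dependency on a missing job.
            return Err(pending.keys().cloned().collect());
        }
        for id in &ready {
            pending.remove(id);
        }
        // Treat the finished wave as satisfied for every job still waiting.
        for deps in pending.values_mut() {
            for id in &ready {
                deps.remove(id);
            }
        }
        waves.push(ready);
    }
    Ok(waves)
}
```

For the four jobs in the example above, this yields two waves: airtable, coda, and notion (sorted by id), then compare.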

Examples

See examples/ — memory, recall, the scheduler, MCP, experience-cortexdb (the learning layer over a CortexDB brain), cap (a coding agent reimplementing oh-my-pi's hashline editing — content-hash line anchors instead of line numbers), and two end-to-end agents over a live PostgreSQL database: ecommerce-analyst (concurrent analysis DAG) and ecommerce-ops-agent (the full stack — dynamic replanning, L1/L2/L3 governed DB writes, cross-run memory).
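
The phrase "content-hash line anchors instead of line numbers" can be read in more than one way. The sketch below shows one possible interpretation, std-only: each line is addressed by a short hash of its content, so an edit fails loudly when the target line has changed or is ambiguous. It is not the format used by cap or oh-my-pi; `anchor`, `replace_by_anchor`, and the choice of FNV-1a are assumptions made for illustration.

```rust
/// 32-bit FNV-1a over the line's bytes. Any stable hash would do here;
/// FNV-1a is an assumption chosen because it needs no dependencies.
fn fnv1a(line: &str) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for b in line.bytes() {
        h ^= b as u32;
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

/// Hypothetical anchor format: the hash as 8 lowercase hex digits.
pub fn anchor(line: &str) -> String {
    format!("{:08x}", fnv1a(line))
}

/// Replaces the single line whose anchor equals `target`.
/// Fails when no line matches (the file changed under the edit) or when
/// several identical lines share the anchor (the edit is ambiguous).
/// Lines are rejoined with '\n'; a trailing newline is not preserved.
pub fn replace_by_anchor(text: &str, target: &str, new_line: &str) -> Result<String, String> {
    let lines: Vec<&str> = text.lines().collect();
    let hits: Vec<usize> = lines
        .iter()
        .enumerate()
        .filter(|(_, l)| anchor(l) == target)
        .map(|(i, _)| i)
        .collect();
    match hits.as_slice() {
        [i] => {
            let mut out = lines.clone();
            out[*i] = new_line;
            Ok(out.join("\n"))
        }
        [] => Err("stale anchor: no line has this content hash".to_string()),
        _ => Err("ambiguous anchor: several lines share this content".to_string()),
    }
}
```

Compared with line numbers, an anchor of this kind stays valid when unrelated lines are inserted or removed above the target, and becomes invalid as soon as the target line itself changes.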

Principles

Don't burn tokens on what code can do — lint/format/git run via Sensors and Hooks, not the model. The Compactor manages scarce context.
Isolate, don't interrupt — permissions are decided at sandbox spawn, not prompted per call. Backends are honest about what they enforce (Isolation::{None, Changes, Process}): SeatbeltSandbox (macOS, kernel-level via sandbox-exec) and ContainerSandbox (Docker) enforce; WorktreeSandbox isolates git changes, not capability. Today a sandbox wraps shell exec; in-process fs tools are jailed separately.
Earn autonomy in stages — start at L1, set a budget, graduate only as you build trust. Unattended loops make unattended mistakes; verification is on you.
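
The staged-autonomy rule can be pictured as a small decision function. This is a minimal sketch under stated assumptions: `Maturity`, `Decision`, and `gate` are hypothetical names, not the harness-loop-engine API, and the only behavior taken from the text is that L1 reports, L2 waits for a human on every change, L3 runs allowlisted actions only, and everything sits under a token budget.

```rust
/// Maturity levels as described in the text.
/// NOTE: hypothetical types for illustration, not the harness-loop-engine API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Maturity {
    L1Report,
    L2Assisted,
    L3Unattended,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// Describe the proposed change; perform no side effect.
    ReportOnly,
    /// Pause until a human approves this specific change.
    AwaitHuman,
    /// May be handed to an executor without a human in the loop.
    Execute,
    /// Refused, with the reason.
    Deny(&'static str),
}

/// Decides what may happen to a proposed action at the given maturity level.
pub fn gate(
    level: Maturity,
    action: &str,
    allowlist: &[&str],
    tokens_used: u64,
    token_budget: u64,
) -> Decision {
    // Assumption: the budget applies at every level, so an exhausted loop stops.
    if tokens_used >= token_budget {
        return Decision::Deny("token budget exhausted");
    }
    match level {
        Maturity::L1Report => Decision::ReportOnly,
        Maturity::L2Assisted => Decision::AwaitHuman,
        Maturity::L3Unattended if allowlist.contains(&action) => Decision::Execute,
        Maturity::L3Unattended => Decision::Deny("action not on the allowlist"),
    }
}
```

In this sketch, graduating a loop means changing only the `level` argument; the allowlist and budget checks stay the same at every stage.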

Coding agent

harness code is an interactive, opencode-style coding REPL built entirely on the framework above: multi-turn, streaming, with read/write/edit/list/grep/glob and shell tools. It runs in NORMAL mode (every write, edit, and shell command waits for your y/N approval) or with --yolo (unattended). It ships as a single Rust binary and works with any OpenAI-compatible model:

HARNESS_API_KEY=… HARNESS_BASE_URL=… HARNESS_MODEL=… harness code            # NORMAL
harness code --yolo --workspace .                                            # YOLO

Benchmarks

Measured cost on a fixed task set — deepseek-v4-flash via Aliyun MaaS, 2026-07-04. Every task finished (Done) with side effects verified (sum.txt = 42, etc.). Reproduce any row with harness run "<task>" --json:

task	iterations	tool calls	input tokens	output tokens
list a directory	2	1	975	103
read a file, then answer	2	1	992	130
create a file	2	1	1350	107
read → sum numbers → write result	3	2	2336	260

File writes go through the write_file tool (small structured args), not the model re-emitting whole file bodies each turn — "don't burn tokens on what code can do", measured rather than asserted. cargo run -p eval-bench emits the same per-task cost fields for cross-framework comparison.
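
As a worked example of reading the table, the sketch below adds input and output tokens per task and across the whole task set. The `Row` struct and the function names are hypothetical and are not the schema emitted by harness run --json or eval-bench; the numbers are copied from the table above.

```rust
/// One row of the benchmark table (hypothetical shape, for illustration only).
pub struct Row {
    pub task: &'static str,
    pub iters: u32,
    pub tool_calls: u32,
    pub in_tok: u64,
    pub out_tok: u64,
}

/// Values copied from the benchmark table.
pub const ROWS: [Row; 4] = [
    Row { task: "list a directory", iters: 2, tool_calls: 1, in_tok: 975, out_tok: 103 },
    Row { task: "read a file, then answer", iters: 2, tool_calls: 1, in_tok: 992, out_tok: 130 },
    Row { task: "create a file", iters: 2, tool_calls: 1, in_tok: 1350, out_tok: 107 },
    Row { task: "read, sum numbers, write result", iters: 3, tool_calls: 2, in_tok: 2336, out_tok: 260 },
];

/// Input plus output tokens for one task.
pub fn total_tokens(row: &Row) -> u64 {
    row.in_tok + row.out_tok
}

/// Total tokens across a set of tasks.
pub fn suite_total(rows: &[Row]) -> u64 {
    rows.iter().map(total_tokens).sum()
}
```

With these figures the per-task totals are 1078, 1122, 1457, and 2596 tokens, and the four tasks together cost 6253 tokens, of which 5653 are input tokens.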

Status

Latest: v0.0.24 — honest, OS-native sandboxing: SeatbeltSandbox (macOS) / BubblewrapSandbox (Linux), an Isolation enum that reports what's actually enforced, harness code --sandbox, and an OS-enforced (cap-std) filesystem jail. Full history in CHANGELOG.md.

License

MIT OR Apache-2.0.
