autoresearch 0.3.3

Universal autoresearch CLI — install skills, track experiments, view results across any AI coding agent
autoresearch is not a library.

A single Rust binary that installs the autoresearch experiment loop into any AI coding agent, scaffolds your project, and tracks every experiment from the terminal. Designed from day one for LLM agents — structured JSON output, semantic exit codes, machine-readable capabilities, and auto-JSON when piped.

Works with Claude Code, Codex CLI, OpenCode, Cursor, Windsurf, or any agent framework that can shell out to a command.

autoresearch install claude-code
autoresearch init --target-file train.py --eval-command "python train.py" --metric-name val_bpb
# Tell your agent: /autoresearch
# Go to sleep. Wake up to results.

Why

Karpathy's autoresearch ran 126 ML experiments overnight on a single GPU. People have since applied the same pattern to chess engines (expert → grandmaster), Bitcoin modeling (halved prediction errors), Sudoku solvers (beat the paper in 5 minutes), and running 400B models on laptops.

The pattern is simple: one file to modify, one metric to optimize, one loop that never stops.

But every project reimplements this from scratch — copying program.md, figuring out the eval, hand-writing JSONL logs. autoresearch makes it a cargo install away.

Install

One-liner (macOS / Linux):

curl -LsSf https://github.com/199-biotechnologies/autoresearch-cli/releases/latest/download/autoresearch-installer.sh | sh

Homebrew:

brew tap 199-biotechnologies/tap
brew install autoresearch

Cargo (from crates.io):

cargo install autoresearch

Binary size: ~1.1MB. Startup: ~2ms. Memory: ~3MB. No Python, no Node, no Docker.

Quick Start

1. Install the skill into your agent

autoresearch install all           # Claude Code + Codex + OpenCode + Cursor + Windsurf

This writes the autoresearch loop instructions in each agent's native format — SKILL.md for Claude/Codex/OpenCode, .mdc for Cursor, frontmattered .md for Windsurf.

2. Initialize your project

cd your-project
autoresearch init \
  --target-file train.py \
  --eval-command "python train.py" \
  --metric-name val_bpb \
  --metric-direction lower \
  --time-budget 5m

Or run autoresearch init without flags for interactive setup.

This creates:

  • autoresearch.toml — experiment configuration
  • program.md — research direction and ideas (edit this!)
  • .autoresearch/ — experiment logs

3. Pre-flight check

autoresearch doctor

Runs 14 checks, including: git repo present, config valid, target file exists, eval command runs, metric parseable, branch state, stale locks, program.md present, working tree clean.

4. Start the loop

In your agent:

/autoresearch

The agent reads your config, creates a git branch, runs the baseline, and starts iterating. Each experiment: modify the file → eval → keep or revert → record → repeat. Forever.
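
Each iteration's keep-or-revert decision reduces to comparing the new metric against the best so far under metric_direction. A minimal sketch (the function name keep_result is hypothetical, not part of the CLI):

```python
def keep_result(metric: float, best: float, direction: str) -> bool:
    """Return True if the new metric beats the current best.

    direction matches metric_direction in autoresearch.toml:
    "lower" means smaller is better, "higher" the opposite.
    """
    if direction == "lower":
        return metric < best
    return metric > best

# With a baseline val_bpb of 1.050 and direction "lower":
keep_result(1.042, 1.050, "lower")   # True  -> keep the change
keep_result(1.055, 1.042, "lower")   # False -> revert the change
```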

5. Wake up to results

autoresearch status              # Quick overview
autoresearch log                 # Full experiment history
autoresearch best                # Best result + diff from baseline
autoresearch report              # Full markdown research report
autoresearch diff 12 45          # Compare any two experiments
autoresearch export --format csv # Export for analysis

Commands

Command            Description (* = agent-facing)
install <target>   Install skill into an AI agent
init               Scaffold project (autoresearch.toml + program.md)
doctor             14-point pre-flight check before starting a loop *
record             Record experiment result (handles JSONL, run numbering, deltas) *
log                Show experiment history with metrics and status *
best               Show best experiment + diff from baseline *
diff <a> <b>       Compare two experiments side-by-side *
status             Project state, best metric, loop status *
export             Export as CSV, JSON, or JSONL
fork <names...>    Branch experiments into parallel directions for multi-agent exploration
review             Generate cross-model review prompt with pattern detection *
watch              Live terminal dashboard — watch experiments in real time
merge-best         Compare fork branches and identify the winner *
report             Generate markdown research report
agent-info         Machine-readable capability metadata *

All commands support --json for structured output. Auto-enabled when piped.

Agent-facing commands (*) are designed for LLMs to call directly. They return consistent JSON envelopes with semantic exit codes (0=success, 1=runtime error, 2=config error) and actionable suggestion fields on errors.
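
A caller can branch on those exit codes before touching the payload. A sketch of one way to wrap the CLI (run_autoresearch and classify_exit are hypothetical helper names; the parsing assumes --json emits a single JSON object on stdout):

```python
import json
import subprocess

def classify_exit(code: int) -> str:
    """Map autoresearch's semantic exit codes to a retry policy."""
    if code == 0:
        return "success"
    if code == 1:
        return "retryable"      # runtime error: retry might help
    if code == 2:
        return "config_error"   # fix autoresearch.toml, don't retry
    return "unknown"

def run_autoresearch(args: list) -> tuple:
    """Run a subcommand with --json; return (policy, parsed envelope)."""
    proc = subprocess.run(["autoresearch", *args, "--json"],
                          capture_output=True, text=True)
    envelope = json.loads(proc.stdout) if proc.stdout.strip() else {}
    return classify_exit(proc.returncode), envelope
```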


How It Works

You write program.md       The agent runs the loop
     ┌──────────┐          ┌───────────────────────┐
     │  Ideas   │          │  1. Read program.md   │
     │  Papers  │ ───────► │  2. Modify target     │
     │  Goals   │          │  3. Commit            │
     └──────────┘          │  4. Eval (timeout)    │
                           │  5. Keep or revert    │
autoresearch.toml          │  6. autoresearch      │
     ┌──────────┐          │     record --metric   │
     │ target   │ ───────► │  7. Repeat forever    │
     │ eval_cmd │          └───────────────────────┘
     │ metric   │                      │
     └──────────┘                      ▼
                           .autoresearch/
                           experiments.jsonl
                           ┌───────────────────────┐
                           │ run 0: 1.050 baseline │
                           │ run 1: 1.042 kept     │
                           │ run 2: 1.055 discard  │
                           │ run 3: 1.031 kept     │
                           └───────────────────────┘

The CLI handles everything except the loop itself:

  • Scaffolding: init creates the config and research prompt
  • Validation: doctor checks everything before you start
  • State management: record handles JSONL atomically (agents never hand-write JSON)
  • Tracking: log, best, diff, status parse experiments from git + JSONL
  • Reporting: report generates a shareable markdown summary

The agent handles the creative work — deciding what to try, implementing changes, interpreting results.
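
Because record owns the file, experiments.jsonl stays one valid JSON object per line, which makes downstream analysis trivial. A sketch of scanning it for the best kept run (the field names run, metric, status and the status values are assumptions inferred from the diagram above, not a documented schema):

```python
import json

def best_experiment(jsonl_text: str, direction: str = "lower") -> dict:
    """Return the best kept record from experiments.jsonl content.

    Assumed record shape: {"run": 1, "metric": 1.042, "status": "kept"}.
    """
    records = [json.loads(line)
               for line in jsonl_text.splitlines() if line.strip()]
    kept = [r for r in records if r.get("status") in ("kept", "baseline")]
    pick = min if direction == "lower" else max
    return pick(kept, key=lambda r: r["metric"])

log = """\
{"run": 0, "metric": 1.050, "status": "baseline"}
{"run": 1, "metric": 1.042, "status": "kept"}
{"run": 2, "metric": 1.055, "status": "discarded"}
{"run": 3, "metric": 1.031, "status": "kept"}
"""
best_experiment(log)  # -> best is run 3 (metric 1.031)
```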


Agent Integration

Supported Agents

Agent              Format    Install path                               Slash command
Claude Code        SKILL.md  ~/.claude/skills/autoresearch/             /autoresearch
Gemini CLI         SKILL.md  ~/.gemini/skills/autoresearch/             auto-discovered
Codex CLI          SKILL.md  ~/.codex/skills/autoresearch/              /autoresearch
OpenCode           SKILL.md  ~/.config/opencode/skills/autoresearch/    /autoresearch
GitHub Copilot     SKILL.md  .github/skills/autoresearch/               auto-discovered
Cursor             SKILL.md  .cursor/skills/autoresearch/               auto-discovered
Windsurf           SKILL.md  .windsurf/skills/autoresearch/             auto-discovered
Augment/Goose/Roo  SKILL.md  .agents/skills/autoresearch/               auto-discovered

No skill? No problem. Run autoresearch guide to get the full methodology. The CLI coaches agents through hints in every response — skills enhance discovery but aren't required.

For Agent Developers

Every command supports --json and auto-detects piped output:

# Machine-readable capabilities
autoresearch agent-info --json

# Record experiment (agent calls this, never writes JSONL directly)
autoresearch record --metric 1.042 --status kept --summary "Tuned learning rate"

# Check state before next iteration
autoresearch status --json

Exit codes are semantic:

  • 0 — success
  • 1 — runtime error (retry might help)
  • 2 — config/usage error (fix config, don't retry)

Error JSON includes machine-readable codes and recovery suggestions:

{
  "status": "error",
  "error": {
    "code": "no_experiments",
    "message": "No experiments found on branch 'autoresearch'.",
    "suggestion": "Run `autoresearch init` then start the autoresearch loop in your agent."
  }
}

Configuration

autoresearch.toml:

target_file = "train.py"           # The single file the agent may modify
eval_command = "python train.py"   # Must print the metric to stdout
metric_name = "val_bpb"            # What the metric is called
metric_direction = "lower"         # "lower" or "higher"
time_budget = "5m"                 # Max time per experiment
branch = "autoresearch"            # Git branch for experiments
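
The only hard requirement on eval_command is that it prints the metric to stdout within the time budget. A minimal stand-in for train.py might look like this (the value is a placeholder, and the name: value output format is an assumption, not a documented parser contract):

```python
# train.py -- stand-in eval script for the autoresearch loop.
# A real script would train and evaluate, then print the measured metric.

val_bpb = 1.050  # placeholder; compute this from your eval set

# The printed name must match metric_name in autoresearch.toml so the
# metric can be parsed out of stdout.
print(f"val_bpb: {val_bpb:.3f}")
```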

program.md is free-form — tell the agent what to explore, link papers, set constraints. The agent reads it between experiments for inspiration.


Multi-Direction Exploration

Fork: Parallel Branches

When you want to explore multiple directions simultaneously:

autoresearch fork try-transformers try-convolutions try-linear

Creates autoresearch-fork-try-transformers, autoresearch-fork-try-convolutions, etc. from the current best. Start a separate agent on each:

git checkout autoresearch-fork-try-transformers && /autoresearch

Review: Cross-Model Second Opinions

After running experiments, get a second model to review your progress:

autoresearch review --json | jq -r '.data.review_prompt'

Generates a structured review prompt with:

  • Session summary (kept/discarded rates, baseline vs best)
  • Pattern detection (stuck detection, repeated failure themes)
  • Specific questions for the reviewer to answer
  • Suggested next directions

Pipe to Codex or Gemini for cross-model insights that break local minima.

Watch: Live Dashboard

Monitor experiments in real time from another terminal:

autoresearch watch

Shows a live-updating dashboard with sparkline progress, kept/discarded rates, best metric, and new experiment notifications. Refreshes every 2 seconds (configurable with -i).

Merge Best: Pick the Winner

After forks finish exploring, compare them and find the winner:

autoresearch merge-best

Ranks all fork branches by best metric, shows a comparison table, and gives you the git commands to merge the winner and clean up losers.


What People Are Building

The autoresearch pattern works on anything with a measurable metric:

Domain            Metric            Result
ML training       val_bpb           126 experiments overnight, 11% improvement
Chess engine      Elo rating        Expert → Grandmaster (2718 Elo)
Bitcoin modeling  Prediction error  Halved error in one morning
Sudoku solver     Accuracy          Beat published paper (87% → 92.2%)
API latency       p99 ms            37% reduction via KD-tree optimization
Trading bots      Score             43,000% improvement via evolutionary loop

Inspired By

License

MIT — 199 Biotechnologies