A single Rust binary that installs the autoresearch experiment loop into any AI coding agent, scaffolds your project, and tracks every experiment from the terminal. Designed from day one for LLM agents — structured JSON output, semantic exit codes, machine-readable capabilities, and auto-JSON when piped.
Works with Claude Code, Codex CLI, OpenCode, Cursor, Windsurf, or any agent framework that can shell out to a command.
# Tell your agent: /autoresearch
# Go to sleep. Wake up to results.
## Why
Karpathy's autoresearch ran 126 ML experiments overnight on a single GPU. People have since applied the same pattern to chess engines (expert → grandmaster), Bitcoin modeling (halved prediction errors), Sudoku solvers (beat the paper in 5 minutes), and running 400B models on laptops.
The pattern is simple: one file to modify, one metric to optimize, one loop that never stops.
But every project reimplements this from scratch — copying program.md, figuring out the eval, hand-writing JSONL logs. autoresearch makes it a cargo install away.
## Install

One-liner (macOS / Linux):

Homebrew:

Cargo (from crates.io):

```sh
cargo install autoresearch
```
Binary size: ~1.1MB. Startup: ~2ms. Memory: ~3MB. No Python, no Node, no Docker.
## Quick Start
1. Install the skill into your agent

```sh
autoresearch install <target>
```

This writes the autoresearch loop instructions in each agent's native format — SKILL.md for Claude/Codex/OpenCode, .mdc for Cursor, frontmattered .md for Windsurf.
2. Initialize your project
Or run `autoresearch init` without flags for interactive setup.
This creates:

- `autoresearch.toml` — experiment configuration
- `program.md` — research direction and ideas (edit this!)
- `.autoresearch/` — experiment logs
3. Pre-flight check

```sh
autoresearch doctor
```

Runs 14 checks, including: git repo, config valid, target file exists, eval command runs, metric parseable, branch state, stale locks, program.md, working tree clean.
4. Start the loop
In your agent:

```
/autoresearch
```
The agent reads your config, creates a git branch, runs the baseline, and starts iterating. Each experiment: modify the file → eval → keep or revert → record → repeat. Forever.
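The keep-or-revert step reduces to a metric comparison. A minimal sketch in Python (the function name is illustrative; `direction` mirrors the config's lower/higher setting):

```python
def should_keep(metric: float, best: float, direction: str = "lower") -> bool:
    """Keep the change only if the new metric beats the current best."""
    return metric < best if direction == "lower" else metric > best

# Example with a lower-is-better metric: baseline 1.050
best = 1.050
print(should_keep(1.042, best))  # improvement: keep and commit
print(should_keep(1.055, best))  # regression: revert
```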
5. Wake up to results

```sh
autoresearch log     # experiment history
autoresearch best    # best run + diff from baseline
autoresearch report  # shareable markdown summary
```
## Commands

| Command | Description | Agent-facing |
|---|---|---|
| `install <target>` | Install skill into an AI agent | |
| `init` | Scaffold project (`autoresearch.toml` + `program.md`) | |
| `doctor` | 14-point pre-flight check before starting a loop | * |
| `record` | Record experiment result (handles JSONL, run numbering, deltas) | * |
| `log` | Show experiment history with metrics and status | * |
| `best` | Show best experiment + diff from baseline | * |
| `diff <a> <b>` | Compare two experiments side-by-side | * |
| `status` | Project state, best metric, loop status | * |
| `export` | Export as CSV, JSON, or JSONL | |
| `fork <names...>` | Branch experiments into parallel directions for multi-agent exploration | |
| `review` | Generate cross-model review prompt with pattern detection | * |
| `watch` | Live terminal dashboard — watch experiments in real time | |
| `merge-best` | Compare fork branches and identify the winner | * |
| `report` | Generate markdown research report | |
| `agent-info` | Machine-readable capability metadata | * |
All commands support `--json` for structured output; it is auto-enabled when piped.

Agent-facing commands (*) are designed for LLMs to call directly. They return consistent JSON envelopes with semantic exit codes (0 = success, 1 = runtime error, 2 = config error) and actionable `suggestion` fields on errors.
## How It Works

```
You write program.md            The agent runs the loop

┌──────────┐                    ┌──────────────────────┐
│  Ideas   │                    │ 1. Read program.md   │
│  Papers  │     ───────►       │ 2. Modify target     │
│  Goals   │                    │ 3. Commit            │
└──────────┘                    │ 4. Eval (timeout)    │
                                │ 5. Keep or revert    │
autoresearch.toml               │ 6. autoresearch      │
┌──────────┐                    │    record --metric   │
│ target   │     ───────►       │ 7. Repeat forever    │
│ eval_cmd │                    └──────────────────────┘
│ metric   │                               │
└──────────┘                               ▼
                                .autoresearch/
                                experiments.jsonl
                                ┌───────────────────────┐
                                │ run 0: 1.050 baseline │
                                │ run 1: 1.042 kept     │
                                │ run 2: 1.055 discard  │
                                │ run 3: 1.031 kept     │
                                └───────────────────────┘
```
The CLI handles everything except the loop itself:

- Scaffolding — `init` creates the config and research prompt
- Validation — `doctor` checks everything before you start
- State management — `record` handles JSONL atomically (agents never hand-write JSON)
- Tracking — `log`, `best`, `diff`, `status` parse experiments from git + JSONL
- Reporting — `report` generates a shareable markdown summary
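What `record` takes off the agent's hands can be sketched as an append-only JSONL write with run numbering and a delta against the best run so far. This is a simplified Python illustration of the idea, not the actual Rust implementation, and it assumes a lower-is-better metric:

```python
import json
from pathlib import Path

def record(log: Path, metric: float, status: str) -> dict:
    """Append one experiment to the JSONL log with a run number and delta."""
    lines = log.read_text().splitlines() if log.exists() else []
    runs = [json.loads(line) for line in lines]
    best = min((r["metric"] for r in runs), default=None)  # lower-is-better
    entry = {
        "run": len(runs),  # run 0 is the baseline
        "metric": metric,
        "delta": None if best is None else round(metric - best, 6),
        "status": status,  # e.g. "baseline", "kept", "discarded"
    }
    with log.open("a") as f:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
    return entry
```

In the real tool the agent simply calls `autoresearch record` and never hand-writes JSONL; the sketch only shows the bookkeeping the CLI performs.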
The agent handles the creative work — deciding what to try, implementing changes, interpreting results.
## Agent Integration

### Supported Agents

| Agent | Format | Install path | Slash command |
|---|---|---|---|
| Claude Code | SKILL.md | `~/.claude/skills/autoresearch/` | `/autoresearch` |
| Gemini CLI | SKILL.md | `~/.gemini/skills/autoresearch/` | auto-discovered |
| Codex CLI | SKILL.md | `~/.codex/skills/autoresearch/` | `/autoresearch` |
| OpenCode | SKILL.md | `~/.config/opencode/skills/autoresearch/` | `/autoresearch` |
| GitHub Copilot | SKILL.md | `.github/skills/autoresearch/` | auto-discovered |
| Cursor | SKILL.md | `.cursor/skills/autoresearch/` | auto-discovered |
| Windsurf | SKILL.md | `.windsurf/skills/autoresearch/` | auto-discovered |
| Augment/Goose/Roo | SKILL.md | `.agents/skills/autoresearch/` | auto-discovered |
No skill? No problem. Run `autoresearch guide` to get the full methodology. The CLI coaches agents through hints in every response — skills enhance discovery but aren't required.
### For Agent Developers
Every command supports `--json` and auto-detects piped output:

```sh
# Machine-readable capabilities
autoresearch agent-info --json

# Record experiment (agent calls this, never writes JSONL directly)
autoresearch record --metric 1.042

# Check state before next iteration
autoresearch status --json
```
Exit codes are semantic:

- `0` — success
- `1` — runtime error (retry might help)
- `2` — config/usage error (fix config, don't retry)
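A harness driving the loop can branch on these codes. A minimal sketch (the function name is illustrative, not part of the tool):

```python
def next_action(exit_code: int) -> str:
    """Decide what an agent loop should do after an autoresearch call."""
    if exit_code == 0:
        return "continue"    # success: move on to the next experiment
    if exit_code == 1:
        return "retry"       # runtime error: a retry might help
    if exit_code == 2:
        return "fix-config"  # config/usage error: fix autoresearch.toml first
    return "abort"           # unexpected code: stop the loop
```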
Error JSON includes machine-readable codes and recovery suggestions:
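For illustration, a hypothetical error envelope — the field names here are assumptions, apart from the `suggestion` field described above:

```json
{
  "ok": false,
  "code": "eval_failed",
  "message": "eval command exited non-zero",
  "suggestion": "run `autoresearch doctor` to validate eval_cmd"
}
```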
## Configuration

`autoresearch.toml`:

```toml
target    = "train.py"         # The single file the agent may modify
eval_cmd  = "python train.py"  # Must print the metric to stdout
metric    = "val_bpb"          # What the metric is called
direction = "lower"            # "lower" or "higher"
timeout   = "5m"               # Max time per experiment
branch    = "autoresearch"     # Git branch for experiments
```
`program.md` is free-form — tell the agent what to explore, link papers, set constraints. The agent reads it between experiments for inspiration.
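A sketch of what a `program.md` might contain (the ideas listed are illustrative; the metric, target file, and timeout match the config above):

```markdown
# Research direction

Goal: reduce val_bpb on the validation set.

Ideas to try:
- Alternative learning-rate schedules
- Different weight initializations

Constraints:
- Only modify train.py
- Keep each run under the 5m timeout
```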
## Multi-Direction Exploration

### Fork: Parallel Branches
When you want to explore multiple directions simultaneously:

```sh
autoresearch fork try-transformers try-convolutions
```

Creates `autoresearch-fork-try-transformers`, `autoresearch-fork-try-convolutions`, etc. from the current best. Start a separate agent on each:
```sh
git switch autoresearch-fork-try-transformers && claude   # one agent per branch
```
### Review: Cross-Model Second Opinions
After running experiments, get a second model to review your progress:

```sh
autoresearch review
```
Generates a structured review prompt with:
- Session summary (kept/discarded rates, baseline vs best)
- Pattern detection (stuck detection, repeated failure themes)
- Specific questions for the reviewer to answer
- Suggested next directions
Pipe to Codex or Gemini for cross-model insights that break local minima.
### Watch: Live Dashboard
Monitor experiments in real time from another terminal:

```sh
autoresearch watch
```
Shows a live-updating dashboard with sparkline progress, kept/discarded rates, best metric, and new experiment notifications. Refreshes every 2 seconds (configurable with -i).
### Merge Best: Pick the Winner
After forks finish exploring, compare them and find the winner:

```sh
autoresearch merge-best
```
Ranks all fork branches by best metric, shows a comparison table, and gives you the git commands to merge the winner and clean up losers.
## What People Are Building
The autoresearch pattern works on anything with a measurable metric:
| Domain | Metric | Result |
|---|---|---|
| ML training | val_bpb | 126 experiments overnight, 11% improvement |
| Chess engine | Elo rating | Expert → Grandmaster (2718 Elo) |
| Bitcoin modeling | Prediction error | Halved error in one morning |
| Sudoku solver | Accuracy | Beat published paper (87% → 92.2%) |
| API latency | p99 ms | 37% reduction via KD-tree optimization |
| Trading bots | Score | 43,000% improvement via evolutionary loop |
## Inspired By
- Karpathy's autoresearch — the original pattern (42K stars)
- uditgoenka/autoresearch — generalized Claude Code skill (608 stars)
- ARIS — cross-model research pipeline
- ResearcherSkill — domain-agnostic research agent
- SkyPilot scaling — multi-GPU parallel autoresearch
## License
MIT — 199 Biotechnologies