galdr 0.13.0 - Docs.rs

<p align="center">
  <img src="assets/banner.svg" alt="galdr — Record & Replay for agent skills" width="100%">
</p>

<p align="center">
  <a href="https://crates.io/crates/galdr"><img src="https://img.shields.io/crates/v/galdr.svg" alt="crates.io"></a>
  <a href="https://github.com/Arakiss/galdr/actions/workflows/ci.yml"><img src="https://github.com/Arakiss/galdr/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <img src="https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue.svg" alt="License: MIT OR Apache-2.0">
  <img src="https://img.shields.io/badge/rust-1.88%2B-orange.svg" alt="Rust 1.88+">
  <img src="https://img.shields.io/badge/egress-none%20·%20loopback--only-4FD6C9.svg" alt="No external egress, loopback-only">
</p>

# galdr

> _galdr_ — Old Norse for a chanted spell: a sequence performed once and sung again.

**Record & Replay for agent skills.** galdr records the *tool calls* an agent harness
already emits — not pixels, not the screen — stores them as a raw, immutable span, and
distills them into a reproducible skill. Everything is local: the raw lives in
`~/.galdr` and nothing leaves the machine.

The idea: instead of re-explaining to your agent how to do a task it already did well,
**record it once** and turn it into a skill it can replay with judgment.

<p align="center">
  <img src="assets/demo.gif" alt="galdr: record a task, distill it into a skill, surface opportunities, measure replay reliability" width="100%">
</p>

> [!NOTE]
> Around the core loop (record → distill → replay) galdr adds: a SQLite catalog (a
> rebuildable *index*, never the truth) with a supervisor daemon, an install-time
> content gate (blocks secrets, personal paths, dangerous commands), a terminal UI,
> safe export, diff-based **parametrization** of two recordings, **`galdr suggest`**
> (repeated tasks worth turning into a skill), **`galdr bench`** (per-skill replay
> hit-rate from recorded outcomes), and optional autonomous distillation against a
> local, loopback-only MLX server. A plain `cargo build` needs none of the MLX path —
> it is feature-gated and falls back cleanly to the faithful mechanical render.

## Why tool calls, not pixels

A GUI Record & Replay records what the screen looked like. It breaks when a button
moves. An agent already emits a clean, structured trace of *what it did*: each tool
call, its input, and its result. galdr records that substrate. The replay is not a
pixel re-enactment — it is a skill the agent reads and applies with judgment.

**The honest scope.** galdr records what *your agent* did, not what *you* did by hand
outside it. Note this is narrower than it sounds: when the agent drives a browser
through a tool — a Playwright/Chrome MCP server, a browser tool — those clicks, types,
and navigations **are** tool calls, so galdr already captures and distills them like
any other step. The only thing out of scope is capturing a *human's* manual gestures
in a browser the agent never touched; that would need a separate pixel/DOM capture
layer and contradicts the "tool calls, not pixels" thesis, so it stays out of the core
(a roadmap item, and a job for the extension seam if ever). Sold honestly: this is
Record & Replay *for what your agent does* — including its web automation and the GUI
work it drives through Computer Use (those actions are tool calls too; galdr keeps the
action and drops the screenshot).

## How it works

```
agent session ──(PostToolUse)──▶ galdr hook ──append──▶ span (JSONL)
                                                              │
                                            galdr distill ◀───┘
                                                   │
                                                   ▼
                                      ~/.agents/skills/<name>/SKILL.md
```

1. **Sensor** (`galdr hook`) — invoked by the harness after each tool call. If a
   recording is active, it appends the event to the span. It is instantaneous and
   **always exits 0**: it never breaks the session, even if it fails internally.
2. **Recording** (`galdr rec start` / `stop`) — opens and closes the span, and writes
   the recording metadata.
3. **Distillation** (`galdr distill [reference]`) — a replay of the tool calls is not yet
   a skill. galdr renders a faithful draft from the span (real steps, secrets redacted)
   and hands the agent an authoring brief: read the span, supply the judgment galdr can't
   — the problem it solves, the inputs that vary, each step's intent, the gotchas — and
   install the authored skill with `--from`. galdr owns the mechanism; the agent owns the
   intelligence. `--fast` installs the mechanical render as-is in one step (a floor, for a
   human or a headless run); `--auto` lets a local MLX model author it.

## Install

```sh
# from crates.io
cargo install galdr

# or from source
cargo install --git https://github.com/Arakiss/galdr
```

Prefer a prebuilt binary? Every [release](https://github.com/Arakiss/galdr/releases/latest)
ships signed, checksummed binaries (Sigstore + SHA-256) and an SBOM for macOS and Linux
(arm64 + x86_64). No network egress at runtime — galdr only ever talks to loopback.

## Quickstart

```sh
galdr setup skill         # teach your harness(es) how to drive galdr (one time)

galdr rec start demo      # ● recording "demo" — now do the task with your agent
#  ... a few tool calls ...
galdr rec stop            # ■ stopped "demo" — 6 steps

galdr distill             # render a faithful draft of the most recent recording + an authoring brief
#  ... read the span, write the real skill, then ...
galdr distill --from skill.md   # install the skill you authored
```

galdr captures *what* ran; you supply *why*. The default hands you a faithful draft and a
brief — install your authored version with `--from`, or `galdr distill --fast` to accept
the mechanical render as-is (a floor, not a finished skill).

**You never copy a 26-character id.** Every command that takes a recording resolves it
the way you think about it — the most recent by default, or by name, or by a short id
prefix:

```sh
galdr distill demo        # draft the recording named "demo"
galdr show                # inspect the most recent recording's steps
galdr suggest             # repeated tasks worth turning into a skill
galdr bench               # how reliably your distilled skills actually replay
```

Run `galdr` with no arguments for a one-screen overview: whether anything is recording,
how many recordings and skills you have, and the next command to type.

## More commands

```sh
galdr daemon --detach        # run the supervisor daemon (catalog indexer + socket)
galdr daemon status          # check whether the daemon is answering
galdr daemon stop            # ask the daemon to shut down gracefully
galdr show [reference]       # inspect one recording (name, id prefix, or omit for the latest)
galdr skills                 # list installed skills, galdr/external origin, provenance, and readiness
galdr harnesses              # detect agent harnesses on this system and whether galdr's sensor is wired
galdr harnesses --json       # the same, machine-readable
galdr link                   # make distilled skills discoverable by every installed harness
galdr link --skill <name>    # link just one skill
galdr evaluations            # list skill evaluator outputs from the catalog
galdr evaluations --skill <name>   # show one skill's evaluator history
galdr outcome usage --skill <name> --outcome success   # --rec defaults to the latest recording
galdr outcome label --skill <name> --label accepted --evaluator human
galdr outcome list --skill <name>  # inspect captured usage/outcome labels
galdr suggest                # skill opportunities: repeated tasks not yet distilled
galdr suggest --min-count 1  # also surface single, undistilled recordings
galdr bench                  # replay reliability: per-skill hit-rate from recorded outcomes
galdr reindex                # rebuild the SQLite catalog from disk
galdr doctor                 # diagnose config, catalog, daemon, skills, and hook wiring
galdr setup claude --check   # check Claude Code PostToolUse hook wiring
galdr setup claude --print   # print the safe settings.json snippet
galdr setup codex --check    # check Codex PostToolUse hook wiring (~/.codex/hooks.json)
galdr setup codex --print    # print the safe Codex hooks snippet
galdr tui                    # lazygit-style browser: recordings, spans, skills, harnesses

galdr diff <a> <b>           # diff two recordings: constants vs parameters
galdr parametrize <a> <b> --emit   # write a parametrized SKILL.md (suffix -param)
galdr export [reference] --out ./export          # export metadata + summaries, no raw payloads
galdr export [reference] --out ./export --redact   # export a redacted raw copy

galdr distill --auto         # autonomous distillation of the latest recording (local MLX, see below)
```

galdr is two surfaces over one catalog, and **both serve humans and agents**. The CLI
gives a person warm, colorized output and a no-id-needed flow (resolve a recording by
the latest, a name, or a short prefix; `NO_COLOR` and non-TTY output are honored); it
gives an agent `--json` on every read command — `list`, `show`, `skills`, `evaluations`,
`harnesses`, `outcome list` — emitting a single parseable document, so a tool consumes
galdr without scraping a table:

```sh
galdr list --json | jq '.[].rec_id'
galdr skills --json | jq '[.[] | select(.origin == "galdr")]'
```

### Terminal UI

`galdr tui` is the browsable face — a lazygit-style cockpit over the same catalog, no
flags to memorize. Three panels (Recordings · Skills · Harnesses), moved with `1`/`2`/`3`
or `tab`, and a live preview that follows the selection:

- **Recordings** — `enter` steps into the span; `d` distills a complete skill; `e`
  exports it without raw payloads; `o` reveals the span path.
- **Skills** — `enter` reads the `SKILL.md` (scrollable); `l` links it into every
  installed harness; `v` validates it against the content gate; `O` records a success
  outcome (the signal `galdr bench` reads).
- `/` filters recordings and skills, `?` lists the keybindings, `q` quits.

Every action the TUI takes is **local and additive** — it distills, links, validates,
exports, and records outcomes. It never deletes a span, rewrites a skill behind your
back, or mutates your harness settings.

The daemon is optional. `list`/`show`/`skills` answer daemon-first, then from the
read-only catalog, then from a fresh in-memory index built straight from disk — so the
CLI works whether or not the daemon is running. Write paths also keep the local catalog
fresh best-effort, so a closed recording or newly installed skill is visible without
waiting for a daemon process.

`galdr doctor` checks the local root, config, daemon socket, active recording, rebuildable
catalog, skill provenance, draft skills, and Claude Code hook wiring. `galdr setup claude
--print` emits the safe hook snippet; it never mutates your settings file.

`galdr skills` is a small skill catalog, not just a name list. It reports each
skill's provenance, lifecycle status (`draft`, `final`, `param-draft`, `unknown`), a
simple 0-100 readiness score, and the latest delta versus the previously indexed
version. The score is intentionally explainable: frontmatter, required sections, draft
markers, and provenance. It is a lint/readiness guardrail, not a claim that the skill is
objectively good.

Evaluator outputs are stored separately from the skill row and can be inspected with
`galdr evaluations`. Today's built-in evaluator is `readiness_lint`; future evaluators
can add human review, LLM review, outcome evidence, or a learned model without mixing
those signals into one opaque "quality" number.

`galdr outcome` is the supervised-data lane. It records append-only local JSONL events
under `~/.galdr/outcomes/` and indexes them into the catalog:

- `galdr outcome usage` records that a skill was used in a later recording, with task
  kind, outcome, retry count, manual intervention count, and notes.
- `galdr outcome label` records explicit labels or reviews such as `accepted`,
  `rejected`, `needs_review`, `regression`, or any project-specific label.
- `galdr outcome list` shows the captured usage and labels.

This keeps learned-model training out of the agent loop while collecting the labels a
future offline classifier or ranker needs: skill content/hash, provenance, task context,
observed outcome, retries, interventions, and reviewer labels.

### Optional: autonomous distillation (local MLX)

`galdr distill --auto` (optionally with a recording reference) lets a local model write
the finished skill instead of the agent. It is **off by default** and **loopback-only** —
it never reaches off the machine.

1. Build with the feature: `cargo install --path . --features mlx`.
2. Run a local OpenAI-compatible server, e.g. `mlx_lm.server` from
   [mlx-lm](https://github.com/ml-explore/mlx-lm):
   `python3 -m mlx_lm.server --model mlx-community/Qwen3-4B-Instruct-2507-4bit`
   (the default endpoint `http://127.0.0.1:8080` and model are configurable in
   `~/.galdr/config.json`).
3. `galdr distill --auto` (or `--engine mlx-subprocess` to shell out to
   `python3 -m mlx_lm.generate`, which needs no `mlx` feature).

If the engine is missing or unreachable, `--auto` falls back to the deterministic
complete skill and still exits 0. Always review a machine-generated skill before relying
on it.

## Wiring the sensor

For the sensor to receive events, the harness must invoke `galdr hook` in its
*after-each-tool* hook, passing the event on stdin. In Claude Code, add a `PostToolUse`
entry to `~/.claude/settings.json` — hook arrays are concatenated, so it coexists with
any other hooks you have:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "hooks": [
          { "type": "command", "command": "galdr hook" }
        ]
      }
    ]
  }
}
```

The sensor reads the harness fields from stdin (`tool_name`, `tool_input`,
`tool_response`, `cwd`, `session_id`, `transcript_path`) and is a no-op when no
recording is active.

If you prefer a resilient command that finds `galdr` on `PATH` and falls back to the
cargo bin, that form works too and is recognized by `galdr doctor` / `setup claude
--check`:

```sh
if command -v galdr >/dev/null 2>&1; then galdr hook; \
elif [ -x "$HOME/.cargo/bin/galdr" ]; then "$HOME/.cargo/bin/galdr" hook; fi
```

## Multi-harness: one skill, every harness

galdr distills a skill once, into the open-standard skills root (`~/.agents/skills`),
then makes it discoverable in **every harness installed on the machine**. Each harness
loads skills from its own directory, so galdr links the canonical skill into each one:

| Harness | Skills directory | Sensor wiring |
|---|---|---|
| Claude Code | `~/.claude/skills` | `~/.claude/settings.json` PostToolUse hook |
| Codex | `~/.codex/skills` | `~/.codex/hooks.json` (same hook shape) |
| Cursor | `~/.cursor/skills-cursor` | — |

The link is a symlink back to the canonical copy (the same mechanism a hand-linked
skill already uses), created on install, never clobbering a real file of the same
name. `galdr harnesses` shows what's installed; `galdr link` (re)links every skill;
`galdr doctor` flags any galdr skill a harness can't see. The result: record a task in
one harness, get a reusable skill in all of them.

## On-disk layout

```
~/.galdr/
├── active                      active recording (JSON); absent = not recording
├── config.json                 optional config (distill engine, endpoint, model)
├── galdrd.sock                 daemon control socket (NDJSON over a Unix socket)
├── galdrd.pid                  daemon pidfile
├── catalog.sqlite              queryable index, rebuilt from spans/ + recordings/ + skills
├── spans/<rec_id>.jsonl        append-only span, one JSON line per tool call
├── outcomes/skill_usage.jsonl  append-only skill usage observations
├── outcomes/skill_outcomes.jsonl append-only skill labels and reviews
└── recordings/<rec_id>.json    metadata for each closed recording
```

The span is the raw source of truth: append-only, immutable, inspectable. The SQLite
catalog is an **index, never the truth** — it stores one-line step summaries (no raw
blobs), readiness/evaluation rows, and outcome indexes. `galdr reindex` rebuilds it from
spans, recordings, skills, and outcome logs at any time.

The root is `~/.galdr` by default. Set `GALDR_ROOT` to relocate it (and
`GALDR_SKILLS_ROOT` to relocate the skills directory) — useful for hermetic tests,
throwaway profiles, CI, or to keep the daemon's Unix socket path under the platform's
length limit.

`config.json` may also include optional capture policy for future recordings:

```json
{
  "capture": {
    "deny_tools": ["SecretTool"],
    "deny_cwd_prefixes": ["/private/project"],
    "max_response_chars": 4000,
    "strip_screenshots": true
  }
}
```

The policy only applies to new events as the hook records them. It never edits spans
that already exist.

`strip_screenshots` (on by default) drops base64 image blobs — a Computer Use
screenshot, an image content block — from the recorded event, keeping the *action*
(click, type, key) but not the pixels. The pixels are large and may show sensitive
on-screen content, and they are never the reusable signal; the action is. This is what
lets galdr record a GUI task the agent drives via Computer Use and distill it into a
clean, semantic skill — the agent's GUI work is just more tool calls.

## Extension points

The core exposes two generic seams (`src/ext.rs`) and nothing more:

- **`PermissionGate`** — decides whether an event may be recorded. Allows everything by
  default.
- **`ProvenanceSink`** — observes recorded events. Does nothing by default.

Any concrete integration (permission policy, traceability) plugs in by implementing
these traits from outside the repository.

## Security & privacy

The span captures each tool's `tool_input` and `tool_response`, which **may contain
sensitive data** (file contents, commands, outputs). Keep that in mind before recording
and before sharing a recording or a distilled skill. See [SECURITY.md](SECURITY.md).

Design guarantees:

- The raw lives **only** in `~/.galdr`, on your machine. galdr makes **no external
  network egress**. The one optional exception — autonomous distillation (`--auto`,
  feature `mlx`) — talks **only to loopback**, enforced in code by
  `engine::validate_loopback`; a non-loopback endpoint is a hard error.
- The sensor never propagates errors to the agent session, and never depends on the
  daemon: it appends to the span first, then hints the daemon best-effort.
- The autonomous distiller treats the recorded span as untrusted data (delimiter, low
  temperature, output validation, human review). Default distillation is still done by
  your own agent, locally.

## Roadmap

Phase 1 (shipped, built on the Phase 0 loop):

- ✅ **Supervisor daemon** + Unix socket + **SQLite** as a rebuildable catalog.
- ✅ **TUI** to browse recordings, inspect spans, and audit skill provenance.
- ✅ **Diff-based parametrization** of two recordings (constant = steps, variable = inputs).
- ✅ **Autonomous distillation** against a local, loopback-only MLX server.
- ✅ **Skill catalog readiness signals** with lifecycle status, provenance, score, delta,
  and an evaluator table for future human/LLM/model reviews.
- ✅ **Operational diagnostics** (`doctor`, `rec status`, daemon status/stop, Claude setup
  check/print).
- ✅ **Safe export path** that omits raw payloads by default and can emit redacted raw
  copies without touching the original span.

Phase 2 (shipped):

- ✅ **Faithful render** — `galdr distill` renders a valid skill from the span in the
  open-standard `When to use` / `Inputs` / `Steps` / `Verification` anatomy. `--fast`
  installs it as-is; `--auto` lets a local MLX model author it.
- ✅ **Multi-harness discoverability** — a distilled skill is linked into every installed
  harness's skills directory (Claude Code, Codex, Cursor); `galdr link` / `doctor` manage it.
- ✅ **Multi-harness sensor** — `galdr setup codex` wires the same hook into Codex's
  `hooks.json`; `galdr harnesses` shows which harnesses are wired.
- ✅ **Session-scoped recording** — a recording binds to the session that started it, so a
  concurrent session in another project can't leak its tool calls into the span.
- ✅ **AI-first CLI** — `--json` on every read command.

Phase 3 (shipped) — a surface humans and agents both want to use:

- ✅ **Author by default** — a replay of the tool calls is not a skill. `galdr distill`
  now renders a faithful draft and hands the agent an authoring brief (supply the why,
  the generalized inputs, the gotchas), installed with `--from`; an unauthored draft
  scores lower until it is raised. `--fast` keeps the mechanical one-shot for a floor.
- ✅ **No ids to copy** — every command resolves a recording by the most recent, a name,
  or a short id prefix. `galdr distill` distills the run you just made.
- ✅ **A friendly home screen** — `galdr` with no arguments shows where you are and the
  next command to type; output is colorized, TTY-aware, and honors `NO_COLOR`.
- ✅ **`galdr suggest`** — signs every recording by the shape of its steps, dedupes
  against installed skills, and ranks the repeated tasks worth distilling.
- ✅ **`galdr bench`** — aggregates recorded outcomes into a per-skill replay hit-rate
  and effort cost; the production signal a capability test cannot give.
- ✅ **A lazygit-style TUI** — three panels, a live preview, and in-place actions
  (distill, link, validate, export, record outcome), all local and additive.

Next:

- **Capture of human GUI gestures** — the deliberate scope gap above. The agent's *own*
  browser automation is already captured (it arrives as tool calls). What's missing is
  recording a *human* driving a browser by hand, which needs a separate pixel/DOM capture
  layer on top of the span model — the one axis where Codex's pixel recorder does something
  galdr does not, and a job for the extension seam rather than the core.
- Verify the Codex sensor end to end with a live Codex recording (the wiring is in place;
  the stdin payload is not yet confirmed field-for-field).
- A multi-agent broker over the same span model.
- Real gates and real provenance plugged into the extension layer (`PermissionGate`,
  `ProvenanceSink`) — where harness-specific policy or memory integrations live, kept out
  of the local-first core.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Commits follow
[Conventional Commits](https://www.conventionalcommits.org/); the project uses
[Semantic Versioning](https://semver.org/).

## License

Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE) or the
[MIT license](LICENSE-MIT) at your option — the standard Rust dual-license, so you
pick whichever fits.

Unless you explicitly state otherwise, any contribution intentionally submitted for
inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual
licensed as above, without any additional terms or conditions.