basinski 0.1.0

Rescues media files from their own disintegration. Named for William Basinski's Disintegration Loops.
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

`basinski` is a single-binary Rust CLI that rescues damaged media files (mostly
head- and tail-truncated MP4s) by reconstructing the structure that ffmpeg
requires before it will open them.
The user-facing field guide lives in `README.md` — read it for the *why* and the
operator workflow; this file is the *how the code fits together*.

## Commands

```sh
cargo build --release        # binary at target/release/basinski
cargo test                   # unit tests (embedded #[cfg(test)] modules)
cargo test forensics::       # run one module's tests; or `cargo test <name>` for one test
bash tests/e2e.sh            # full round-trip: generate→head-truncate→rescue→verify (see below)
cargo clippy                 # lint
cargo fmt                    # format
basinski completions zsh     # print a shell-completion script (bash/fish/etc. too)
```

Shell completions are generated by `clap_complete` from the clap definition in
`src/main.rs` — they stay in sync automatically. Richness comes from the
doc-comment help on every arg plus `ValueHint::FilePath` on path arguments; keep
both when adding flags so completions stay descriptive.

`tests/e2e.sh` is the real integration test and the best end-to-end smoke check.
It synthesizes an mp4 with ffmpeg and produces four damaged inputs: two head
truncations, one with 60% cut off the tail, and a clip stripped down to a bare
headless payload. It then rescues all four (exercising the surgical, transplant,
and `divine` paths) and asserts bit-fidelity / zero decode errors. It requires
`ffmpeg` and `ffprobe` **with libx264** on PATH. Run it from the repo root;
override the binary with `BIN=...`.

## External dependencies

`ffmpeg`/`ffprobe` are the *only* runtime externals, and they are quarantined in
`src/ffx.rs` — every shell-out to them goes through that module. Everything else
(parsing, bit-splicing, index reconstruction) is hand-rolled Rust on purpose. If
you need ffmpeg behavior, add a wrapper in `ffx.rs` rather than spawning
processes elsewhere. `identify` works without ffmpeg; `divine` additionally needs
libx264 for donor synthesis.
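
As an illustration of the quarantine rule, a wrapper in `ffx.rs` might look roughly like the sketch below. The function names `ffprobe_args` and `ffprobe_json` are hypothetical and are not taken from the crate; only the `ffprobe` flags themselves (`-v error`, `-show_format`, `-show_streams`, `-of json`) are real. Splitting argument construction from process spawning is one possible design that keeps the argument list testable without ffmpeg installed.

```rust
use std::ffi::OsString;
use std::io;
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: build the argument list for an ffprobe call that
/// prints container and stream metadata as JSON.
pub fn ffprobe_args(input: &Path) -> Vec<OsString> {
    let mut args: Vec<OsString> = Vec::new();
    for flag in ["-v", "error", "-show_format", "-show_streams", "-of", "json"] {
        args.push(OsString::from(flag));
    }
    args.push(input.as_os_str().to_os_string());
    args
}

/// Hypothetical wrapper: the single place that actually spawns ffprobe.
/// Returns stdout on success, or an error carrying ffprobe's stderr.
pub fn ffprobe_json(input: &Path) -> io::Result<String> {
    let output = Command::new("ffprobe").args(ffprobe_args(input)).output()?;
    if !output.status.success() {
        let msg = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, msg));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Callers elsewhere in the crate would then depend on `ffprobe_json` (or a similar wrapper) instead of touching `std::process::Command` directly.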

## Architecture

Edition 2024. `src/main.rs` is a clap-derive subcommand dispatcher (`identify`,
`rescue`, `clip`, `divine`, `convert`); it does argument plumbing only and hands
off to module entry points. Modules form a dependency stack from raw bytes up:

- **`forensics.rs`** — the foundation. Scanners that hunt for *codec-level*
  structure (frame-sync chains, NAL units, atom skeletons) *anywhere* in a
  buffer, not just magic bytes at offset 0. `identify()` runs them all and
  returns confidence-ranked `Finding`s. Also exports the byte-level helpers
  (`scan_atoms`, `avcc_chain_len`, `KNOWN_ATOMS`) the reconstruction modules
  reuse.
- **`mp4.rs`** — ISO BMFF box walking and the *surgical* rescue: when a head-truncated
  MP4's `moov` index survives at the tail, work out exactly how many bytes `K`
  were lost (via mdat-header anchor or stsz/NAL-chain correlation) and regrow a
  prosthetic `ftyp`+`free` prefix so every absolute offset in the index points
  true again. `K` is verified against the data (≥80% of sampled samples must
  parse as valid NAL chains) before any reconstruction.
- **`h264.rs`** — H.264 bitstream surgery: Exp-Golomb read/write,
  emulation-prevention strip/reinsert, SPS parse *and patch*. Used to
  manufacture donor parameter sets whose field widths match streams no living
  encoder emits.
- **`transplant.rs`** — untrunc-style recovery when the index is gone *entirely*.
  Harvests a `--reference` sibling's organs (`stsd`/`tkhd`/timing) and rebuilds
  the index by walking the orphaned `mdat` one NAL unit at a time. Recovers
  B-frame display order (`ctts`) by parsing `pic_order_cnt_lsb` from slice
  headers. PCM audio splits exactly; AAC has no sync words to split on, so it
  is dropped here and salvaged separately (see `aac.rs`).
- **`divine.rs`** — when there's no index *and* no donor: brute-force the lost
  codec parameters. Synthesize a candidate donor per guess, decode the stream's
  own keyframe under it, and score the result. Three escalating stages
  (geometry/entropy → header semantics → optional neural). Writes
  `<name>.donor.mp4`, which the operator then feeds to `rescue --reference`.
- **`gestalt.rs`** — divine's judge: "does this buffer of pixels look like a
  *picture*?" Numerical proxies for the human glance (edge kurtosis, macroblock
  seam ratio, chroma sanity, inter-frame coherence), plus an optional tiny ONNX
  classifier (`tract-onnx`) that re-ranks survivors.
- **`aac.rs`** — raw-AAC audio salvage from the gaps *between* video samples,
  used by the transplant/rescue path. Recognizes the recurring CPE element
  header, wraps each gap in a synthetic ADTS header, lets the decoder walk it.
- **`rescue.rs`** — the orchestrator. Diagnoses the damage from forensics, routes
  to surgical (`mp4`) vs transplant (`transplant`) vs empirical (trim-to-sync)
  recovery, salvages audio (`aac`), validates the decode, and clips to the first
  clean keyframe. This is where the subcommand flags become a pipeline.
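
To make the "Exp-Golomb read/write" in the `h264.rs` entry concrete, here is a minimal, standalone sketch of unsigned Exp-Golomb coding, `ue(v)` in the H.264 specification. The `BitReader` and `BitWriter` types are illustrative and are not the crate's actual API. The sketch also assumes emulation-prevention bytes have already been stripped from the input.

```rust
/// MSB-first bit reader over a byte slice (illustrative type).
pub struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // position in bits
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    fn read_bit(&mut self) -> Option<u32> {
        let byte = *self.data.get(self.pos / 8)?;
        let bit = (byte >> (7 - (self.pos % 8))) & 1;
        self.pos += 1;
        Some(bit as u32)
    }

    /// Decode one ue(v): count leading zeros, then read that many suffix bits.
    pub fn read_ue(&mut self) -> Option<u32> {
        let mut zeros = 0u32;
        while self.read_bit()? == 0 {
            zeros += 1;
            if zeros > 31 {
                return None; // would not fit in a u32
            }
        }
        let mut suffix = 0u32;
        for _ in 0..zeros {
            suffix = (suffix << 1) | self.read_bit()?;
        }
        Some((1u32 << zeros) - 1 + suffix)
    }
}

/// MSB-first bit writer (illustrative type); the final byte is zero-padded.
pub struct BitWriter {
    bytes: Vec<u8>,
    nbits: usize,
}

impl BitWriter {
    pub fn new() -> Self {
        Self { bytes: Vec::new(), nbits: 0 }
    }

    fn write_bit(&mut self, bit: u32) {
        if self.nbits % 8 == 0 {
            self.bytes.push(0);
        }
        if bit & 1 == 1 {
            let last = self.bytes.len() - 1;
            self.bytes[last] |= 0x80 >> (self.nbits % 8);
        }
        self.nbits += 1;
    }

    /// Encode one ue(v): write value+1 in binary, preceded by (len-1) zeros.
    pub fn write_ue(&mut self, value: u32) {
        let code = value as u64 + 1;
        let len = 64 - code.leading_zeros();
        for _ in 0..len - 1 {
            self.write_bit(0);
        }
        for i in (0..len).rev() {
            self.write_bit(((code >> i) & 1) as u32);
        }
    }

    pub fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }
}
```

Because the codes are variable-length, changing one field in an SPS shifts every bit after it. Patching a parameter set therefore means decoding the fields, changing the value, and re-encoding the tail, not overwriting bytes in place.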

The subcommands are a **ladder of escalating desperation**, each step engaging a
deeper module: `identify` → `clip` (file opens, just artifacts) → `rescue`
(surgical, moov survived) → `rescue --reference` (transplant, index gone but you
have a sibling) → `divine` then `rescue --reference` (manufacture the sibling).
The README's mermaid flowchart is the canonical decision tree.
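
Read as code, the ladder is roughly a chain of early returns. The sketch below is a simplification for orientation only: `Diagnosis`, `Rung`, and `choose_rung` are hypothetical names, the three boolean fields are assumed inputs, and the empirical trim-to-sync route inside `rescue.rs` is left out. The README flowchart remains the authority.

```rust
/// Hypothetical summary of what forensics found in a damaged file.
#[derive(Debug, Clone, Copy)]
pub struct Diagnosis {
    /// The file already opens; it only shows artifacts.
    pub opens: bool,
    /// The `moov` index survived (for example at the tail).
    pub moov_survived: bool,
    /// The operator supplied an intact sibling via `--reference`.
    pub has_reference: bool,
}

/// The rungs of the ladder, in order of escalating desperation.
#[derive(Debug, PartialEq, Eq)]
pub enum Rung {
    Clip,
    Surgical,
    Transplant,
    DivineThenTransplant,
}

/// Pick the shallowest rung that the evidence supports.
pub fn choose_rung(d: Diagnosis) -> Rung {
    if d.opens {
        Rung::Clip
    } else if d.moov_survived {
        Rung::Surgical
    } else if d.has_reference {
        Rung::Transplant
    } else {
        Rung::DivineThenTransplant
    }
}
```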

## Conventions worth knowing

- The whole input file is read into memory. Fine for phone videos; do not assume
  streaming.
- `convert`/`rescue --the-correct-format` is opinionated by design: the only
  output formats are mp4 (H.264+AAC) for video and mp3 for audio. The
  `--the-correct-format` flag is mandatory consent, not a no-op — `convert`
  refuses without it.
- Recovery code is forensic: prefer verifying a reconstruction against the
  surviving bytes (decode-error counts, NAL-chain parse rates, gestalt scores)
  over trusting a single signal. New recovery paths should fail *loudly and
  honestly* when evidence is insufficient — see the README's "Honest
  limitations" for the contract.
- `samples/` is gitignored, machine-local casework. `samples/NOTES.md` is the
  running lab notebook for real-file rescues (also tracked in auto-memory).
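
The "verify against the surviving bytes, then fail loudly" convention can be sketched as follows, using the ≥80% NAL-chain parse rate from the surgical path as the threshold. Everything named here (`is_avcc_chain`, `verify_candidate_k`, `VerifyError`) is hypothetical, and the 4-byte length prefix is an assumption: real streams can also use 1- or 2-byte NAL length fields.

```rust
/// Hypothetical check: does `sample` parse as a chain of 4-byte
/// length-prefixed NAL units that exactly fills the sample?
pub fn is_avcc_chain(sample: &[u8]) -> bool {
    let mut pos = 0usize;
    let mut units = 0usize;
    while pos < sample.len() {
        if sample.len() - pos < 4 {
            return false;
        }
        let len = u32::from_be_bytes([
            sample[pos],
            sample[pos + 1],
            sample[pos + 2],
            sample[pos + 3],
        ]) as usize;
        pos += 4;
        if len == 0 || len > sample.len() - pos {
            return false;
        }
        // forbidden_zero_bit must be 0 in every NAL header byte.
        if sample[pos] & 0x80 != 0 {
            return false;
        }
        pos += len;
        units += 1;
    }
    units > 0
}

#[derive(Debug, PartialEq)]
pub enum VerifyError {
    NoSamples,
    TooFewValid { valid: usize, total: usize },
}

/// `samples` holds (original absolute offset, size) pairs from the surviving
/// index; `k` is the candidate number of bytes lost from the head.
pub fn verify_candidate_k(
    data: &[u8],
    samples: &[(usize, usize)],
    k: usize,
) -> Result<(), VerifyError> {
    if samples.is_empty() {
        return Err(VerifyError::NoSamples);
    }
    let valid = samples
        .iter()
        .filter(|&&(off, size)| {
            // An original offset `off` now lives at `off - k` in the truncated file.
            off.checked_sub(k)
                .and_then(|start| data.get(start..start.checked_add(size)?))
                .map_or(false, |s| is_avcc_chain(s))
        })
        .count();
    // valid / total >= 0.8, kept in integer arithmetic.
    if valid * 5 >= samples.len() * 4 {
        Ok(())
    } else {
        Err(VerifyError::TooFewValid { valid, total: samples.len() })
    }
}
```

Returning a structured error instead of a best-effort reconstruction lets the caller report exactly how weak the evidence was.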

## Contributing, CI & releases

- **`main` is protected: no direct pushes** (enforced for admins too). All
  changes land via PR, **squash-merged**. The squash commit takes the PR title,
  so **PR titles must be Conventional Commits** (`feat:`, `fix:`, `ci:`, …) —
  enforced by the `pr-title` workflow, which is a required check.
- **Conventional commits locally**: hooks are managed by
  [lefthook](https://lefthook.dev) (`lefthook.yml`). Run `lefthook install` once
  per clone (`brew install lefthook` first); a `commit-msg` hook mirrors the
  PR-title gate. It's a convenience — the server-side check is the real gate.
- **CI** (`.github/workflows/ci.yml`): `fmt --check`, `clippy -D warnings`, and
  `cargo test` are the required checks; an `e2e` job (installs ffmpeg, runs
  `tests/e2e.sh`) runs but is intentionally non-required (the brute-force
  `divine` step has some variance).
- **Releases are automated by [release-plz](https://release-plz.dev)**: merging
  conventional commits to `main` keeps a release PR open (version bump +
  changelog); merging *that* publishes to crates.io and cuts a GitHub release,
  which triggers `release.yml` to build the 5-target binary matrix and attach
  `basinski-<target>.{tar.gz,zip}` assets (the naming `ubi`/`mise` expect).
  Requires the `CARGO_REGISTRY_TOKEN` and `RELEASE_PLZ_TOKEN` repo secrets.