Module gc

Two-phase mark-and-sweep garbage collection for orphan packs (issue #66, Phase 5 of #52).

Orphan packs are pack files in <prefix>/packs/ that no chain.json references. They accumulate from:

  • Force push: replaces a chain’s segments; old packs become orphan.
  • Lost-race push: a pre-lock pack upload by the loser of a concurrent push (Phase 2 design — packs upload pre-lock to keep the lock window short, and the loser’s pack is left orphan).
  • Aborted push: a crash between pack upload and chain.json commit leaves orphans the next push doesn’t reach.
  • Branch deletion: delete-branch removes chain.json and path-index.json but does not touch <prefix>/packs/. The umbrella issue’s claim that packs are “exclusively owned by that branch” does not hold under content-hash dedup: pack keys can be shared across branches that ever pushed identical object sets. The baseline bundle (<prefix>/<ref>/<full_at>.bundle) is tombstoned rather than deleted synchronously (issue #143), so an in-flight fetcher that already read the prior chain.json can still complete its range GET; the bundle is reclaimed by sweep after the grace window.
  • Compaction (when implemented): a chain rewrite leaves the superseded segment packs orphan.
  • Missing .idx (rare): a .pack whose sibling .idx was manually deleted is treated as orphan and tombstoned.

§Two-phase mark-and-sweep

Naive deletion (“delete every pack older than 24 h”) races a concurrent fetch on a freshly-orphaned pack: the pack’s last_modified reflects upload time, not orphan time. The mark/sweep split fixes this by tombstoning at orphan time and deferring deletion until after a configurable grace window.

§Phase 1 (mark)

  1. List <prefix>/packs/ to snapshot the packs currently on the bucket. Packs-first is deliberate (issue #135): see “Concurrency” below.
  2. List <prefix>/refs/**/chain.json across every ref namespace (refs/heads/, refs/tags/, refs/notes/, etc.), parse each, collect referenced pack content-shas.
  3. Fail closed on parse error: abort, log the bad key, do not write tombstones. A corrupt chain could under-report the referenced set and tombstone live packs.
  4. Derive the orphan set (on_bucket - referenced) and write <prefix>/gc/tombstones-<run_id>-<rfc3339>.json.
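
Steps 2 to 4 can be sketched as a set difference with a fail-closed guard. This is a minimal illustration, not this module's implementation: ChainParse and derive_orphans are hypothetical names, and a parsed chain.json is reduced to the list of pack SHAs it references.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the result of parsing one chain.json:
/// either the pack SHAs it references, or a parse error message.
pub type ChainParse = Result<Vec<String>, String>;

/// Derive the orphan set (on_bucket - referenced) for one mark run.
/// Returns Err on the first unparseable chain so the caller writes no tombstone.
pub fn derive_orphans(
    on_bucket: &BTreeSet<String>,
    chains: &[(String, ChainParse)],
) -> Result<BTreeSet<String>, String> {
    let mut referenced: BTreeSet<String> = BTreeSet::new();
    for (key, parsed) in chains {
        match parsed {
            Ok(shas) => referenced.extend(shas.iter().cloned()),
            // Fail closed: a corrupt chain could under-report the referenced
            // set, so abort and name the bad key instead of continuing.
            Err(e) => return Err(format!("unparseable chain {}: {}", key, e)),
        }
    }
    Ok(on_bucket.difference(&referenced).cloned().collect())
}
```

Returning Err before any tombstone is written keeps the failure mode conservative: a run that cannot read every chain marks nothing.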

§Phase 2 (sweep)

  1. List <prefix>/gc/tombstones-*.json.
  2. For each tombstone past the grace age:
    • Re-derive the orphan set from the current chain state. This is repeated per tombstone, not cached across the sweep: a concurrent push committing chain.json mid-sweep would let a cached snapshot delete a pack the new chain references, permanently dangling the reference (issue #140). Force-revert is the canonical trigger: deterministic gix pack emission lets the new push reuse the tombstoned pack key without re-uploading. The cost is one list("refs/") per eligible tombstone instead of one per sweep. That overhead is linear in the number of eligible tombstones, which is O(1) in the common case, so correctness wins.
    • For each pack still orphan, delete .pack + .idx idempotently (a prior partial sweep is fine).
    • Delete the tombstone itself.
  3. Younger tombstones survive for the next sweep.
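
The sweep loop above can be sketched against an in-memory stand-in for the bucket. Everything here is an assumption made for illustration: Bucket, Tombstone, and sweep_once are hypothetical names, a tombstone carries a precomputed age, "past the grace age" is read as age >= grace, and the .pack/.idx pair is collapsed into one set entry per SHA.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::time::Duration;

/// Hypothetical in-memory tombstone: time since marking, plus the SHAs it names.
pub struct Tombstone {
    pub age: Duration,
    pub shas: Vec<String>,
}

/// Hypothetical in-memory stand-in for the bucket.
pub struct Bucket {
    pub packs: BTreeSet<String>,
    pub tombstones: BTreeMap<String, Tombstone>,
}

/// One sweep pass. `referenced_now` is invoked once per eligible tombstone,
/// never cached, so a chain commit landing mid-sweep is observed.
/// Returns the SHAs actually deleted.
pub fn sweep_once(
    bucket: &mut Bucket,
    grace: Duration,
    mut referenced_now: impl FnMut() -> BTreeSet<String>,
) -> Vec<String> {
    // Younger tombstones are skipped and survive for the next sweep.
    let eligible: Vec<String> = bucket
        .tombstones
        .iter()
        .filter(|(_, t)| t.age >= grace)
        .map(|(k, _)| k.clone())
        .collect();

    let mut deleted = Vec::new();
    for key in eligible {
        let live = referenced_now(); // re-derived per tombstone
        if let Some(tomb) = bucket.tombstones.remove(&key) {
            for sha in tomb.shas {
                if live.contains(&sha) {
                    continue; // pack became live again: leave it alone
                }
                // `remove` returns false when the pack is already gone,
                // which makes a repeated or partial sweep harmless.
                if bucket.packs.remove(&sha) {
                    deleted.push(sha);
                }
            }
        }
    }
    deleted
}
```

Passing the referenced set as a closure rather than a value is what makes the per-tombstone re-derive visible in the signature: a caller cannot accidentally hand in one stale snapshot for the whole sweep.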

§Baseline-bundle tombstones (issues #134, #143)

Baseline bundles at <prefix>/<ref>/<full_at>.bundle are NOT reapable by the mark/sweep flow above — they live outside <prefix>/packs/, so [list_pack_shas] never sees them. The compact, force-push, and delete-branch code paths instead enqueue a baseline tombstone at <prefix>/gc/baseline-tomb-<uuid>.json whenever they supersede or remove a baseline. Sweep processes those alongside pack tombstones: after the grace window expires it re-checks the current chain.json for the ref (skipping the delete if a later push re-baselined to the same SHA), then deletes the bundle and the tombstone. The bundle stays in place for the entire grace window, so a concurrent fetch that read the prior chain.json before the compact/force-push committed can still download it.
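
The re-check that sweep applies to a baseline tombstone can be sketched as a pure decision function. This is a hedged illustration only: BaselineTomb, BaselineAction, and baseline_action are hypothetical names, the real tombstone JSON fields are not documented here, and the sketch models only the bundle decision, not what happens to the tombstone file afterwards.

```rust
/// Hypothetical payload of a baseline tombstone whose grace window has expired.
pub struct BaselineTomb {
    pub ref_name: String,
    pub full_at: String, // SHA of the superseded or removed baseline
}

#[derive(Debug, PartialEq, Eq)]
pub enum BaselineAction {
    KeepBundle,
    DeleteBundle,
}

/// `current_full_at` is the baseline SHA in the ref's current chain.json,
/// or None when the ref no longer has a chain (for example after delete-branch).
pub fn baseline_action(tomb: &BaselineTomb, current_full_at: Option<&str>) -> BaselineAction {
    match current_full_at {
        // A later push re-baselined to the same SHA: the bundle is live again.
        Some(sha) if sha == tomb.full_at => BaselineAction::KeepBundle,
        _ => BaselineAction::DeleteBundle,
    }
}
```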

§--force

Skips ONLY the grace window. The live-pack re-check still runs: a tombstone whose SHA appears in the current chain set is left alone. This closes the race where mark() snapshots packs after a concurrent push has uploaded packs/<sha>.{pack,idx} but has not yet committed chain.json — by sweep time the chain has landed and the pack is live, so the stale tombstone must not delete it. A tracing::warn! line records the operator’s choice.
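
One way to picture the split is as two independent checks, only one of which takes the force flag. The function names below are hypothetical and the sketch assumes a tombstone's age is already known. It is not the signature of sweep or SweepOpts.

```rust
use std::collections::BTreeSet;
use std::time::Duration;

/// Whether a tombstone may be processed in this sweep.
/// `force` bypasses the grace window and nothing else.
pub fn tombstone_eligible(age: Duration, grace: Duration, force: bool) -> bool {
    force || age >= grace
}

/// Whether a pack named by an eligible tombstone may be deleted.
/// `force` is deliberately absent: the live-pack re-check always runs.
pub fn pack_deletable(sha: &str, referenced_now: &BTreeSet<String>) -> bool {
    !referenced_now.contains(sha)
}
```

Keeping the flag out of the second function's parameters makes it impossible for a forced sweep to skip the re-check by construction.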

§Concurrency

Two operators running gc simultaneously each get a UUIDv4 run id → distinct tombstone files, no clobber. Concurrent sweeps tolerate NotFound on already-deleted packs.
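
The no-clobber property follows from the key layout <prefix>/gc/tombstones-<run_id>-<rfc3339>.json given under Phase 1. A minimal sketch, assuming the caller supplies the run id and timestamp as already-formatted strings and that the prefix has no trailing slash (tombstone_key is a hypothetical helper, not part of this module's API):

```rust
/// Build the object key for one mark run's tombstone file.
/// Distinct run ids yield distinct keys even when the timestamps collide.
pub fn tombstone_key(prefix: &str, run_id: &str, marked_at_rfc3339: &str) -> String {
    format!("{}/gc/tombstones-{}-{}.json", prefix, run_id, marked_at_rfc3339)
}
```

Two runs started in the same second therefore write to different objects, so neither overwrites the other's orphan list.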

Mark lists packs first, then chains (issue #135). With this order, a push landing during mark either:

  • uploaded its pack after [list_pack_shas] — the pack is not in the on-bucket snapshot, so it cannot enter the orphan set regardless of when its chain commits; or
  • uploaded its pack before [list_pack_shas] AND committed chain.json before [list_referenced_packs] — the pack is in the referenced set, so it is filtered out of orphans; or
  • uploaded its pack before [list_pack_shas] and has not yet committed chain.json by the time [list_referenced_packs] runs — the pack is tombstoned, but the grace window leaves it readable long enough for the push to complete (the genuine-orphan case for an aborted push is exactly what the GC is designed to reap).
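
The three cases can be written as a small classifier over logical timestamps. This is an illustrative model, not code from this module: MarkView and classify are hypothetical names, times are abstract counters, and the sketch assumes the packs-first order (list_packs_at earlier than list_chains_at) with no exact ties.

```rust
/// How one packs-first mark run sees the pack of a concurrent push.
#[derive(Debug, PartialEq, Eq)]
pub enum MarkView {
    /// Uploaded after the pack listing: cannot enter the orphan set.
    NotInSnapshot,
    /// In the snapshot and its chain was committed before the chain listing.
    Referenced,
    /// In the snapshot but not yet referenced: tombstoned, protected by the grace window.
    Tombstoned,
}

/// `chain_committed_at` is None for a push that never commits (an aborted push).
pub fn classify(
    pack_uploaded_at: u64,
    chain_committed_at: Option<u64>,
    list_packs_at: u64,
    list_chains_at: u64,
) -> MarkView {
    if pack_uploaded_at > list_packs_at {
        return MarkView::NotInSnapshot;
    }
    match chain_committed_at {
        Some(t) if t < list_chains_at => MarkView::Referenced,
        _ => MarkView::Tombstoned,
    }
}
```

An aborted push and a slow in-flight push land in the same Tombstoned case at mark time; what separates them is whether the chain commit arrives before sweep re-derives the orphan set.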

The reverse order (chains-first) is the bug fixed by #135: a chain commit landing between the chain list and the pack list would let a freshly-uploaded pack appear in [list_pack_shas] without appearing in [list_referenced_packs], producing a false-positive tombstone. Sweep’s per-tombstone re-derive (issue #140) would usually catch that at sweep time, but a --force sweep run in the same session as mark (e.g. compact --with-gc) could still delete the live pack before the push’s chain commit lands.

The grace window separately covers a fetch reading an old chain whose packs are about to be swept.

Structs§

MarkOpts
Knobs for mark.
MarkOutcome
Outcome of mark.
SweepOpts
Knobs for sweep.
SweepOutcome
Outcome of sweep.

Constants§

DEFAULT_GRACE_HOURS
Default grace window between mark and sweep (24 hours). A pack tombstoned during mark is only deletable after this duration has elapsed since marked_at.
TOMBSTONE_SCHEMA_VERSION
On-bucket schema version this build reads and writes.

Functions§

mark
Run the mark phase: snapshot every pack on the bucket, then every chain, then write a tombstone naming the orphans.
sweep
Run the sweep phase: walk tombstones, delete eligible orphans.