llkv-column-map 0.8.5-alpha

# ColumnStore GC (Reclaim) Notes

This summarizes a pragmatic, low-risk approach to reclaiming space in the append-only `ColumnStore`. It assumes the current design:

* Writes create **segments** (index + data blob).
* Deletes are **tombstones** (variable layout entries with zero-length values).
* Reads are **correct** without GC: newest-first shadowing and `[min,max]` pruning; zero-len values = “missing”.

GC here is purely an optimization for disk usage and read I/O, not correctness.

---

## Goals

* **Reclaim space** by removing segments that are completely shadowed by newer data/tombstones.
* **Reduce read cost** when many overlapping segments exist.
* **Never load “everything”**: operate with strict per-run budgets; fetch only a few typed segments at a time.
* **Crash-safe & simple**: prefer safe leaks over risk of breaking reads.

---

## When is GC needed?

* You can **run for a long time with no GC**. Reads prune by `[min,max]` and stop after the newest hit (value or tombstone).
* Start GC when you observe:

  * Growing **segments per column** (esp. overlapping ranges).
  * Rising `io_stats.get_typed_ops` for the same queries (keys match multiple segments before a hit).
  * Storage pressure.

---

## Strategy: Index-Only, Budgeted “Fully Shadowed” Reclaim

**Key idea:** Decide reclaimability using **typed index data only** (logical keys + layouts). No data blobs required.

A segment **S\_old** is *fully shadowed* if, for **every key** in it, there exists a **newer** segment containing that key **or** a tombstone for it. If so, removing **S\_old** is safe.

### Why index-only?

* To check presence of a key you only need the newer segments’ **logical keys**.
* To detect tombstones you only need the newer segment’s **value layout offsets** (zero-length window). No need to touch data blobs.

---

## Per-Run Budget & Scheduling

Run GC **opportunistically** with a **small budget**:

* Trigger after `append_many` **only if** a column’s segment count exceeds a threshold, or run on a timer/maintenance thread.
* Each run touches at most:

  * **N columns**, starting with those having the most segments.
  * **M candidate segments per column** (oldest first).
  * **K newer segments per candidate** (newest first) during checks.
  * **T keys per candidate** (optional sampling/early-exit).

This keeps memory and I/O bounded and prevents “load everything”.

---

## Safe Update Ordering

When reclaiming a **fully shadowed** segment:

1. **Update the ColumnIndex** (remove the segment ref) and **persist** it.
2. **Free** the segment’s **index** and **data** physical keys.

If a crash happens after (1) but before (2): safe leak.
If you reversed the order you could break reads—so don’t.

---

## Pseudocode (sketch)

```rust
struct ReclaimOpts {
    max_columns: usize,              // how many columns to consider this run
    max_candidates_per_col: usize,   // oldest segments to test per column
    max_newer_to_check: usize,       // how many newer segments to consult (newest-first)
    max_keys_to_check: Option<usize>,// Some(N)=>sample/early-exit; None=>check all keys
    min_shadowed_fraction: f32,      // e.g. 1.0 for “fully”; or <1.0 if you later add partial rewrite
}

fn reclaim_budgeted<P: Pager>(store: &ColumnStore<P>, opts: ReclaimOpts) {
    // 1) Pick columns with largest segment counts.
    let columns = pick_columns_by_segment_count(store, opts.max_columns);

    for fid in columns {
        // 2) Oldest-first candidates.
        let segs = store.colindex(fid).segments.clone();
        for cand in segs.iter().rev().take(opts.max_candidates_per_col) {
            // 3) Collect up to K newer segments that overlap [min,max] by key-range.
            let newer = newest_overlapping_segments(&segs, cand, opts.max_newer_to_check);

            // 4) Load typed index data for {cand} + {newer}. No data blobs.
            let typed = batch_load_index_segments(newer.iter().chain([cand]));

            // 5) For each key in cand (or a sample):
            //    - binary_search across newer (newest-first)
            //    - count as shadowed if found OR if found with zero-length (tombstone).
            let shadowed_ratio = check_shadowed_fraction(&typed, cand, opts.max_keys_to_check);

            if shadowed_ratio >= opts.min_shadowed_fraction {
                // 6) Persist ColumnIndex without cand, then free cand.index_pk & cand.data_pk.
                persist_colindex_without(store, fid, cand.index_physical_key);
                store.do_frees(&[cand.index_physical_key, cand.data_physical_key]);
                // budgeted: optionally stop early if we hit a per-run free limit
            }
        }
    }
}
```

**Notes**

* `newest_overlapping_segments` uses `[min,max]` to prune aggressively.
* `check_shadowed_fraction` uses binary search with the keys from `cand` against each newer segment’s key layout; tombstone detection = `a == b` in `value_offsets`.
* Start with `min_shadowed_fraction = 1.0` (only full reclaims). Partial compaction is optional future work.

---

## Performance Characteristics

* **Memory:** bounded by the number of typed segments you load concurrently (1 candidate + K newer). No data blobs are fetched.
* **I/O:** one batched typed get for that small set, plus a single typed put (ColumnIndex) and 2 frees when reclaiming.
* **Time:** roughly `O(|keys_in_candidate| * log K)` per candidate (binary search over ≤K newer segments).

---

## Practical Defaults (good starting points)

* `max_columns`: 4–16
* `max_candidates_per_col`: 2–8 (oldest first)
* `max_newer_to_check`: 4–16 (newest first)
* `max_keys_to_check`: `None` (exact) or `Some(4_096)` with early-exit once “not fully shadowed” is proven
* `min_shadowed_fraction`: `1.0` (full reclaim)

These are **reclaim-only knobs** (unrelated to ingest or read paths).

---

## Metrics to Watch

Use `IoStats`:

* **`get_typed_ops`** during reads: stays ≈1/key when overlap is small; climbs as overlap grows.
* **`free_ops`** during GC runs: confirms reclaimed keys.
* Consider tracking “segments per column” and “bytes freed” (estimated from `describe_storage()`).

---

## What This GC Does Not Do (yet)

* **Partial compaction** (rewriting a candidate with only its live subset). That’s a follow-up once full-shadow reclaim is stable.
* **Global dedupe** across many segments at once.
* **Background threads**: you can still run it periodically or after `append_many` when segment count crosses a threshold.

---

## TL;DR

* You don’t **need** GC for correctness; reads are already correct and efficient when overlap is small.
* Add a **budgeted, index-only** reclaim pass that removes **fully shadowed** old segments.
* Run it **occasionally** (threshold-triggered or scheduled) with small budgets so it never tries to load “everything.”
* Update ColumnIndex **first**, then free pkeys—safe and simple.