rivet-cli 0.16.2

# PostgreSQL DB-harm matrix — what each tool does to the SOURCE

Throughput and RSS measure the *tool*. This measures what the tool does to the
*source database* while it extracts the wide `content_items` fixture (1,928,176
rows × 20 cols). Most ingestion benchmarks never look at this — it's where a
chunked, budgeted reader pulls away from a single `SELECT *`.

Generated by `harness/pg_harm_matrix.py`: samples `pg_stat_activity` every 250 ms
during each tool's run (peak active backends, longest single in-flight query) on
the warm cache, same box.

**Tool versions (all latest as of this run):** rivet 0.14.0 · ingestr 1.0.44 ·
sling 1.5.20 · duckdb 1.5.4 · odbc2parquet 11.0.1.

## The headline metric: snapshot hold-time

`longest query (s)` = how long the tool pinned a single snapshot. A `SELECT *` tool
holds **one transaction open for the entire extraction** — that blocks `VACUUM` and
pins dead tuples for the whole run, the real production harm. A chunked reader holds
only one short query at a time.

### Defaults

| tool | wall (s) | peak backends | longest query (s) | temp spilled (MB) |
| --- | --- | --- | --- | --- |
| **rivet (chunked)** | 63.1 | **1** | **1.2** | 0 |
| ingestr | 36.9 | 1 | 34.8 | 0 |
| duckdb | 37.3 | **10** | 9.7 | 0 |
| odbc2parquet | 106.2 | 1 | 35.4 | 0 |
| sling ⚠️ failed (rc=1) | 125.6 | 1 | 125.2 | 0 |

### Steelman — each `SELECT *` tool given its best fetch config

| tool | wall (s) | peak backends | longest query (s) | temp spilled (MB) |
| --- | --- | --- | --- | --- |
| **rivet (chunked)** | 63.3 | **1** | **0.8** | 0 |
| ingestr `--page-size 100000` | 41.4 | 1 | **39.0** | 0 |
| duckdb | 35.4 | **10** | 4.9 | 0 |
| odbc2parquet `--batch-size-memory 1Gib` | 65.1 | 1 | **34.3** | 0 |
| sling ⚠️ failed (rc=1) | 125.5 | 1 | 124.9 | 0 |

## What it shows (read honestly)

- **rivet holds the source ~1 second; every competitor holds it 5–125 seconds** —
  for most, the *entire run* is one transaction. That long-held snapshot is what
  hurts a busy production primary (VACUUM blocked, dead tuples pinned, bloat).
- **The steelman proves this is STRUCTURAL, not a tuning gap.** Giving ingestr a
  bigger page and odbc2parquet a bigger batch did **not** move the hold (34.8→39.0,
  35.4→34.3) — those knobs size the client-side fetch buffer, not the source query.
  Only **rivet (keyset chunking)** and **duckdb (ctid-parallel)** issue *many* source
  queries; the rest issue *one* `SELECT *` and there is no flag to change that.
- **duckdb gets a shortish hold (≈5–10 s) — but via 10 backends.** rivet is the only
  tool that gets a sub-second hold with a **single** connection (short hold *and*
  light on connection count).
- **What we deliberately do NOT claim:** `temp spilled` is 0 for everyone (no
  `ORDER BY` → no sort/hash spill here) — not a differentiator. A disk-blocks metric
  was dropped: `pg_stat_database` is database-wide and under-counts cursor/parallel
  reads, so it was noise (and happened to *disfavour* rivet) — not an honest signal.
- **sling failed (rc=1) on the full 1.93 M wide table**, at defaults and steelman
  alike — it held a snapshot ~125 s and then errored. An honest data point: it did
  not complete this extraction at this scale.

## Trade-off, stated plainly

rivet spends ~1.7× the wall time of the fastest competitor (chunked keyset has more
round-trips) to hold the source snapshot **~30–100× shorter, with one backend**. On
a production primary that nobody can take offline, the snapshot hold-time is the
metric that decides whether an extract is safe to run — and it's the one a `SELECT *`
tool cannot configure away.

## Caveats
macOS arm64, single box (not the Linux bench machine). 250 ms sampling resolution.
rivet uses `mode: chunked` (byte budget) — its recommended source-safe mode.