rivet-cli 0.16.0

# Cross-tool benchmark

How Rivet compares to five other Postgres / MySQL → Parquet extractors
(`sling`, `dlt`, `duckdb` postgres/mysql scanner, `clickhouse-local`, `odbc2parquet`)
on the same workload, and how to reproduce the numbers yourself.

The harness here measures two things:

1. **Wall / RSS / CPU / output** per (tool, table) — every tool produces the
   same Parquet file from the same source row set, captured via `gtime -v`.
2. **DB-side signals** during the run — `pg_stat_database` /
   `pg_stat_activity` deltas on PG, `SHOW GLOBAL STATUS` /
   `information_schema.processlist` on MySQL. Reveals what each tool's wire
   pattern looks like from the database's seat (long queries, temp spill,
   concurrent backends, query templates).

## Headline (latest run)

| Axis | PG winner | MySQL winner |
|---|---|---|
| Total wall (all tables) | clickhouse, ~150 s | rivet, ~180 s |
| Total user CPU | rivet, ~30 s | rivet, ~30 s |
| Peak RSS | rivet, ~440 MB | rivet, ~580 MB |
| Longest single query (content_items) | rivet, ~0.06 s | rivet, ~4 s |
| temp_bytes / Created_tmp_disk_tables | rivet, 0 | rivet, near-baseline noise |
| Failures across all tables | 0 (rivet, sling, clickhouse) | 0 (all five) |

Rivet wins or ties on every **source-friendliness** axis on both databases.
On total wall, `clickhouse-local` is faster on PG (one big `COPY (SELECT …)
TO STDOUT` per table); on MySQL Rivet is fastest because its size-capped
row_buf decouples per-flush RAM from chunk size.

Full numbers in [`reports/`](reports/):

- [`REPORT_pg.md`](reports/REPORT_pg.md) — Postgres, 6 tools × 22 tables (defaults)
- [`REPORT_mysql.md`](reports/REPORT_mysql.md) — MySQL, 5 tools × 17 tables (defaults)
- [`REPORT_mssql.md`](reports/REPORT_mssql.md) — SQL Server, **DBA-harm signals** (lock / transaction / log footprint; competitive perf matrix deferred)
- [`REPORT_combined.md`](reports/REPORT_combined.md) — PG ↔ MySQL drift per tool
- [`REPORT_steelman.md`](reports/REPORT_steelman.md) — same suite re-run with every other tool's chunking / memory knobs tuned, to answer "do they catch up if you put effort in?"
- [`REPORT_throttle_vs_harm.md`](reports/REPORT_throttle_vs_harm.md) — does the `balanced` 50 ms/batch throttle earn its 6.4× wall-clock? (measured by `export_harm`: ~0% harm removed, snapshot held 6× longer)

## Reproducing

### Prereqs

- macOS / Linux with [`docker compose up -d`](../../docker-compose.yaml) running
  the project's `postgres` and `mysql` containers (creds `rivet:rivet`).
- Tools on `$PATH`: `rivet`, `sling`, `duckdb`, `gtime` (coreutils),
  `psql`, `mysql`. `odbc2parquet` + `psqlodbc` are PG-only and optional.

### Seed data

Postgres uses [`dev/bench/seed_bench_pg.sql`](../../dev/bench/seed_bench_pg.sql);
MySQL uses [`dev/bench/seed_bench_mysql.sql`](../../dev/bench/seed_bench_mysql.sql).
SQL Server (DBA-harm probe only) uses
[`dev/bench/seed_bench_mssql.sql`](../../dev/bench/seed_bench_mssql.sql) +
[`harness/mssql_db_bench.sh`](harness/mssql_db_bench.sh).
Both seed the same `content_items`, `page_views`, `audit_log`, `users`, etc.
shapes used in the published runs. Apply them once after the containers come up.

### One-time setup

```bash
bash docs/bench/harness/setup.sh
```

Creates a Python venv at `$BENCH_ROOT/.venv` (default `/tmp/rivet_bench`),
installs `dlt + pyarrow + pymysql + psycopg2-binary`, downloads
`clickhouse-local` into `$BENCH_ROOT/bin/`. Verifies the other tools are
already on `$PATH`.

### Run the benches

```bash
# Both databases, perf only (~5 min wall on local docker):
docs/bench/harness/run_all.sh both

# Add the DB-signal sampler over the three big tables (~20 min more):
docs/bench/harness/run_all.sh both signals

# Just one DB:
docs/bench/harness/run_all.sh pg
docs/bench/harness/run_all.sh mysql signals
```

The orchestrator writes:

- Per-cell JSON to `$BENCH_ROOT/results/`, `$BENCH_ROOT/results_mysql/`,
  `$BENCH_ROOT/db_signals/`, `$BENCH_ROOT/db_signals_mysql/`.
- Aggregated Markdown to `docs/bench/reports/` (overwrites in place).

To re-render reports without re-running benches:

```bash
$BENCH_ROOT/.venv/bin/python docs/bench/harness/aggregate.py
```

### Running a single (tool, table) cell

```bash
export BENCH_ROOT=/tmp/rivet_bench
docs/bench/harness/bench.sh rivet content_items
docs/bench/harness/bench_mysql.sh clickhouse audit_log
docs/bench/harness/db_bench.sh rivet content_items     # adds pg_stat snapshots
```

### Rivet config used

The Rivet runs use [`configs/rivet_pg.yaml`](configs/rivet_pg.yaml) and
[`configs/rivet_mysql.yaml`](configs/rivet_mysql.yaml) — the same
`table: + mode: chunked + chunk_size_memory_mb` strategy you'd write
for a production export. The harness extracts one export at a time
(via the YAML loader at the top of `bench.sh`) so each cell's `gtime`
reading is clean.

## What the JSON records

Per perf cell (`$BENCH_ROOT/results/<tool>_<table>.json`):

```json
{
  "tool": "rivet",
  "table": "content_items",
  "rc": 0,
  "wall": "2:48.7",
  "user_s": "29.1",
  "rss_kb": "440032",
  "pq_files": 25,
  "pq_bytes": 2018870706,
  "rows": 2001291
}
```

Per DB-signal cell (`$BENCH_ROOT/db_signals/<tool>_<table>.json`):

```json
{
  "pg_stat_database_delta": {
    "xact_commit": 4663, "temp_bytes": 0, "temp_files": 0,
    "blks_read": 924552, "blks_hit": 9218424
  },
  "activity_during_run": {
    "peak_active_backends": 1,
    "longest_query_seconds": 0.06,
    "distinct_query_templates": 4,
    "templates_observed": ["FETCH 50000 FROM _rive", "SELECT COUNT(*) FROM …"]
  }
}
```

## Methodology notes

- **gtime is GNU `time` from `coreutils`** (`brew install coreutils` on macOS);
  built-in `/usr/bin/time` reports a subset of fields.
- **Output size counts only `*.parquet`** — dlt's `_dlt_loads` / `_dlt_version`
  metadata files are excluded so the comparison stays apples-to-apples.
- **DB signals are *cluster-wide* counters**, not per-session. Single-tenant
  test box → numbers are clean; shared prod host → expect noise.
- **MySQL `processlist.time`** is whole-second granular. A reading of `0 s`
  means the longest probed query finished in under one sampling interval
  (50 ms), not that it was literally instantaneous.
- **The Rivet configs ANALYZE-assume tables are already analysed** — `rivet`
  reads `pg_class.reltuples` / `information_schema.TABLES.AVG_ROW_LENGTH` for
  the chunk-size derivation. On a freshly seeded DB run `ANALYZE` /
  `ANALYZE TABLE` once before the bench for stable chunk counts.
- **Tool versions used in the published numbers** (refresh with
  `<tool> --version` if reproducing later):
  rivet **0.5.3** (numbers in this report were measured on 0.5.3; v0.6.0
  carries the same Postgres extraction path with no perf-affecting changes,
  and the MySQL batch-memory `row_buf` cap shipped in 0.6.0 only reduces
  RSS further on the wide `content_items` fixture — a re-run against 0.6.0
  is planned) · sling (current `brew install slingdata-io/sling/sling`) ·
  dlt (latest pip release in the harness venv) · duckdb (current `brew`) ·
  clickhouse-local (downloaded by `setup.sh`) ·
  **odbc2parquet 11.0.0** (2026-04-15 release; refreshed
  in commit-pending-this-row from 6.0.7. Wall improved ~15% on the
  full suite — see [`reports/REPORT_pg.md`](reports/REPORT_pg.md) — but the
  architectural shape is unchanged: one `SELECT *` per table held for
  ~2 min on `content_items` at ~29 GB peak RSS).

## What's *not* in this bench (yet)

- Real production network conditions (TLS, cross-region latency, packet loss).
  Everything here is localhost / docker compose.
- Parallel exports — every cell runs serially. Rivet's `--parallel-exports`
  story is in a separate benchmark.
- Destinations other than local Parquet/Snappy. S3 / GCS would change the
  output side but not the source-side signals.
- Postgres `pg_stat_statements` per-template breakdown (extension requires a
  restart we didn't want to put on the test instance). The
  `pg_stat_activity` sampler is the closest substitute we have without that.