Not sure if Rivet fits your problem? docs/who-is-this-for.md is a 60-second fit-check.

30-second quickstart
Output: Parquet files in ./output/. Full walkthrough: docs/getting-started.md. Want to try without your own DB? docs/pilot/demo-quickstart.md runs the whole flow against a pre-seeded 14-table fixture in ~10 min.
Why Rivet
Rivet tries to make database extraction boring:
- Plan before running —
rivet planseals the extraction intent into a reviewable JSON artifact before any writes happen. Review it like a migration. - Protect the source — server-side cursor +
FETCH Non PostgreSQL (longest single query: 0.19s on a 2M-row table); adaptive PK-range chunking on MySQL (9s, vs 137–208s for alternatives). Neither shape holds an open transaction for minutes. - Knows you're behind a pooler — auto-detects pgBouncer / Odyssey on Postgres and ProxySQL / MaxScale on MySQL. Uses
SET LOCALinside RAII-guarded transactions so session state never leaks into the pool. - Write in resumable units — chunk checkpoints, not one giant transaction. The job can crash, the network can blip, the next
rivet run --resumecontinues from the last committed chunk. - Record everything — run journal, file manifest, schema-drift tracker, all in
.rivet_state.db. Every run is reconstructible.rivet stateshows exactly what committed and what didn't. - Validate outputs — quality gates (row count, null ratio, uniqueness via xxHash3),
rivet validate,rivet reconcile,rivet repair. Know before your downstream pipeline does. - Notice when the source changes — column adds/removes/retypes trigger
on_schema_drift: warn|continue|failon the next run. Shape drift in TEXT/JSON columns is tracked via byte-width sampling.
The execution contract behind each of these — what is guaranteed, what is at-least-once, what isn't covered — is in docs/semantics.md.
Trust contracts
| Question | Where to look |
|---|---|
| What happens if the process is killed mid-export? | docs/semantics.md § Crash semantics |
| What does Rivet not guarantee? | docs/semantics.md § Known non-guarantees |
| What is actually tested in PR CI vs nightly vs manual? | docs/reliability-matrix.md |
| Which PostgreSQL / MySQL versions are exercised? | docs/reference/compatibility.md |
| How are credentials handled? Where do sensitive artifacts land? | SECURITY.md |
| What permissions does Rivet need on S3 / GCS / Azure? | docs/cloud-permissions.md |
| How were the benchmark numbers produced — can I rerun them? | docs/bench/ |
Sensitive local artifacts. Generated files —
.rivet_state.db,plan.json,*.journal.jsonl, and exported Parquet/CSV — may contain query SQL, cursor values, table metadata, and the data itself. Do not commit them. See SECURITY.md § Sensitive local artifacts for a.gitignoresnippet.
Source pressure, measured
"Source-safe" is easy to claim and hard to verify, so Rivet publishes a reproducible cross-tool benchmark harness against identical fixtures (22 PG tables / 17 MySQL tables, including a 2M-row × 20-column content_items table).
The primary metric is longest single SQL statement — the one that decides whether your DBA's statement_timeout cuts you off mid-run.
PostgreSQL — server-side cursor enables sub-second longest query
| Tool | Longest single query | Peak RSS |
|---|---|---|
| rivet | 0.19s (FETCH 142 FROM _rive) |
443 MB |
| dlt | 1.20s (FETCH FORWARD 10000) — 3.2 GB temp_bytes |
1.4 GB |
| sling | 134s (SELECT * FROM content_items) |
6.0 GB |
MySQL — no server-side cursor; chunked range scans are the fastest available shape
| Tool | Longest single query | Peak RSS |
|---|---|---|
| rivet | 9s (chunked + cursor) | 280 MB |
| sling | 137s | 6.3 GB |
| dlt | 208s | 1.2 GB |
The MySQL gap vs PostgreSQL is architectural: PostgreSQL exposes DECLARE … CURSOR / FETCH N which lets Rivet issue tiny sub-queries server-side; MySQL's protocol does not have a widely-supported equivalent in the current client stack. See MySQL parity roadmap for what's planned.
Failure count across all tables: rivet 0 / 22 (PG), 0 / 17 (MySQL). At least one other tool in the suite failed at least one table.
How Rivet wins these axes is not magic — it's the deliberately boring extraction shape: PK-auto-resolved chunks, a server-side cursor with a work_mem-aware FETCH N cap on PG, and an Arrow-memory-budgeted row buffer on MySQL. The «one big SELECT * into a giant client-side buffer» shape that most alternatives use produces both the multi-minute single-query holds and the multi-GB RSS.
The numbers above use each tool at its defaults. We also published a steelman re-run that gives each competitor its best plausible configuration. Short version: on narrow tables the gap closes; on the wide content_items fixture Rivet's edge survives largely intact.
Methodology, exact configs, raw gtime -v output, and DB-side counter deltas: docs/bench/ — one-command repro.
AI-native DB observability — rivet-mcp
rivet-mcp is a Model Context Protocol server binary that lets an AI agent answer "is this database healthy enough to extract from right now?" — before any rows are touched.
Exposed read-only surfaces:
- PostgreSQL —
pg_stat_activity(active queries, lock waits, idle-in-transaction),pg_stat_statementstop I/O, checkpoint pressure (pg_stat_bgwriter), pgBouncer pool saturation and client wait time - MySQL —
SHOW PROCESSLIST(running queries and duration)
Works out-of-the-box with Claude Desktop, Claude Code, and any MCP-compatible client. Runs as a separate binary — never requires write access to the source database.
Add to your MCP client config:
What Rivet is (and is not)
| What Rivet does | What you bring |
|---|---|
| Queries PostgreSQL 12–16 and MySQL 5.7 / 8.0 | The database and credentials |
| Streams rows → Arrow → Parquet or CSV | A destination (local path, S3 bucket, GCS bucket, Azure container) |
| Retries failed batches with exponential backoff | Orchestration (cron, Airflow, dbt, etc.) |
| Validates row counts, null ratios, and uniqueness | Your warehouse or downstream pipeline |
| Checkpoints progress — resume after crashes | Schema management on the warehouse side |
| Protects the source DB — longest single query ~0.2s on PG / ~9s on MySQL on 2M-row tables | — |
Supported destinations: local filesystem, Amazon S3, Google Cloud Storage, Azure Blob Storage, stdout.
Export modes: full, incremental (cursor-based), chunked, time_window.
Formats: Parquet (zstd / snappy / gzip / lz4 / none) and CSV.
Not for you if you need:
- CDC / streaming — Rivet reads a snapshot per run; no replication slot or event log. Use Debezium or Estuary.
- Connectors to SaaS sources — no Salesforce, Stripe, HubSpot, etc. Use Airbyte or Fivetran.
- An integrated extract-and-load product — Rivet stops at "file in a bucket." Use dlt or Sling if you want the warehouse load included.
- Loading or transformation — bring dbt, Spark, or your own loader.
- A Kubernetes data platform — Rivet runs as a single binary in a
JoborCronJob; a full operator is a different architecture.
Documentation language: English-only. See CONTRIBUTING.md.
Core workflow
rivet init # scaffold config from a live DB (discovers tables, infers cursors)
rivet doctor # verify credentials and destination auth before the run
rivet check # validate config logic, warn about chunking and cursor choices
rivet plan # seal execution intent — reviewable JSON artifact, no writes yet
rivet run # execute the plan; checkpoint each chunk
rivet validate # verify row counts and manifest against the destination
Branch commands: rivet apply (apply a saved plan), rivet reconcile (compare manifest vs destination), rivet repair (re-upload orphaned chunks), rivet state (inspect progression and checkpoints).
For a first run, rivet init + rivet run is enough. The full workflow is for production pipelines where "it ran" is not sufficient — you need a verifiable record of what was written.
Stateless deployment
By default Rivet keeps cursors, manifests, chunk checkpoints, and the run journal in a SQLite file (.rivet_state.db) next to your config — perfect for local and single-node runs. For ephemeral containers / Kubernetes pods, set RIVET_STATE_URL to a PostgreSQL connection string and Rivet creates and migrates the state schema on first connect — no manual DDL, no init job. Details: docs/reference/cli.md § State backend.
More walkthroughs
plan / apply · plan campaign — multi-export waves · reconcile + repair · parallel cards UI · composite cursor (COALESCE fallback) · pool detection · discovery artifact (rivet init --discover) · post-run inspect. Source scripts in docs/gifs/.
Installation
Names. The project and CLI are Rivet; the command is
rivet. On crates.io the package is published asrivet-clibecause the crate namerivetwas already taken. Homebrew and release archives install therivetbinary.
Homebrew (macOS / Linux) — recommended
cargo install (crates.io)
Requires Rust 1.94+:
Pre-built binaries
Download the latest release for your platform from GitHub Releases:
# macOS (Apple Silicon)
|
# macOS (Intel)
|
# Linux (x86_64)
|
# Linux (arm64)
|
Docker
From a container,
localhostis not your machine. Usehost.docker.internal(Docker Desktop) or--add-host=host.docker.internal:host-gatewayon Linux. See Getting Started for details.
Build from source
Requires Rust 1.94+:
# binary is at target/release/rivet
Resource-aware extraction
These are production-safety primitives, not performance knobs.
Memory controls
| Setting | What it controls |
|---|---|
tuning.max_batch_memory_mb |
Hard cap on a single Arrow batch. When exceeded, the on_batch_memory_exceeded policy fires. |
tuning.on_batch_memory_exceeded |
warn (log + continue) · fail (abort) · auto_shrink (split batch recursively, then continue) |
tuning.memory_threshold_mb |
Process-level RSS guard — pauses fetching when RSS exceeds the threshold |
tuning.batch_size_memory_mb |
Adaptive batch sizing: Rivet samples the first batch to estimate row width, then adjusts subsequent batch sizes automatically |
Output controls
| Setting | What it controls |
|---|---|
compression_profile |
none / fast (Snappy) / balanced (Zstd-3) / compact (Zstd-9) |
parquet.row_group_strategy |
auto (schema-based estimate) / fixed_rows / fixed_memory |
parquet.target_row_group_mb |
Target row group size; lower values reduce peak RSS during Parquet writes |
Quality gates
| Setting | What it controls |
|---|---|
quality.row_count_min / row_count_max |
Fail the export if row count is outside this range — fires even when the source returns 0 rows |
quality.null_ratio_max |
Fail the export if the null ratio in a column exceeds the threshold |
quality.unique_columns |
Track column uniqueness via typed xxHash3-64 hashing |
quality.unique_max_entries |
Cap the uniqueness hash set to prevent unbounded memory growth on high-cardinality columns |
Choosing settings for your environment
| Environment | Recommended starting point |
|---|---|
| Production database (shared) | profile: safe, max_batch_memory_mb: 128, on_batch_memory_exceeded: warn |
| CI / strict pipeline | max_batch_memory_mb: 128, on_batch_memory_exceeded: fail |
| Low-memory host (1–2 GB) | profile: safe, max_batch_memory_mb: 64, on_batch_memory_exceeded: auto_shrink |
| Read replica / fast backfill | profile: fast, compression_profile: fast |
See the Best Practices guides for detailed explanations, trade-off analysis, and worked examples:
- Resource-aware extraction — memory budgets, policies, RSS formula
- Parquet tuning — row group strategies, targets, downstream read implications
- Compression profiles — profile-to-codec mapping, CPU/size trade-offs
- Quality checks — row count gates, null ratio, uniqueness cap
- Low-memory runners — settings for 512 MB–4 GB hosts
- Recovery and resume —
--resumesemantics, crash recovery
Documentation
| Topic | Link |
|---|---|
| All docs (index) | docs/README.md |
| First run — install, connect, export | docs/getting-started.md |
Concepts glossary (run_id, cursor, chunk, manifest, journal, progression) |
docs/concepts.md |
| Pilot guide — full flow on your own database, production-ready | docs/pilot/README.md |
| Execution semantics (crash / retry / resume contract) | docs/semantics.md |
| Reliability matrix (what's in PR CI / nightly / manual) | docs/reliability-matrix.md |
| Security policy (credentials, sensitive artifacts, disclosure) | SECURITY.md |
| Cloud permissions (least-privilege IAM / RBAC / SAS per command) | docs/cloud-permissions.md |
| Cross-tool benchmark harness | docs/bench/ |
Export modes (full, incremental, chunked, time_window) |
docs/modes/ |
| Destinations (local, S3, GCS, Azure, stdout) | docs/destinations/ |
| Config YAML reference | docs/reference/config.md |
| CLI commands and flags | docs/reference/cli.md |
| Tuning profiles | docs/reference/tuning.md |
Scaffold config from a live DB (rivet init) |
docs/reference/init.md |
| Pipeline, traits, memory model, source layout | docs/architecture.md |
| Demo on a pre-seeded 14-table fixture (~10 min) | docs/pilot/demo-quickstart.md |
| Pilot walkthrough — discovery → reconcile → repair | docs/pilot/pilot-walkthrough.md |
| Production checklist | docs/pilot/production-checklist.md |
| Operator recipes (resume, idempotent load) | docs/recipes/ |
| Architecture decision records | docs/adr/ |
| Contributing, tests, CI | CONTRIBUTING.md |
Releases and roadmap
- Latest release and version history: CHANGELOG.md.
- Strategy and execution tracker: rivet_roadmap.md — the single source of truth for what is shipped and what is open.
- Questions, issues, feature requests: GitHub Issues.