rivet-cli 0.10.0

Rivet: PostgreSQL/MySQL/SQL Server → Parquet/CSV (local, S3, GCS, Azure). Crate name rivet-cli; binary rivet.
Documentation

Not sure if Rivet fits your problem? docs/who-is-this-for.md is a 60-second fit-check.

rivet basic workflow — init, doctor, check, run, state

30-second quickstart

brew install panchenkoai/rivet/rivet

export DATABASE_URL="postgresql://user:pass@host/db"
rivet init --source-env DATABASE_URL --table orders -o rivet.yaml
rivet run -c rivet.yaml

Output: Parquet files in ./output/. Full walkthrough: docs/getting-started.md. Want to try without your own DB? docs/pilot/demo-quickstart.md runs the whole flow against a pre-seeded 14-table fixture in ~10 min.


Why Rivet

Rivet tries to make database extraction boring:

  1. Plan before runningrivet plan seals the extraction intent into a reviewable JSON artifact before any writes happen. Review it like a migration.
  2. Protect the source — server-side cursor + FETCH N on PostgreSQL (longest single query: 0.19s on a 2M-row table); adaptive PK-range chunking on MySQL (9s, vs 137–208s for alternatives). Neither shape holds an open transaction for minutes.
  3. Knows you're behind a pooler — auto-detects pgBouncer / Odyssey on Postgres, ProxySQL / MaxScale on MySQL, and statement-level multiplexers (@@SPID drift) or the Azure SQL gateway on SQL Server. On Postgres it uses SET LOCAL inside RAII-guarded transactions so session state never leaks into the pool.
  4. Write in resumable units — chunk checkpoints, not one giant transaction. The job can crash, the network can blip, the next rivet run --resume continues from the last committed chunk.
  5. Record everything — run journal, file manifest, schema-drift tracker, all in .rivet_state.db. Every run is reconstructible. rivet state shows exactly what committed and what didn't.
  6. Validate outputs — quality gates (row count, null ratio, uniqueness via xxHash3), rivet validate, rivet reconcile, rivet repair. Know before your downstream pipeline does.
  7. Notice when the source changes — column adds/removes/retypes trigger on_schema_drift: warn|continue|fail on the next run. Shape drift in TEXT/JSON columns is tracked via byte-width sampling.

The execution contract behind each of these — what is guaranteed, what is at-least-once, what isn't covered — is in docs/semantics.md.

Trust contracts

Question Where to look
What happens if the process is killed mid-export? docs/semantics.md § Crash semantics
What does Rivet not guarantee? docs/semantics.md § Known non-guarantees
What is actually tested in PR CI vs nightly vs manual? docs/reliability-matrix.md
Which PostgreSQL / MySQL versions are exercised? docs/reference/compatibility.md
How are credentials handled? Where do sensitive artifacts land? SECURITY.md
What permissions does Rivet need on S3 / GCS / Azure? docs/cloud-permissions.md
How were the benchmark numbers produced — can I rerun them? docs/bench/

Sensitive local artifacts. Generated files — .rivet_state.db, plan.json, *.journal.jsonl, and exported Parquet/CSV — may contain query SQL, cursor values, table metadata, and the data itself. Do not commit them. See SECURITY.md § Sensitive local artifacts for a .gitignore snippet.


Source pressure, measured

"Source-safe" is easy to claim and hard to verify, so Rivet publishes a reproducible cross-tool benchmark harness against identical fixtures (22 PG tables / 17 MySQL tables, including a 2M-row × 20-column content_items table).

The primary metric is longest single SQL statement — the one that decides whether your DBA's statement_timeout cuts you off mid-run.

PostgreSQL — server-side cursor enables sub-second longest query

Tool Longest single query Peak RSS
rivet 0.19s (FETCH 142 FROM _rive) 443 MB
dlt 1.20s (FETCH FORWARD 10000) — 3.2 GB temp_bytes 1.4 GB
sling 134s (SELECT * FROM content_items) 6.0 GB

MySQL — no server-side cursor; chunked range scans are the fastest available shape

Tool Longest single query Peak RSS
rivet 9s (chunked + cursor) 280 MB
sling 137s 6.3 GB
dlt 208s 1.2 GB

The MySQL gap vs PostgreSQL is architectural: PostgreSQL exposes DECLARE … CURSOR / FETCH N which lets Rivet issue tiny sub-queries server-side; MySQL's protocol does not have a widely-supported equivalent in the current client stack. See MySQL parity roadmap for what's planned.

Failure count across all tables: rivet 0 / 22 (PG), 0 / 17 (MySQL). At least one other tool in the suite failed at least one table.

How Rivet wins these axes is not magic — it's the deliberately boring extraction shape: PK-auto-resolved chunks, a server-side cursor with a work_mem-aware FETCH N cap on PG, and an Arrow-memory-budgeted row buffer on MySQL. The «one big SELECT * into a giant client-side buffer» shape that most alternatives use produces both the multi-minute single-query holds and the multi-GB RSS.

The numbers above use each tool at its defaults. We also published a steelman re-run that gives each competitor its best plausible configuration. Short version: on narrow tables the gap closes; on the wide content_items fixture Rivet's edge survives largely intact.

Methodology, exact configs, raw gtime -v output, and DB-side counter deltas: docs/bench/ — one-command repro.


AI-native DB observability — rivet-mcp

rivet-mcp is a Model Context Protocol server binary that lets an AI agent answer "is this database healthy enough to extract from right now?" — before any rows are touched.

Exposed read-only surfaces:

  • PostgreSQLpg_stat_activity (active queries, lock waits, idle-in-transaction), pg_stat_statements top I/O, checkpoint pressure (pg_stat_bgwriter), pgBouncer pool saturation and client wait time
  • MySQLSHOW PROCESSLIST (running queries and duration)

Works out-of-the-box with Claude Desktop, Claude Code, and any MCP-compatible client. Runs as a separate binary — never requires write access to the source database.

export DATABASE_URL="postgresql://..."
rivet-mcp        # reads DATABASE_URL from the environment

Add to your MCP client config:

{
  "mcpServers": {
    "rivet": {
      "command": "rivet-mcp",
      "env": { "DATABASE_URL": "postgresql://..." }
    }
  }
}

What Rivet is (and is not)

What Rivet does What you bring
Queries PostgreSQL 12–16 and MySQL 5.7 / 8.0 The database and credentials
Streams rows → Arrow → Parquet or CSV A destination (local path, S3 bucket, GCS bucket, Azure container)
Retries failed batches with exponential backoff Orchestration (cron, Airflow, dbt, etc.)
Validates row counts, null ratios, and uniqueness Your warehouse or downstream pipeline
Checkpoints progress — resume after crashes Schema management on the warehouse side
Protects the source DB — longest single query ~0.2s on PG / ~9s on MySQL on 2M-row tables

Supported destinations: local filesystem, Amazon S3, Google Cloud Storage, Azure Blob Storage, stdout. Export modes: full, incremental (cursor-based), chunked, time_window. Formats: Parquet (zstd / snappy / gzip / lz4 / none) and CSV.

Not for you if you need:

  • CDC / streaming — Rivet reads a snapshot per run; no replication slot or event log. Use Debezium or Estuary.
  • Connectors to SaaS sources — no Salesforce, Stripe, HubSpot, etc. Use Airbyte or Fivetran.
  • An integrated extract-and-load product — Rivet stops at "file in a bucket." Use dlt or Sling if you want the warehouse load included.
  • Loading or transformation — bring dbt, Spark, or your own loader.
  • A Kubernetes data platform — Rivet runs as a single binary in a Job or CronJob; a full operator is a different architecture.

Documentation language: English-only. See CONTRIBUTING.md.

Core workflow

rivet init      # scaffold config from a live DB (discovers tables, infers cursors)
rivet doctor    # verify credentials and destination auth before the run
rivet check     # validate config logic, warn about chunking and cursor choices
rivet plan      # seal execution intent — reviewable JSON artifact, no writes yet
rivet run       # execute the plan; checkpoint each chunk
rivet validate  # verify row counts and manifest against the destination

Branch commands: rivet apply (apply a saved plan), rivet reconcile (compare manifest vs destination), rivet repair (re-upload orphaned chunks), rivet state (inspect progression and checkpoints).

For a first run, rivet init + rivet run is enough. The full workflow is for production pipelines where "it ran" is not sufficient — you need a verifiable record of what was written.

Stateless deployment

By default Rivet keeps cursors, manifests, chunk checkpoints, and the run journal in a SQLite file (.rivet_state.db) next to your config — perfect for local and single-node runs. For ephemeral containers / Kubernetes pods, set RIVET_STATE_URL to a PostgreSQL connection string and Rivet creates and migrates the state schema on first connect — no manual DDL, no init job. Details: docs/reference/cli.md § State backend.

export RIVET_STATE_URL="postgresql://rivet:secret@state-db.internal/rivet_state?sslmode=require"
rivet run -c rivet.yaml

More walkthroughs

plan / apply · plan campaign — multi-export waves · reconcile + repair · parallel cards UI · composite cursor (COALESCE fallback) · pool detection · discovery artifact (rivet init --discover) · post-run inspect. Source scripts in docs/gifs/.


Installation

Names. The project and CLI are Rivet; the command is rivet. On crates.io the package is published as rivet-cli because the crate name rivet was already taken. Homebrew and release archives install the rivet binary.

Homebrew (macOS / Linux) — recommended

brew install panchenkoai/rivet/rivet
rivet --version

cargo install (crates.io)

Requires Rust 1.94+:

cargo install rivet-cli
rivet --version

Pre-built binaries

Download the latest release for your platform from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (arm64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/
rivet --version

Verify the download against the published checksums (every release ships SHA256SUMS.txt):

# Download the tarball + SHA256SUMS.txt from the same release, then:
sha256sum -c SHA256SUMS.txt        # Linux
shasum -a 256 -c SHA256SUMS.txt    # macOS

Docker

docker run --rm ghcr.io/panchenkoai/rivet:latest --version

docker run --rm \
  -e DATABASE_URL="postgresql://user:pass@host.docker.internal:5432/db" \
  -v $(pwd)/examples/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml

From a container, localhost is not your machine. Use host.docker.internal (Docker Desktop) or --add-host=host.docker.internal:host-gateway on Linux. See Getting Started for details.

Build from source

Requires Rust 1.94+:

git clone https://github.com/panchenkoai/rivet.git
cd rivet
cargo build --release
# binary is at target/release/rivet

Resource-aware extraction

These are production-safety primitives, not performance knobs.

Memory controls

Setting What it controls
tuning.max_batch_memory_mb Hard cap on a single Arrow batch. When exceeded, the on_batch_memory_exceeded policy fires.
tuning.on_batch_memory_exceeded warn (log + continue) · fail (abort) · auto_shrink (split batch recursively, then continue)
tuning.memory_threshold_mb Process-level RSS guard — pauses fetching when RSS exceeds the threshold
tuning.batch_size_memory_mb Adaptive batch sizing: Rivet samples the first batch to estimate row width, then adjusts subsequent batch sizes automatically

Output controls

Setting What it controls
compression_profile none / fast (Snappy) / balanced (Zstd-3) / compact (Zstd-9)
parquet.row_group_strategy auto (schema-based estimate) / fixed_rows / fixed_memory
parquet.target_row_group_mb Target row group size; lower values reduce peak RSS during Parquet writes

Quality gates

Setting What it controls
quality.row_count_min / row_count_max Fail the export if row count is outside this range — fires even when the source returns 0 rows
quality.null_ratio_max Fail the export if the null ratio in a column exceeds the threshold
quality.unique_columns Track column uniqueness via typed xxHash3-64 hashing
quality.unique_max_entries Cap the uniqueness hash set to prevent unbounded memory growth on high-cardinality columns

Choosing settings for your environment

Environment Recommended starting point
Production database (shared) profile: safe, max_batch_memory_mb: 128, on_batch_memory_exceeded: warn
CI / strict pipeline max_batch_memory_mb: 128, on_batch_memory_exceeded: fail
Low-memory host (1–2 GB) profile: safe, max_batch_memory_mb: 64, on_batch_memory_exceeded: auto_shrink
Read replica / fast backfill profile: fast, compression_profile: fast

See the Best Practices guides for detailed explanations, trade-off analysis, and worked examples:


Documentation

Topic Link
All docs (index) docs/README.md
First run — install, connect, export docs/getting-started.md
Concepts glossary (run_id, cursor, chunk, manifest, journal, progression) docs/concepts.md
Pilot guide — full flow on your own database, production-ready docs/pilot/README.md
Execution semantics (crash / retry / resume contract) docs/semantics.md
Reliability matrix (what's in PR CI / nightly / manual) docs/reliability-matrix.md
Security policy (credentials, sensitive artifacts, disclosure) SECURITY.md
Cloud permissions (least-privilege IAM / RBAC / SAS per command) docs/cloud-permissions.md
Cross-tool benchmark harness docs/bench/
Export modes (full, incremental, chunked, time_window) docs/modes/
Destinations (local, S3, GCS, Azure, stdout) docs/destinations/
Config YAML reference docs/reference/config.md
CLI commands and flags docs/reference/cli.md
Tuning profiles docs/reference/tuning.md
Scaffold config from a live DB (rivet init) docs/reference/init.md
Pipeline, traits, memory model, source layout docs/architecture.md
Demo on a pre-seeded 14-table fixture (~10 min) docs/pilot/demo-quickstart.md
Pilot walkthrough — discovery → reconcile → repair docs/pilot/pilot-walkthrough.md
Production checklist docs/pilot/production-checklist.md
Operator recipes (resume, idempotent load) docs/recipes/
Architecture decision records docs/adr/
Contributing, tests, CI CONTRIBUTING.md

Releases and roadmap

  • Latest release and version history: CHANGELOG.md.
  • Strategy and execution tracker: rivet_roadmap.md — the single source of truth for what is shipped and what is open.
  • Questions, issues, feature requests: GitHub Issues.