rivet-cli 0.11.0

Rivet: PostgreSQL/MySQL/SQL Server → Parquet/CSV (local, S3, GCS, Azure). Crate name rivet-cli; binary rivet.
Documentation
<p align="center">
  <img src="docs/assets/rivet_logo.png" alt="Rivet" width="480">
</p>

<p align="center">
  <a href="https://github.com/panchenkoai/rivet/actions/workflows/ci.yml"><img src="https://github.com/panchenkoai/rivet/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://github.com/panchenkoai/rivet/releases/latest"><img src="https://img.shields.io/github/v/release/panchenkoai/rivet?label=release" alt="release"></a>
  <a href="docs/reliability-matrix.md"><img src="https://img.shields.io/badge/coverage-reliability%20matrix-blue" alt="coverage"></a>
  <a href="https://github.com/panchenkoai/rivet/blob/main/LICENSE"><img src="https://img.shields.io/github/license/panchenkoai/rivet" alt="license"></a>
  <a href="https://discord.gg/HT5DZNzNU"><img src="https://img.shields.io/badge/Discord-join%20chat-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
</p>

<p align="center"><strong>Make database extraction boring.</strong></p>

<p align="center">One Rust binary, ~18 MB (speed-optimized). Extracts PostgreSQL, MySQL, and SQL Server to Parquet/CSV — locally, on S3, GCS, or Azure Blob — without holding long queries open on your production database. Resumable, auditable, source-safe.</p>

> Not sure if Rivet fits your problem? [docs/who-is-this-for.md](docs/who-is-this-for.md) is a 60-second fit-check.

![rivet basic workflow — init, doctor, check, run, state](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/basic.gif)

## 30-second quickstart

```bash
brew install panchenkoai/rivet/rivet

export DATABASE_URL="postgresql://user:pass@host/db"
# `orders` is a placeholder — use one of YOUR tables, or omit --table to scan the whole schema
rivet init --source-env DATABASE_URL --table orders -o rivet.yaml
rivet run -c rivet.yaml
```

Output: Parquet files in `./output/`. Full walkthrough: [docs/getting-started.md](docs/getting-started.md). Want to try without your own DB? `docs/pilot/demo-quickstart.md` runs the whole flow against a pre-seeded 14-table fixture in ~10 min.

---

## Why Rivet

Rivet tries to make database extraction boring:

1. **Plan before running** — `rivet plan` seals the extraction intent into a reviewable JSON artifact before any writes happen. Review it like a migration.
2. **Protect the source** — server-side cursor + `FETCH N` on PostgreSQL (longest single query: **0.19s** on a 2M-row table); adaptive PK-range chunking on MySQL (**9s**, vs 137–208s for alternatives). Neither shape holds an open transaction for minutes.
3. **Knows you're behind a pooler** — auto-detects pgBouncer / Odyssey on Postgres, ProxySQL / MaxScale on MySQL, and statement-level multiplexers (`@@SPID` drift) or the Azure SQL gateway on SQL Server. On Postgres it uses `SET LOCAL` inside RAII-guarded transactions so session state never leaks into the pool.
4. **Write in resumable units** — chunk checkpoints, not one giant transaction. The job can crash, the network can blip, the next `rivet run --resume` continues from the last committed chunk.
5. **Record everything** — run journal, file manifest, schema-drift tracker, all in `.rivet_state.db`. Every run is reconstructible. `rivet state` shows exactly what committed and what didn't.
6. **Validate outputs** — quality gates (row count, null ratio, uniqueness via xxHash3), `rivet validate`, `rivet reconcile`, `rivet repair`. Know before your downstream pipeline does.
7. **Notice when the source changes** — column adds/removes/retypes trigger `on_schema_drift: warn|continue|fail` on the next run. Shape drift in TEXT/JSON columns is tracked via byte-width sampling.

The execution contract behind each of these — what is guaranteed, what is at-least-once, what isn't covered — is in [docs/semantics.md](docs/semantics.md).

## Trust contracts

| Question | Where to look |
|---|---|
| What happens if the process is killed mid-export? | [docs/semantics.md § Crash semantics](docs/semantics.md#crash-semantics) |
| What does Rivet *not* guarantee? | [docs/semantics.md § Known non-guarantees](docs/semantics.md#known-non-guarantees) |
| What is actually tested in PR CI vs nightly vs manual? | [docs/reliability-matrix.md](docs/reliability-matrix.md) |
| Which PostgreSQL / MySQL versions are exercised? | [docs/reference/compatibility.md](docs/reference/compatibility.md) |
| How are credentials handled? Where do sensitive artifacts land? | [SECURITY.md](SECURITY.md) |
| What permissions does Rivet need on S3 / GCS / Azure? | [docs/cloud-permissions.md](docs/cloud-permissions.md) |
| How were the benchmark numbers produced — can I rerun them? | [docs/bench/](docs/bench/) |

> **Sensitive local artifacts.** Generated files — `.rivet_state.db`, `plan.json`, `*.journal.jsonl`, and exported Parquet/CSV — may contain query SQL, cursor values, table metadata, and the data itself. Do not commit them. See [SECURITY.md § Sensitive local artifacts](SECURITY.md#sensitive-local-artifacts) for a `.gitignore` snippet.

---

## Source pressure, measured

"Source-safe" is easy to claim and hard to verify, so Rivet publishes a [reproducible cross-tool benchmark harness](docs/bench/) against identical fixtures (22 PG tables / 17 MySQL tables, including a 2M-row × 20-column `content_items` table).

The primary metric is **longest single SQL statement** — the one that decides whether your DBA's `statement_timeout` cuts you off mid-run.

### PostgreSQL — server-side cursor enables sub-second longest query

| Tool | Longest single query | Peak RSS |
|---|---:|---:|
| **rivet** | **0.19s** (`FETCH 142 FROM _rive`) | **443 MB** |
| dlt | 1.20s (`FETCH FORWARD 10000`) — 3.2 GB temp_bytes | 1.4 GB |
| sling | 134s (`SELECT * FROM content_items`) | 6.0 GB |

### MySQL — no server-side cursor; chunked range scans are the fastest available shape

| Tool | Longest single query | Peak RSS |
|---|---:|---:|
| **rivet** | **9s** (chunked + cursor) | **280 MB** |
| sling | 137s | 6.3 GB |
| dlt | 208s | 1.2 GB |

The MySQL gap vs PostgreSQL is architectural: PostgreSQL exposes `DECLARE … CURSOR` / `FETCH N` which lets Rivet issue tiny sub-queries server-side; MySQL's protocol does not have a widely-supported equivalent in the current client stack. See [MySQL parity roadmap](#releases-and-roadmap) for what's planned.

**Failure count across all tables**: rivet 0 / 22 (PG), 0 / 17 (MySQL). At least one other tool in the suite failed at least one table.

How Rivet wins these axes is not magic — it's the deliberately boring extraction shape: PK-auto-resolved chunks, a server-side cursor with a `work_mem`-aware `FETCH N` cap on PG, and an Arrow-memory-budgeted row buffer on MySQL. The «one big `SELECT *` into a giant client-side buffer» shape that most alternatives use produces both the multi-minute single-query holds and the multi-GB RSS.

The numbers above use each tool **at its defaults**. We also published a [**steelman**](docs/bench/reports/REPORT_steelman.md) re-run that gives each competitor its best plausible configuration. Short version: on narrow tables the gap closes; on the wide `content_items` fixture Rivet's edge survives largely intact.

Methodology, exact configs, raw `gtime -v` output, and DB-side counter deltas: [docs/bench/](docs/bench/) — one-command repro.

---

## AI-native DB observability — `rivet-mcp`

`rivet-mcp` is a [Model Context Protocol](https://modelcontextprotocol.io/) server binary that lets an AI agent answer *"is this database healthy enough to extract from right now?"* — before any rows are touched.

Exposed read-only surfaces:

- **PostgreSQL** — `pg_stat_activity` (active queries, lock waits, idle-in-transaction), `pg_stat_statements` top I/O, checkpoint pressure (`pg_stat_bgwriter`), pgBouncer pool saturation and client wait time
- **MySQL** — `SHOW PROCESSLIST` (running queries and duration)

Works out-of-the-box with [Claude Desktop](https://claude.ai/), [Claude Code](https://claude.ai/code), and any MCP-compatible client. Runs as a separate binary — never requires write access to the source database.

```bash
export DATABASE_URL="postgresql://..."
rivet-mcp        # reads DATABASE_URL from the environment
```

Add to your MCP client config:

```json
{
  "mcpServers": {
    "rivet": {
      "command": "rivet-mcp",
      "env": { "DATABASE_URL": "postgresql://..." }
    }
  }
}
```

---

## What Rivet is (and is not)

| What Rivet does | What you bring |
|-----------------|----------------|
| Queries PostgreSQL 12–16 and MySQL 5.7 / 8.0 | The database and credentials |
| Streams rows → Arrow → Parquet or CSV | A destination (local path, S3 bucket, GCS bucket, Azure container) |
| Retries failed batches with exponential backoff | Orchestration (cron, Airflow, dbt, etc.) |
| Validates row counts, null ratios, and uniqueness | Your warehouse or downstream pipeline |
| Checkpoints progress — resume after crashes | Schema management on the warehouse side |
| Protects the source DB — longest single query ~0.2s on PG / ~9s on MySQL on 2M-row tables | — |

Supported destinations: local filesystem, Amazon S3, Google Cloud Storage, Azure Blob Storage, stdout.
Export modes: `full`, `incremental` (cursor-based), `chunked`, `time_window`.
Formats: Parquet (zstd / snappy / gzip / lz4 / none) and CSV.

**Not for you if you need:**
- **CDC / streaming** — Rivet reads a snapshot per run; no replication slot or event log. Use [Debezium](https://debezium.io/) or [Estuary](https://estuary.dev/).
- **Connectors to SaaS sources** — no Salesforce, Stripe, HubSpot, etc. Use [Airbyte](https://airbyte.com/) or [Fivetran](https://www.fivetran.com/).
- **An integrated extract-and-load product** — Rivet stops at "file in a bucket." Use [dlt](https://dlthub.com/) or [Sling](https://slingdata.io/) if you want the warehouse load included.
- **Loading or transformation** — bring dbt, Spark, or your own loader.
- **A Kubernetes data platform** — Rivet runs as a single binary in a `Job` or `CronJob`; a full operator is a different architecture.

**Documentation language:** English-only. See [CONTRIBUTING.md](CONTRIBUTING.md).

## Core workflow

```
rivet init      # scaffold config from a live DB (discovers tables, infers cursors)
rivet doctor    # verify credentials and destination auth before the run
rivet check     # validate config logic, warn about chunking and cursor choices
rivet plan      # seal execution intent — reviewable JSON artifact, no writes yet
rivet run       # execute the plan; checkpoint each chunk
rivet validate  # verify row counts and manifest against the destination
```

Branch commands: `rivet apply` (apply a saved plan), `rivet reconcile` (compare manifest vs destination), `rivet repair` (re-upload orphaned chunks), `rivet state` (inspect progression and checkpoints).

For a first run, `rivet init + rivet run` is enough. The full workflow is for production pipelines where "it ran" is not sufficient — you need a verifiable record of what was written.

## Stateless deployment

By default Rivet keeps cursors, manifests, chunk checkpoints, and the run journal in a SQLite file (`.rivet_state.db`) next to your config — perfect for local and single-node runs. For ephemeral containers / Kubernetes pods, set `RIVET_STATE_URL` to a PostgreSQL connection string and Rivet creates and migrates the state schema on first connect — no manual DDL, no init job. Details: [docs/reference/cli.md § State backend](docs/reference/cli.md#state-backend).

```bash
export RIVET_STATE_URL="postgresql://rivet:secret@state-db.internal/rivet_state?sslmode=require"
rivet run -c rivet.yaml
```

## More walkthroughs

[plan / apply](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/plan-apply.gif) · [plan campaign — multi-export waves](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/plan-campaign.gif) · [reconcile + repair](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/reconcile-repair.gif) · [parallel cards UI](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/parallel-cards.gif) · [composite cursor (COALESCE fallback)](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/coalesce-cursor.gif) · [pool detection](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/pool-detect.gif) · [discovery artifact (`rivet init --discover`)](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/discover-artifact.gif) · [post-run inspect](https://raw.githubusercontent.com/panchenkoai/rivet/main/docs/gifs/inspect.gif). Source scripts in [docs/gifs/](https://github.com/panchenkoai/rivet/tree/main/docs/gifs).

---

## Installation

> **Names.** The project and CLI are **Rivet**; the command is **`rivet`**. On [crates.io](https://crates.io/crates/rivet-cli) the package is published as **`rivet-cli`** because the crate name `rivet` was already taken. Homebrew and release archives install the **`rivet`** binary.

### Homebrew (macOS / Linux) — recommended

```bash
brew install panchenkoai/rivet/rivet
rivet --version
```

### cargo install (crates.io)

Requires Rust 1.94+:

```bash
cargo install rivet-cli
rivet --version
```

### Pre-built binaries

Download the latest release for your platform from [GitHub Releases](https://github.com/panchenkoai/rivet/releases):

```bash
# macOS (Apple Silicon)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (arm64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/
```

```bash
rivet --version
```

**Verify the download** against the published checksums (every release ships `SHA256SUMS.txt`):

```bash
# Download the tarball + SHA256SUMS.txt from the same release, then:
sha256sum -c SHA256SUMS.txt        # Linux
shasum -a 256 -c SHA256SUMS.txt    # macOS
```

### Docker

```bash
docker run --rm ghcr.io/panchenkoai/rivet:latest --version

docker run --rm \
  -e DATABASE_URL="postgresql://user:pass@host.docker.internal:5432/db" \
  -v $(pwd)/examples/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run -c /config/rivet.yaml
```

> From a container, `localhost` is not your machine. Use `host.docker.internal` (Docker Desktop) or `--add-host=host.docker.internal:host-gateway` on Linux. See [Getting Started](docs/getting-started.md) for details.

### Build from source

Requires Rust 1.94+:

```bash
git clone https://github.com/panchenkoai/rivet.git
cd rivet
cargo build --release
# binary is at target/release/rivet
```

---

## Resource-aware extraction

These are production-safety primitives, not performance knobs.

### Memory controls

| Setting | What it controls |
|---------|-----------------|
| `tuning.max_batch_memory_mb` | Hard cap on a single Arrow batch. When exceeded, the `on_batch_memory_exceeded` policy fires. |
| `tuning.on_batch_memory_exceeded` | `warn` (log + continue) · `fail` (abort) · `auto_shrink` (split batch recursively, then continue) |
| `tuning.memory_threshold_mb` | Process-level RSS guard — pauses fetching when RSS exceeds the threshold |
| `tuning.batch_size_memory_mb` | Adaptive batch sizing: Rivet samples the first batch to estimate row width, then adjusts subsequent batch sizes automatically |

### Output controls

| Setting | What it controls |
|---------|-----------------|
| `compression_profile` | `none` / `fast` (Snappy) / `balanced` (Zstd-3) / `compact` (Zstd-9) |
| `parquet.row_group_strategy` | `auto` (schema-based estimate) / `fixed_rows` / `fixed_memory` |
| `parquet.target_row_group_mb` | Target row group size; lower values reduce peak RSS during Parquet writes |

### Quality gates

| Setting | What it controls |
|---------|-----------------|
| `quality.row_count_min` / `row_count_max` | Fail the export if row count is outside this range — fires even when the source returns 0 rows |
| `quality.null_ratio_max` | Fail the export if the null ratio in a column exceeds the threshold |
| `quality.unique_columns` | Track column uniqueness via typed xxHash3-64 hashing |
| `quality.unique_max_entries` | Cap the uniqueness hash set to prevent unbounded memory growth on high-cardinality columns |

### Choosing settings for your environment

| Environment | Recommended starting point |
|-------------|---------------------------|
| Production database (shared) | `profile: safe`, `max_batch_memory_mb: 128`, `on_batch_memory_exceeded: warn` |
| CI / strict pipeline | `max_batch_memory_mb: 128`, `on_batch_memory_exceeded: fail` |
| Low-memory host (1–2 GB) | `profile: safe`, `max_batch_memory_mb: 64`, `on_batch_memory_exceeded: auto_shrink` |
| Read replica / fast backfill | `profile: fast`, `compression_profile: fast` |

See the **[Best Practices guides](docs/best-practices/)** for detailed explanations, trade-off analysis, and worked examples:

- [Resource-aware extraction](docs/best-practices/resource-aware-extraction.md) — memory budgets, policies, RSS formula
- [Parquet tuning](docs/best-practices/parquet-tuning.md) — row group strategies, targets, downstream read implications
- [Compression profiles](docs/best-practices/compression-profiles.md) — profile-to-codec mapping, CPU/size trade-offs
- [Quality checks](docs/best-practices/quality-checks.md) — row count gates, null ratio, uniqueness cap
- [Low-memory runners](docs/best-practices/low-memory-runners.md) — settings for 512 MB–4 GB hosts
- [Recovery and resume](docs/best-practices/recovery-and-resume.md) — `--resume` semantics, crash recovery

---

## Documentation

| Topic | Link |
|-------|------|
| All docs (index) | [docs/README.md](docs/README.md) |
| **First run — install, connect, export** | [docs/getting-started.md](docs/getting-started.md) |
| **Concepts glossary** (`run_id`, `cursor`, `chunk`, `manifest`, `journal`, `progression`) | [docs/concepts.md](docs/concepts.md) |
| **Pilot guide** — full flow on your own database, production-ready | [docs/pilot/README.md](docs/pilot/README.md) |
| **Execution semantics** (crash / retry / resume contract) | [docs/semantics.md](docs/semantics.md) |
| **Reliability matrix** (what's in PR CI / nightly / manual) | [docs/reliability-matrix.md](docs/reliability-matrix.md) |
| **Security policy** (credentials, sensitive artifacts, disclosure) | [SECURITY.md](SECURITY.md) |
| **Cloud permissions** (least-privilege IAM / RBAC / SAS per command) | [docs/cloud-permissions.md](docs/cloud-permissions.md) |
| **Cross-tool benchmark harness** | [docs/bench/](docs/bench/) |
| Export modes (`full`, `incremental`, `chunked`, `time_window`) | [docs/modes/](docs/modes/) |
| Destinations (local, S3, GCS, Azure, stdout) | [docs/destinations/](docs/destinations/) |
| Config YAML reference | [docs/reference/config.md](docs/reference/config.md) |
| CLI commands and flags | [docs/reference/cli.md](docs/reference/cli.md) |
| Tuning profiles | [docs/reference/tuning.md](docs/reference/tuning.md) |
| Scaffold config from a live DB (`rivet init`) | [docs/reference/init.md](docs/reference/init.md) |
| Pipeline, traits, memory model, source layout | [docs/architecture.md](docs/architecture.md) |
| Demo on a pre-seeded 14-table fixture (~10 min) | [docs/pilot/demo-quickstart.md](docs/pilot/demo-quickstart.md) |
| Pilot walkthrough — discovery → reconcile → repair | [docs/pilot/pilot-walkthrough.md](docs/pilot/pilot-walkthrough.md) |
| Production checklist | [docs/pilot/production-checklist.md](docs/pilot/production-checklist.md) |
| Operator recipes (resume, idempotent load) | [docs/recipes/](docs/recipes/) |
| Architecture decision records | [docs/adr/](docs/adr/) |
| Contributing, tests, CI | [CONTRIBUTING.md](CONTRIBUTING.md) |

---

## Releases and roadmap

- **Latest release and version history:** [CHANGELOG.md](CHANGELOG.md).
- **Strategy and execution tracker:** [rivet_roadmap.md](rivet_roadmap.md) — the single source of truth for what is shipped and what is open.
- **Questions, issues, feature requests:** [GitHub Issues](https://github.com/panchenkoai/rivet/issues).