Rivet

Lightweight, source-safe data extraction from PostgreSQL and MySQL to Parquet/CSV.

Rivet is a CLI tool that exports query results from relational databases to files, either locally or in cloud storage (S3, GCS). It is extract-only: no loading, no merging, no CDC. It is designed to be gentle on production databases through tuning profiles, preflight health checks, and error-classified retries with exponential backoff.

What Rivet does

Extracts data from PostgreSQL and MySQL via standard SQL queries
Writes Parquet (zstd-compressed by default; snappy, gzip, lz4, none) or CSV files
Writes output to local disk, Amazon S3, Google Cloud Storage, or stdout (pipe workflows)
Tracks incremental state in SQLite so the next run picks up where the last left off
Diagnoses source health before extraction (rivet check)
Verifies auth for all sources and destinations before running (rivet doctor)
Prints a structured run summary after each export (run ID, rows, files, bytes, duration, RSS, retries, schema changes)
Persists metrics history, schema tracking, and file manifest in SQLite
Recommends parallelism level and tuning profile in preflight checks
Supports parameterized queries via --param key=value and ${key} placeholders
Runs data quality checks: row count bounds, null ratio thresholds, uniqueness assertions
Splits output by file size: max_file_size: 512MB automatically splits output into parts
Sizes batches by memory: batch_size_memory_mb: 256 auto-tunes batch size from schema width
Sends Slack notifications on failure, schema change, or degraded verdict

What Rivet does NOT do

No loading/merging -- it produces files; you bring them into a warehouse yourself
No CDC -- no WAL/binlog reading; query-based extraction only
No orchestration -- no built-in scheduler; use cron, Airflow, or similar
No exactly-once delivery -- at-least-once; duplicates are possible (see Execution Semantics)
No web UI / API -- CLI and YAML config only

Documentation language: English-only. See CONTRIBUTING.md.

New to Rivet? Start with the Pilot Documentation — step-by-step guides for every export mode, destination, and YAML parameter, plus quickstart templates for your first export.

Installation

Homebrew (macOS / Linux) — recommended

brew tap panchenkoai/rivet
brew update
brew install rivet
rivet --version

Pre-built binaries

Download the latest release for your platform from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (arm64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

Verify:

rivet --version

Docker

Try Rivet without installing anything — mount your config and output directory:

docker run --rm \
  -v "$(pwd)/rivet.yaml:/config/rivet.yaml" \
  -v "$(pwd)/output:/output" \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml

Or check the version and explore commands:

docker run --rm ghcr.io/panchenkoai/rivet:latest --version
docker run --rm ghcr.io/panchenkoai/rivet:latest --help

Pass environment variables for credentials:

docker run --rm \
  -e DATABASE_URL="postgresql://user:pass@host:5432/db" \
  -v "$(pwd)/rivet.yaml:/config/rivet.yaml" \
  -v "$(pwd)/output:/output" \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml

Note: To connect to a database running on your host machine, use host.docker.internal instead of localhost in the connection URL.

Build from source

Requires Rust 1.94+:

git clone https://github.com/panchenkoai/rivet.git
cd rivet
cargo build --release
# binary is at target/release/rivet

Quick Start

Create a config file rivet.yaml:

source:
  type: postgres
  url: "postgresql://user:pass@localhost:5432/mydb"
  tuning:
    profile: safe

exports:
  - name: users
    query: "SELECT id, name, email, updated_at FROM users"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    destination:
      type: local
      path: ./output

Run preflight check to diagnose source health:

rivet check --config rivet.yaml

Verify auth for source and all destinations:

rivet doctor --config rivet.yaml

Run the export:

RUST_LOG=info rivet run --config rivet.yaml

Check state:

rivet state show --config rivet.yaml

Working with the binary

Once installed, rivet is a single self-contained binary with no runtime dependencies (no JVM, no Python, no Docker required).

Typical workflow:

# 1. Preflight: check that the source DB is reachable and healthy
rivet check --config rivet.yaml

# 2. Auth: verify credentials for source + all destinations (S3, GCS, etc.)
rivet doctor --config rivet.yaml

# 3. Export: run all exports defined in the config
RUST_LOG=info rivet run --config rivet.yaml

# 4. Inspect: view cursor state and file manifest
rivet state show --config rivet.yaml
rivet state files --config rivet.yaml

# 5. Re-run: only new/changed rows are exported (incremental mode)
RUST_LOG=info rivet run --config rivet.yaml

Useful flags:

rivet run --config rivet.yaml --export users      # run a single export
rivet run --config rivet.yaml --validate           # reconcile row counts after write
rivet run --config rivet.yaml --param env=prod     # parameterized queries
rivet state reset --config rivet.yaml --export users  # reset cursor to re-export from scratch

Logging:

Rivet uses RUST_LOG for verbosity:

RUST_LOG=debug rivet run --config rivet.yaml    # verbose (SQL, batch timings, retries)
RUST_LOG=info  rivet run --config rivet.yaml    # normal (progress, summary)
RUST_LOG=warn  rivet run --config rivet.yaml    # quiet (errors and warnings only)

Shell completions:

# Bash
rivet completions bash > ~/.local/share/bash-completion/completions/rivet

# Zsh
rivet completions zsh > ~/.zfunc/_rivet

# Fish
rivet completions fish > ~/.config/fish/completions/rivet.fish

CLI Reference

rivet run --config <path>                          # run all exports
rivet run --config <path> --export <name>          # run a specific export
rivet run --config <path> --validate               # verify row counts after write
rivet check --config <path>                        # preflight check all exports
rivet check --config <path> --export <name>        # preflight check one export
rivet doctor --config <path>                       # verify source + destination auth
rivet state show --config <path>                   # show cursor state
rivet state reset --config <path> --export <name>  # reset cursor
rivet state files --config <path>                  # show file manifest (which run created which files)
rivet metrics --config <path>                      # show export run history
rivet metrics --config <path> --export <name>      # metrics for one export
rivet metrics --config <path> --last N             # last N runs (default 20)
rivet completions <shell>                          # generate shell completions (bash|zsh|fish|powershell)

Shell completions:

# zsh (requires ~/.zfunc on your fpath; set it in ~/.zshrc)
rivet completions zsh > ~/.zfunc/_rivet

# bash
rivet completions bash | sudo tee /etc/bash_completion.d/rivet > /dev/null

# fish
rivet completions fish > ~/.config/fish/completions/rivet.fish

Set RUST_LOG=info (or debug) for detailed logging:

RUST_LOG=info rivet run --config rivet.yaml

Choosing a Mode

Mode	Best for	Key behavior
`full`	Small tables, snapshots, one-off exports	Exports entire query result every run
`incremental`	Append-only or update-tracked tables	Resumes from the last exported value of `cursor_column`
`chunked`	Very large tables (10M+ rows)	Splits into ID-range windows; supports `parallel > 1` for concurrent extraction
`time_window`	Event logs, append-mostly data with timestamps	Exports only the last N days based on a time/date column

Decision rules:

Table < 1M rows, full snapshot needed -- use full.
Table has a monotonically increasing column (auto-increment id, updated_at with triggers) -- use incremental. This is the most efficient mode for repeated runs.
Table is very large and you need parallel extraction -- use chunked with parallel > 1. Set chunk_column to the primary key. Watch out for sparse IDs (see Sparse IDs).
You only need recent data (e.g. last 7 days of events) -- use time_window. Set time_column and days_window.

Can I combine modes? No. Each export uses exactly one mode. If you need both incremental tracking and chunked extraction, use incremental for ongoing syncs and chunked for backfills.

Choosing a Profile

Profile	Source environment	Behavior
`fast`	Dedicated replica, data warehouse, trusted environment	Large batches, no throttle, no timeouts, minimal retries
`balanced`	General-purpose source, moderate concurrent load	10K batch, 50ms throttle, 5-min statement timeout, 3 retries
`safe`	Production OLTP, shared resources, fragile source	Small batches, 500ms throttle, 2-min timeout, 5 retries with long backoff

Decision rules:

Dedicated read replica or analytics database -- fast. You own the capacity.
Production database with other workloads -- balanced. Good default.
Production OLTP under high load, or a database you don't fully control -- safe. Rivet backs off aggressively and retries patiently.

You can always override individual fields (e.g. profile: safe with batch_size: 5000).

Config Reference

Source

Two mutually exclusive styles for specifying database credentials:

URL-based -- set exactly one of url, url_env, or url_file:

source:
  type: postgres   # postgres | mysql
  url: "postgresql://user:pass@host:port/db"
  tuning:          # optional, defaults to balanced
    profile: safe  # safe | balanced | fast

source:
  type: mysql
  url_env: DATABASE_URL   # read full URL from this env var

Structured -- specify individual connection fields:

source:
  type: postgres
  host: db.example.com
  port: 5433              # optional; defaults to 5432 (PG) / 3306 (MySQL)
  user: admin
  password_env: DB_PASS   # reads password from env var; or use 'password: literal'
  database: mydb

Field	Required	Notes
`host`	yes
`user`	yes
`database`	yes
`port`	no	defaults to 5432 (postgres) / 3306 (mysql)
`password`	no	plaintext; prefer `password_env`
`password_env`	no	env var name containing the password

URL-based and structured fields cannot be mixed. If both are present, validation rejects the config with a clear error.
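The exclusivity rules above can be sketched as a small validator. This is only an illustration of the stated rules, not Rivet's actual validation code: the function name, the returned labels, and the error messages are hypothetical.

```python
# Hypothetical sketch of the source-credential rules described above.
URL_FIELDS = ("url", "url_env", "url_file")
STRUCTURED_FIELDS = ("host", "port", "user", "password", "password_env", "database")
REQUIRED_STRUCTURED = ("host", "user", "database")


def validate_source(source):
    """Return "url" or "structured", or raise ValueError for an invalid mix."""
    url_set = [f for f in URL_FIELDS if f in source]
    structured_set = [f for f in STRUCTURED_FIELDS if f in source]

    # The two styles are mutually exclusive.
    if url_set and structured_set:
        raise ValueError("URL-based and structured fields cannot be mixed")

    if url_set:
        # Exactly one of url / url_env / url_file may be set.
        if len(url_set) != 1:
            raise ValueError("set exactly one of url, url_env, url_file")
        return "url"

    # Structured style: host, user, and database are required.
    missing = [f for f in REQUIRED_STRUCTURED if f not in source]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return "structured"


style = validate_source({"type": "postgres", "host": "db.example.com",
                         "user": "admin", "database": "mydb"})
```

Here `style` is `"structured"`; adding a `url` key to the same mapping would raise `ValueError`.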

Source Tuning

Controls how aggressively Rivet reads from the database. Three named profiles with individual field overrides:

source:
  type: postgres
  url: "..."
  tuning:
    profile: safe              # base profile
    batch_size: 3000           # override: rows per fetch
    throttle_ms: 300           # override: sleep between fetches
    statement_timeout_s: 60    # override: per-query timeout

Profile Defaults

Parameter	`fast`	`balanced` (default)	`safe`
`batch_size`	50,000	10,000	2,000
`throttle_ms`	0	50	500
`statement_timeout_s`	0 (none)	300	120
`max_retries`	1	3	5
`retry_backoff_ms`	1,000	2,000	5,000
`lock_timeout_s`	0 (none)	30	10

When to use each profile:

fast -- trusted environment, dedicated replica, need maximum throughput
balanced -- general purpose, moderate load on source
safe -- production OLTP database, shared resources, fragile source

If no tuning section is specified, balanced is used.
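As a rough illustration of how a base profile and per-field overrides combine, the sketch below overlays explicit fields on the defaults from the table above. The numbers come from that table; the function name and the merge logic are assumptions, not Rivet's implementation.

```python
# Defaults copied from the Profile Defaults table; the merge itself is an assumption.
PROFILE_DEFAULTS = {
    "fast": {"batch_size": 50_000, "throttle_ms": 0, "statement_timeout_s": 0,
             "max_retries": 1, "retry_backoff_ms": 1_000, "lock_timeout_s": 0},
    "balanced": {"batch_size": 10_000, "throttle_ms": 50, "statement_timeout_s": 300,
                 "max_retries": 3, "retry_backoff_ms": 2_000, "lock_timeout_s": 30},
    "safe": {"batch_size": 2_000, "throttle_ms": 500, "statement_timeout_s": 120,
             "max_retries": 5, "retry_backoff_ms": 5_000, "lock_timeout_s": 10},
}


def resolve_tuning(tuning=None):
    """Return effective settings: profile defaults overlaid with explicit overrides."""
    overrides = dict(tuning or {})
    # No tuning section (or no profile key) falls back to balanced.
    profile = overrides.pop("profile", "balanced")
    if profile not in PROFILE_DEFAULTS:
        raise ValueError(f"unknown profile: {profile}")
    return {**PROFILE_DEFAULTS[profile], **overrides}


effective = resolve_tuning({"profile": "safe", "batch_size": 3000,
                            "throttle_ms": 300, "statement_timeout_s": 60})
```

With the overrides from the YAML example above, `effective` keeps the `safe` values for `max_retries`, `retry_backoff_ms`, and `lock_timeout_s` while the three overridden fields take the explicit values.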

Exports

Each export defines a query, format, mode, and destination:

exports:
  - name: my_export            # unique name, used for state tracking
    query: "SELECT ..."        # SQL query to execute
    mode: full                 # full | incremental | chunked | time_window
    cursor_column: updated_at  # required for incremental mode
    format: parquet            # parquet | csv
    destination:
      type: local              # local | s3 | gcs | stdout
      path: ./output           # local: output directory

Meta Columns

Add metadata columns to every output row -- useful for deduplication and lineage on the raw/staging layer.

exports:
  - name: page_views
    query: "SELECT * FROM page_views"
    format: parquet
    meta_columns:
      exported_at: true   # adds _rivet_exported_at (UTC timestamp)
      row_hash: true      # adds _rivet_row_hash (Int64, lower 64 bits of xxHash3-128)
    destination:
      type: gcs
      bucket: my-bucket

Column	Type	Description
`_rivet_exported_at`	`Timestamp(us, UTC)`	When the batch was exported (same value for all rows in a batch)
`_rivet_row_hash`	`Int64`	Lower 64 bits of xxHash3-128 over all column values. Integer for fast `PARTITION BY` / `JOIN`.

Dedup pattern (e.g. in BigQuery / DuckDB):

SELECT * FROM raw_page_views
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY _rivet_row_hash
  ORDER BY _rivet_exported_at DESC
) = 1

Both fields are optional and default to false. When disabled, no extra columns are added.

Compression

Parquet compression is configurable per export. Default: zstd (better compression ratio than Snappy at comparable speed).

exports:
  - name: orders
    query: "SELECT * FROM orders"
    format: parquet
    compression: zstd            # zstd | snappy | gzip | lz4 | none
    compression_level: 9         # optional; zstd 1..22 (default 3), gzip 0..10 (default 6)
    destination:
      type: local
      path: ./output

Codec	Default level	Notes
`zstd`	3	Best ratio/speed tradeoff; new default
`snappy`	—	Fast, modest compression; previous default
`gzip`	6	Wide compatibility
`lz4`	—	Very fast decompression
`none`	—	No compression; largest files

CSV exports ignore the compression setting.

Skip Empty Exports

When running scheduled/incremental exports, zero new rows often means nothing changed. Use skip_empty to avoid creating empty files:

exports:
  - name: events_inc
    query: "SELECT * FROM events"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    skip_empty: true             # no file created when 0 rows; cursor not advanced
    destination:
      type: gcs
      bucket: my-bucket

When skip_empty: true and the query returns 0 rows:

No output file is created or uploaded
Cursor state is not advanced (safe to rerun)
Run summary shows status: skipped

Default: false (current behavior; 0-row exports still succeed with no file output).

Destinations

Local filesystem:

destination:
  type: local
  path: ./output

Amazon S3:

destination:
  type: s3
  bucket: my-bucket
  prefix: exports/data/
  region: us-east-1
  endpoint: https://...       # optional, for S3-compatible storage

Credentials: either omit key env fields and use the default AWS chain, or set both access_key_env and secret_key_env. Details: Credential precedence.

Google Cloud Storage:

destination:
  type: gcs
  bucket: my-bucket
  prefix: exports/data/
  endpoint: https://...       # optional
  credentials_file: /path/to/sa.json   # optional; omit to use ADC / env (see below)

GCS -- credentials: see Credential precedence. For day-to-day use on a workstation with a Google Cloud project, run gcloud auth application-default login and omit credentials_file; Rivet then uses Application Default Credentials (ADC).

Stdout (pipe to another tool):

destination:
  type: stdout

Writes file contents directly to stdout. Useful for piping into gzip, aws s3 cp -, or other streaming consumers. Only practical with a single export (multiple exports would interleave output).

Credential precedence

Rivet uses one predictable model for where secrets come from. Think of four layers (highest priority first). A higher layer wins when it applies; Rivet does not merge multiple cloud credential sources for the same destination.

Priority	Layer	Meaning
1	Config	Fields in `rivet.yaml` (URLs, `credentials_file`, names of env vars for S3 keys).
2	Environment variables	Process environment (`DATABASE_URL` via `url_env`, `${VAR}` expansion in `url`, `GOOGLE_APPLICATION_CREDENTIALS`, standard `AWS_*` variables).
3	ADC / instance identity	Provider default credentials with no explicit path in Rivet config (e.g. GCE/GKE metadata; local user ADC from `gcloud auth application-default login`).
4	File-based material	Secret content read from disk when a path is chosen by config or environment (e.g. `url_file`, `credentials_file`, or the file pointed to by `GOOGLE_APPLICATION_CREDENTIALS`). This is not a separate "guess"; it is always wired through layer 1 or 2.

Database (PostgreSQL / MySQL)

Two mutually exclusive styles:

URL-based -- set exactly one of source.url, source.url_env, or source.url_file. There is no fallback between them.

Mechanism	Resolution
`url`	Connection string from config. Placeholders `${VAR}` are expanded from the environment when the config file is loaded (missing variables become empty).
`url_env`	The entire URL is read from the named environment variable.
`url_file`	The entire URL is read from the file path given in config (trimmed).
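The `${VAR}` expansion described for `url` can be approximated as follows. This is a sketch under assumptions: the accepted placeholder syntax (letters, digits, underscore) and the function name are guesses, and only the "missing variables become empty" behavior is taken from the table above.

```python
import re

# Assumed placeholder syntax: ${NAME} with a conventional env-var name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env):
    """Replace ${VAR} with env[VAR]; a missing variable becomes an empty string."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), ""), text)


url = "postgresql://app:${DB_PASS}@db.example.com:5432/mydb"
with_secret = expand_env(url, {"DB_PASS": "s3cret"})
without_secret = expand_env(url, {})
```

Note the consequence of the empty-string rule: `without_secret` is `postgresql://app:@db.example.com:5432/mydb`, a syntactically valid URL with an empty password, so a forgotten variable typically surfaces as an auth failure rather than a config error.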

Structured -- set host, user, database (and optionally port, password / password_env). Rivet builds the connection URL internally.

Cloud "ADC" does not apply to database URLs.

Google Cloud Storage (GCS)

Step	Source
1	If `destination.credentials_file` is set -- use only that service account JSON path (config overrides env).
2	Else -- OpenDAL uses Google's default loader: `GOOGLE_APPLICATION_CREDENTIALS` (if set) -- JSON file at that path.
3	Else -- user ADC file from `gcloud auth application-default login` (well-known path under gcloud config).
4	Else -- GCE/GKE metadata-based service account when running on Google Cloud.

If you omit credentials_file, set RUST_LOG=info and look for a log line stating that the default Google credential chain is in use.
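The GCS resolution order in the table above reads naturally as a first-match chain. The sketch below only models that ordering; the real lookup is performed by OpenDAL and Google's credential loader, and the function name, parameters, and returned labels here are hypothetical.

```python
def gcs_credential_source(destination, env, adc_file_exists=False, on_google_cloud=False):
    """Return a label for the first credential source that applies, or None."""
    # Step 1: an explicit path in config wins over everything else.
    if destination.get("credentials_file"):
        return "credentials_file"
    # Step 2: service account JSON named by the environment.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "GOOGLE_APPLICATION_CREDENTIALS"
    # Step 3: user ADC file written by `gcloud auth application-default login`.
    if adc_file_exists:
        return "user_adc"
    # Step 4: metadata-based service account on GCE/GKE.
    if on_google_cloud:
        return "metadata_server"
    return None


chosen = gcs_credential_source(
    {"type": "gcs", "bucket": "my-bucket", "credentials_file": "/path/to/sa.json"},
    {"GOOGLE_APPLICATION_CREDENTIALS": "/other/sa.json"},
)
```

Here `chosen` is `"credentials_file"`: the config path overrides the environment variable, and the two sources are never merged.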

Amazon S3

Step	Source
1	If both `access_key_env` and `secret_key_env` are set -- read access key and secret only from those variable names (error if unset).
2	If neither is set -- OpenDAL's default AWS chain: environment variables, shared config files (e.g. `~/.aws/credentials`), then EC2/ECS instance metadata (IAM role).

Setting only one of access_key_env or secret_key_env is invalid and rejected at config validation.
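The both-or-neither rule for S3 keys can be sketched like this. It is an illustration of the rule as documented, not Rivet's code: the function name, return shape, and error messages are hypothetical.

```python
def s3_credential_source(destination, env):
    """Return ("explicit", key, secret) or ("default_chain", None, None)."""
    access_name = destination.get("access_key_env")
    secret_name = destination.get("secret_key_env")

    # Exactly one of the two fields is a config error.
    if bool(access_name) != bool(secret_name):
        raise ValueError("set both access_key_env and secret_key_env, or neither")

    if access_name and secret_name:
        # Both names given: the variables must exist; nothing else is consulted.
        missing = [n for n in (access_name, secret_name) if n not in env]
        if missing:
            raise KeyError("unset environment variables: " + ", ".join(missing))
        return ("explicit", env[access_name], env[secret_name])

    # Neither given: defer to the default AWS chain.
    return ("default_chain", None, None)


source = s3_credential_source(
    {"type": "s3", "bucket": "my-bucket",
     "access_key_env": "MY_KEY", "secret_key_env": "MY_SECRET"},
    {"MY_KEY": "AKIAEXAMPLE", "MY_SECRET": "example-secret"},
)
```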

Auth Diagnostics

rivet doctor verifies that source and destination credentials are valid before you run any exports:

$ rivet doctor --config rivet.yaml

rivet doctor: verifying auth for config 'rivet.yaml'

[OK]  Config parsed successfully
[OK]  Source auth (Postgres)
[OK]  Destination S3(my-bucket)
[FAIL] Destination GCS(other-bucket) -- auth error: loading credential ...

Some checks failed. Fix the issues above before running exports.

Error categories:

Category	Meaning
`auth error`	Credentials are missing, expired, or rejected
`connectivity error`	Cannot reach the host (DNS, firewall, timeout)
`bucket not found`	Bucket or path does not exist
`error`	Other / uncategorized

Preflight Check

rivet check analyzes each export before running it. It connects to the source database, runs EXPLAIN on each query, and reports strategy, row estimates, verdicts, profile recommendations, and warnings:

$ rivet check --config rivet.yaml

Export: orders_incremental
  Strategy:     incremental(updated_at)
  Mode:         incremental (cursor: updated_at)
  Row estimate: ~1M
  Cursor range: 2024-01-01 .. 2025-01-30
  Scan type:    Index Scan using idx_orders_updated_at
  Verdict:      EFFICIENT
  Recommended:  tuning.profile: fast

Export: events_full
  Strategy:     full-scan
  Mode:         full
  Row estimate: ~5M
  Scan type:    Seq Scan on events
  Verdict:      DEGRADED
  Recommended:  tuning.profile: safe
  Suggestion:   No index detected -- full table scan. Add an indexed cursor
                column and switch to incremental mode. Use 'safe' tuning
                profile to limit database impact.

Export: orders_chunked
  Strategy:     chunked-parallel(id, size=100000, p=4)
  Mode:         chunked (column: id, size: 100000)
  Row estimate: ~10M
  Cursor range: 1 .. 50000000
  Scan type:    Index Scan using orders_pkey
  Verdict:      ACCEPTABLE
  Recommended:  tuning.profile: safe
  Warning:      Sparse key range: ~99% of chunk windows will be empty ...
  Suggestion:   Large dataset (~10M rows). Add parallel > 1 to speed up ...

Strategy Names

Strategy	When
`full-scan`	`mode: full`, parallel=1
`full-parallel(N)`	`mode: full`, parallel > 1
`incremental(col)`	`mode: incremental`
`chunked(col, size=N)`	`mode: chunked`, parallel=1
`chunked-parallel(col, size=N, p=P)`	`mode: chunked`, parallel > 1
`time-window(col, Nd)`	`mode: time_window`

Profile Recommendation

rivet check recommends a tuning profile based on row estimate and index usage:

Condition	Recommendation
Indexed, < 1M rows	`fast`
Indexed, 1M-10M rows	`balanced`
Indexed, > 10M rows	`safe`
No index, < 100K rows	`fast` (or `balanced` with parallel)
No index, 100K-1M rows	`balanced`
No index, > 1M rows	`safe`
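The recommendation table above amounts to a two-level lookup on index usage and row estimate. The sketch below encodes it directly; how exact boundary values (for example, precisely 1M rows) are bucketed is an assumption, and the "(or balanced with parallel)" variant is left out.

```python
def recommend_profile(row_estimate, indexed):
    """Map a row estimate and index usage to a tuning profile name."""
    if indexed:
        if row_estimate < 1_000_000:
            return "fast"
        if row_estimate <= 10_000_000:  # boundary handling is an assumption
            return "balanced"
        return "safe"
    if row_estimate < 100_000:
        return "fast"
    if row_estimate <= 1_000_000:  # boundary handling is an assumption
        return "balanced"
    return "safe"


# A ~5M-row export planned as a sequential scan, like events_full in the sample output.
events_full = recommend_profile(5_000_000, indexed=False)
```

`events_full` evaluates to `"safe"`, in line with the recommendation shown for that export in the sample output.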

Warnings

Warning	Trigger
Sparse key range	Chunked mode with < 10% density (range >> row count)
Dense surrogate sort cost	Query uses `ROW_NUMBER()` in chunked mode
Parallel memory risk	`parallel > 1` on > 5M rows

Verdicts

Verdict	Meaning
`EFFICIENT`	Index scan on cursor column, reasonable row count (< 10M)
`ACCEPTABLE`	Index scan but very large dataset, or partial index coverage
`DEGRADED`	Full table scan detected, but row count is manageable
`UNSAFE`	Full scan on very large table (> 50M rows) without index support

Suggestions are mode-aware: full exports recommend switching to incremental, chunked exports recommend indexing the chunk column, time-window exports recommend indexing the time column.

Incremental Mode

When mode: incremental is set, Rivet:

Reads the last exported cursor value from its SQLite state database
Appends WHERE <cursor_column> > '<last_value>' to the query
After a successful export, updates the cursor to the last row's value

The state database (.rivet_state.db) is created next to your config file.
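To make the cursor filter concrete, here is a minimal sketch of how a stored cursor value could be turned into a filtered query. Wrapping the original query in a subquery and the `_src` alias are assumptions made so the filter also works when the query already has its own WHERE clause; Rivet may rewrite the query differently, and real code should prefer bound parameters over string interpolation.

```python
def build_incremental_query(query, cursor_column, last_value=None):
    """Return the query to run for this pass, filtered by the stored cursor if any."""
    base = query.strip().rstrip(";")
    if last_value is None:
        # First run: no cursor stored yet, export everything.
        return base
    # Escape single quotes so the literal stays well-formed (illustration only).
    literal = str(last_value).replace("'", "''")
    return f"SELECT * FROM ({base}) AS _src WHERE {cursor_column} > '{literal}'"


first_run = build_incremental_query(
    "SELECT id, name, email, updated_at FROM users", "updated_at")
next_run = build_incremental_query(
    "SELECT id, name, email, updated_at FROM users", "updated_at",
    last_value="2025-01-30 00:00:00")
```

On the first run the query is unchanged; on later runs only rows with `updated_at` strictly greater than the stored value are selected.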

Chunked Extraction

Rivet never loads an entire table into memory with a single query. Instead:

PostgreSQL: Uses server-side cursors (DECLARE CURSOR / FETCH N) to read batch_size rows at a time
MySQL: Uses streaming result sets (query_iter()) to read rows incrementally

Between batches, Rivet sleeps for throttle_ms milliseconds, giving the database breathing room.
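The fetch-then-sleep pattern can be illustrated with a small generator. This sketch uses Python's built-in `sqlite3` module purely as a stand-in source so that it runs anywhere; it does not use PostgreSQL server-side cursors or MySQL streaming result sets, and the function name is hypothetical.

```python
import sqlite3
import time


def read_in_batches(cursor, batch_size, throttle_ms):
    """Yield lists of at most batch_size rows, sleeping between fetches."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break  # source exhausted
        yield rows
        # Give the source breathing room before the next fetch.
        time.sleep(throttle_ms / 1000)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO events (id) VALUES (?)", [(i,) for i in range(1, 11)])

cur = conn.execute("SELECT id FROM events ORDER BY id")
batch_sizes = [len(batch) for batch in read_in_batches(cur, batch_size=4, throttle_ms=1)]
conn.close()
```

Ten rows read with `batch_size=4` arrive as batches of 4, 4, and 2 rows, so memory use is bounded by the batch size rather than the table size.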

Sparse IDs (gaps in the key range)

Chunked mode uses MIN(chunk_column) and MAX(chunk_column) from your export query, then issues WHERE chunk_column BETWEEN start AND end for each window. If the primary key is sparse (huge spread between min and max, few rows), most windows cover no rows but the database still plans and scans for each range.
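A toy example shows why sparse keys hurt. The sketch below generates inclusive windows between a minimum and maximum key and counts how many contain any rows; the function names and the tiny data set are illustrative assumptions, and only the BETWEEN windowing is taken from the description above.

```python
def chunk_windows(min_id, max_id, chunk_size):
    """Yield inclusive (start, end) ranges covering min_id..max_id."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield start, end
        start = end + 1


def window_predicate(column, start, end):
    """Render the range filter for one window."""
    return f"{column} BETWEEN {start} AND {end}"


# Four rows spread over a key range of 1..1000: a sparse key.
ids = [1, 2, 3, 1000]
windows = list(chunk_windows(min(ids), max(ids), chunk_size=250))
non_empty = [w for w in windows if any(w[0] <= i <= w[1] for i in ids)]
density = len(ids) / (max(ids) - min(ids) + 1)
```

Here four windows are issued but only two of them, `(1, 250)` and `(751, 1000)`, return rows, and `density` is 0.004, far below the 10% threshold used by the sparse key range warning.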

Mitigation: chunk on a dense surrogate computed in SQL, for example ROW_NUMBER() OVER (ORDER BY id) AS chunk_rownum, and set chunk_column: chunk_rownum in the export. Then min/max match the row count, not the physical id span. A commented PostgreSQL example lives at tests/fixtures/migrations/001_sparse_chunk_column_example.sql.

Cost tradeoff: ORDER BY id (and therefore that window) is not free. The planner usually needs a global ordering of the rows you export: often a sort over the whole result, or an index scan on id if the shape of the query allows it -- either way you pay once per export pass, and under concurrent writes the ordering is tied to a snapshot. You are trading many cheap-but-useless BETWEEN probes on a sparse key for fewer chunk queries that each touch real rows, at the price of establishing dense row numbers. For very large or hot tables, prefer incremental mode on an indexed cursor column, a precomputed dense key (column or side table populated by batch jobs), or a materialized view refreshed off the critical path, if that fits your workload better than a window over live data.

Run Summary

After each export, Rivet prints a structured summary to stdout:

── orders ──
  run_id:      orders_20260329T125109.336
  status:      success
  rows:        150000
  files:       1
  bytes:       12.4 MB
  duration:    3.2s
  peak RSS:    142MB
  validated:   pass
  schema:      unchanged

All summary fields are also persisted to the metrics table and visible via rivet metrics. The run_id links the summary to the corresponding rows in export_metrics and file_manifest tables.

Field	Description
`run_id`	Canonical identifier for this run (links summary, metrics, and files)
`status`	`success`, `failed`, or `skipped` (0 rows with `skip_empty: true`)
`rows`	Total rows extracted
`files`	Number of files produced (1 for single-file modes; N for chunked)
`bytes`	Total file size before upload
`duration`	Wall-clock time for the export
`peak RSS`	Peak process RSS during the export (MB)
`retries`	Number of retry attempts (0 if no retries needed)
`validated`	`pass` if `--validate` succeeded; omitted if not requested
`schema`	`unchanged` or `CHANGED`; omitted on first run
`error`	Error message (only on failure)

File manifest

Every file produced by Rivet is recorded in the file_manifest table. Use rivet state files to inspect:

$ rivet state files --config rivet.yaml
RUN ID                              FILE                                         ROWS      BYTES CREATED
--------------------------------------------------------------------------------------------------------------
orders_20260329T125143.912          orders_20260329_125200_chunk3.parquet        50000    17.4 MB 2026-03-29T12:52:00+00:00
orders_20260329T125143.912          orders_20260329_125156_chunk2.parquet        50000    17.4 MB 2026-03-29T12:51:56+00:00

This enables post-run reconciliation: verify which run created which files and confirm row counts match expectations.

Execution Semantics

Export lifecycle

Every export follows a strict sequence. If any step fails, the entire export fails; state is never updated on failure.

1. Config load + validation
2. State read (load cursor for incremental; load schema for tracking)
3. Source connect (new connection per attempt)
4. Query start
   - full/incremental: single query
   - chunked: detect min/max, generate range queries
   - time_window: rewrite query with WHERE clause
5. Batch loop
   a. FETCH batch_size rows → Arrow RecordBatch
   b. FormatWriter.write_batch() → temp file (flush per batch)
   c. Sleep throttle_ms
   d. Repeat until source exhausted
6. FormatWriter.finish() → finalize temp file
7. Validate (if --validate): read back temp file, compare row count
8. Destination.write() → upload temp file to local/S3/GCS
9. State update (incremental only): advance cursor to last row's value
10. Schema tracking: compare columns with stored schema, warn on change
11. Metrics: record run result (duration, rows, RSS, status)

State update point

The cursor advances only after step 8 (successful upload). If any step fails, the cursor stays at its previous value. This means:

A failed export can be safely re-run without skipping data.
A successful upload followed by a process crash before step 9 causes the next run to re-export rows already uploaded (at-least-once semantics -- see Duplicates below).
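The crash window between upload and cursor update can be simulated in a few lines. This is a toy model with in-memory stand-ins for the destination and the state database; the names and the `Crash` exception are hypothetical, and it only mirrors the ordering of steps 8 and 9.

```python
class Crash(Exception):
    """Simulated process crash."""


def run_export(rows, state, uploaded, crash_before_state_update=False):
    cursor = state.get("cursor", 0)
    batch = sorted(r for r in rows if r > cursor)
    if not batch:
        return
    uploaded.append(batch)          # step 8: upload succeeds
    if crash_before_state_update:
        raise Crash()               # process dies before step 9
    state["cursor"] = batch[-1]     # step 9: advance cursor


rows = [1, 2, 3]
state, uploaded = {}, []

try:
    run_export(rows, state, uploaded, crash_before_state_update=True)
except Crash:
    pass  # the cursor was never advanced

run_export(rows, state, uploaded)   # the re-run exports the same rows again

# Downstream dedup restores a single copy of each row.
deduplicated = sorted({r for batch in uploaded for r in batch})
```

After the re-run, `uploaded` holds two identical batches and `state["cursor"]` is 3: no rows were skipped, but every row was delivered twice, which is why downstream deduplication is recommended.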

Duplicates

Rivet provides at-least-once delivery. Duplicates can occur in these scenarios:

Scenario	Cause	Mitigation
Crash after upload, before cursor update	Cursor is not advanced; next run re-exports the same window	Downstream dedup on primary key + cursor column
`time_window` with overlapping windows	Rows near the boundary appear in consecutive windows	Downstream dedup or idempotent merge
`incremental` with non-monotonic cursor	Rows inserted with cursor values older than the last exported value are missed; rows updated after export may be re-exported	Use a strictly monotonic column (e.g. auto-increment id, `updated_at` with triggers)
`chunked` with concurrent writes	New rows inserted during export may land in already-processed ranges	Accept overlap or run during quiescent periods

Rivet never claims exactly-once delivery. Design downstream pipelines to tolerate duplicates.

Retry semantics

On failure, Rivet classifies the error and decides whether to retry:

Category	Retry?	Reconnect?	Extra delay	Examples
Network	yes	yes	--	connection reset, broken pipe, DNS, SSL, EOF
MySQL disconnect	yes	yes	--	server gone away, lost connection
Timeout	yes	no	--	statement timeout, lock wait timeout
Capacity	yes	yes	+15s	too many connections, DB starting/shutting down
Deadlock	yes	no	+1s	deadlock detected, serialization failure
Auth/permission	no	--	--	permission denied, access denied, invalid credentials
Permanent	no	--	--	syntax error, table not found, column not found

On each retry, a fresh connection is created (never reuses a failed connection). Backoff is exponential: retry_backoff_ms * 2^(attempt-1) + extra_delay. The tuning profile controls max_retries and retry_backoff_ms.
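The backoff formula can be worked through with the profile defaults. The sketch below assumes the first retry is attempt 1 and uses hypothetical category keys for the extra delays listed in the table (+15s for capacity errors, +1s for deadlocks).

```python
# Extra delays from the retry table; the dictionary keys are hypothetical labels.
EXTRA_DELAY_MS = {"capacity": 15_000, "deadlock": 1_000}


def retry_delay_ms(attempt, retry_backoff_ms, category=None):
    """retry_backoff_ms * 2^(attempt-1) + extra_delay, with attempt starting at 1."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return retry_backoff_ms * 2 ** (attempt - 1) + EXTRA_DELAY_MS.get(category, 0)


def retry_schedule(max_retries, retry_backoff_ms, category=None):
    """Delays (ms) before each retry, in order."""
    return [retry_delay_ms(a, retry_backoff_ms, category)
            for a in range(1, max_retries + 1)]


safe_schedule = retry_schedule(max_retries=5, retry_backoff_ms=5_000)
balanced_capacity = retry_schedule(max_retries=3, retry_backoff_ms=2_000,
                                   category="capacity")
```

Under these assumptions the `safe` profile waits 5s, 10s, 20s, 40s, and 80s across its five retries, while a capacity error under `balanced` waits 17s, 19s, and 23s.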

Validation semantics

--validate re-reads the temp file after writing and compares the row count against the number of rows received from the source:

Parquet: opens the file with the Arrow reader and reads num_rows from footer metadata.
CSV: counts newlines (excluding header).

What it proves: the file on disk contains the expected number of rows (catches truncated writes, corrupt footers, I/O errors during flush).

What it does not prove:

Cell-level correctness (no checksum on individual values).
Source-to-file semantic equivalence (no re-query of the database to compare).
Post-upload integrity (the file is validated before upload, not after).
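For CSV, the row-count check described above can be approximated by counting newline bytes and subtracting the header line. The sketch assumes the file ends with a trailing newline and contains no newlines inside quoted fields; the function name is hypothetical and this is not Rivet's implementation.

```python
import os
import tempfile


def count_csv_rows(path):
    """Count data rows as newline bytes minus one header line."""
    newlines = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            newlines += chunk.count(b"\n")
    return max(newlines - 1, 0)


# Write a small temp file, then compare against the rows "received" from the source.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,name\n1,alice\n2,bob\n")
    path = tmp.name

rows_received = 2
rows_on_disk = count_csv_rows(path)
os.remove(path)
validated = rows_on_disk == rows_received
```

A truncated write that dropped the last line would make `rows_on_disk` smaller than `rows_received`, which is exactly the class of failure this check is meant to catch; a wrong value inside a cell would go unnoticed.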

Supported Type Mappings

PostgreSQL	MySQL	Arrow / Parquet
`BOOL`	`BIT`	Boolean
`INT2` / `SMALLINT`	`TINYINT`, `SMALLINT`	Int16
`INT4` / `INT`	`INT`, `MEDIUMINT`	Int32
`INT8` / `BIGINT`	`BIGINT`	Int64
`FLOAT4`	`FLOAT`	Float32
`FLOAT8`	`DOUBLE`	Float64
`TEXT`, `VARCHAR`	`VARCHAR`, `TEXT`	Utf8 (String)
`BYTEA`	`BLOB` (binary charset)	Binary
`DATE`	`DATE`	Date32
`TIMESTAMP(TZ)`	`DATETIME`, `TIMESTAMP`	Timestamp(us)
`NUMERIC`	`DECIMAL`	Utf8 (stringified)
`JSON` / `JSONB`	`JSON`	Utf8
`UUID`	--	Utf8

Guarantees and Limitations

What Rivet guarantees

At-least-once delivery: if an export succeeds, all rows matching the query are written to at least one output file.
State atomicity per export: cursor state is updated only after successful upload. A crash mid-export does not advance the cursor.
Schema change detection: Rivet warns when columns are added, removed, or change type between runs.
Validation on demand: --validate confirms row counts match between source read and file on disk.
Predictable auth: credentials are resolved in a documented 4-layer order; no silent fallback surprises.

What Rivet does NOT guarantee

No exactly-once delivery: duplicates can occur on crash recovery, overlapping windows, or non-monotonic cursors.
No cell-level validation: --validate checks row count, not individual cell values or checksums.
No CDC / real-time: Rivet runs point-in-time queries; it does not read WAL, binlog, or change streams.
No load / merge: Rivet produces files. Loading them into a warehouse is your responsibility.
No distributed execution: Rivet runs on a single machine. parallel spawns threads, not remote workers.
No transactional consistency across exports: each export runs its own query; there is no cross-export snapshot isolation.
No encryption: output files are written in plaintext. Encrypt at the destination level if needed.

See Execution Semantics for detailed lifecycle, state update, duplicate, retry, and validation rules.

Development

For pilot documentation (per-mode guides, destination setup, annotated YAML examples), see docs/.

For a step-by-step onboarding guide (from installation to production-ready exports), see USER_GUIDE.md.

For a manual user acceptance checklist (CLI, modes, destinations, compression, skip-empty), see USER_TEST_PLAN.md.

Local Setup

Start PostgreSQL and MySQL with Docker:

docker compose up -d

Seed both databases with test data (100K users, ~1M orders, ~5M events):

cargo run --release --bin seed -- --target both --users 100000

The seed tool supports flags:

--target postgres|mysql|both    # which database to seed
--users N                       # number of users (default: 100000)
--orders-per-user N             # avg orders per user (default: 10)
--events-per-user N             # avg events per user (default: 50)
--batch-size N                  # insert batch size (default: 1000)
--pg-url URL                    # PostgreSQL connection URL
--mysql-url URL                 # MySQL connection URL

Toolchain

The project pins Rust 1.94 via rust-toolchain.toml. Install with:

rustup install 1.94

Running Tests

cargo test              # 617 unit + integration tests (no database needed)
cargo test -- --nocapture  # with output
cargo clippy --all-targets -- -D warnings  # lint check
cargo fmt --all -- --check                 # format check

End-to-end scripts (Docker Compose must be up, rivet built):

bash dev/test_permissions.sh
bash dev/test_schema_evolution.sh

CI

GitHub Actions runs on every push/PR to master/main:

Rustfmt — formatting check
Clippy — lint check with -D warnings
Tests — full test suite
Release build — ensures cargo build --release succeeds
Security audit — cargo audit via rustsec/audit-check

Roadmap

See rivet_roadmap.md for the full roadmap (strategy + execution status).

Next milestones:

Milestone	Focus
v0.2.0 (stable)	Cross-platform release binaries, E2E test matrix, `cargo publish`, Docker image
v0.3.0	Source count reconciliation, crash/recovery tests, data shape drift detection, curated example configs
Future	CDC mode, Iceberg/Delta output, webhook destination, multi-source joins, plugin system

rivet-cli 0.2.0-beta.2