rivet-cli 0.2.0-beta.2

CLI tool to export PostgreSQL and MySQL to Parquet/CSV (local, S3, GCS) with tuning, preflight checks, and SQLite-backed state.
Documentation

Rivet

Lightweight, source-safe data extraction from PostgreSQL and MySQL to Parquet/CSV.

Rivet is a CLI tool that exports query results from relational databases to files -- locally or in cloud storage (S3, GCS). It is extract-only: no loading, no merging, no CDC. It is designed to be gentle on production databases through tuning profiles, preflight health checks, and intelligent retry with backoff.

What Rivet does

  • Extracts data from PostgreSQL and MySQL via standard SQL queries
  • Writes Parquet (zstd-compressed by default; snappy, gzip, lz4, none) or CSV files
  • Uploads to local disk, Amazon S3, Google Cloud Storage, or stdout (pipe workflows)
  • Tracks incremental state in SQLite so the next run picks up where the last left off
  • Diagnoses source health before extraction (rivet check)
  • Verifies auth for all sources and destinations before running (rivet doctor)
  • Prints a structured run summary after each export (run ID, rows, files, bytes, duration, RSS, retries, schema changes)
  • Persists metrics history, schema tracking, and file manifest in SQLite
  • Recommends parallelism level and tuning profile in preflight checks
  • Parameterized queries via --param key=value and ${key} placeholders
  • Data quality checks — row count bounds, null ratio thresholds, uniqueness assertions
  • File size splittingmax_file_size: 512MB automatically splits output into parts
  • Memory-based batch sizingbatch_size_memory_mb: 256 auto-tunes batch size from schema width
  • Slack notifications on failure, schema change, or degraded verdict

What Rivet does NOT do

  • No loading/merging -- it produces files; you bring them into a warehouse yourself
  • No CDC -- no WAL/binlog reading; query-based extraction only
  • No orchestration -- no built-in scheduler; use cron, Airflow, or similar
  • No exactly-once delivery -- at-least-once; duplicates are possible (see Execution Semantics)
  • No web UI / API -- CLI and YAML config only

Documentation language: English-only. See CONTRIBUTING.md.

New to Rivet? Start with the Pilot Documentation — step-by-step guides for every export mode, destination, and YAML parameter, plus quickstart templates for your first export.

Installation

Homebrew (macOS / Linux) — recommended

brew tap panchenkoai/rivet
brew update
brew install rivet
rivet --version

Pre-built binaries

Download the latest release for your platform from GitHub Releases:

# macOS (Apple Silicon)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (arm64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

Verify:

rivet --version

Docker

Try Rivet without installing anything — mount your config and output directory:

docker run --rm \
  -v $(pwd)/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml

Or check the version and explore commands:

docker run --rm ghcr.io/panchenkoai/rivet:latest --version
docker run --rm ghcr.io/panchenkoai/rivet:latest --help

Pass environment variables for credentials:

docker run --rm \
  -e DATABASE_URL="postgresql://user:pass@host:5432/db" \
  -v $(pwd)/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml

Note: To connect to a database running on your host machine, use host.docker.internal instead of localhost in the connection URL.

Build from source

Requires Rust 1.94+:

git clone https://github.com/panchenkoai/rivet.git
cd rivet
cargo build --release
# binary is at target/release/rivet

Quick Start

  1. Create a config file rivet.yaml:
source:
  type: postgres
  url: "postgresql://user:pass@localhost:5432/mydb"
  tuning:
    profile: safe

exports:
  - name: users
    query: "SELECT id, name, email, updated_at FROM users"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    destination:
      type: local
      path: ./output
  1. Run preflight check to diagnose source health:
rivet check --config rivet.yaml
  1. Verify auth for source and all destinations:
rivet doctor --config rivet.yaml
  1. Run the export:
RUST_LOG=info rivet run --config rivet.yaml
  1. Check state:
rivet state show --config rivet.yaml

Working with the binary

Once installed, rivet is a single self-contained binary with no runtime dependencies (no JVM, no Python, no Docker required).

Typical workflow:

# 1. Preflight: check that the source DB is reachable and healthy
rivet check --config rivet.yaml

# 2. Auth: verify credentials for source + all destinations (S3, GCS, etc.)
rivet doctor --config rivet.yaml

# 3. Export: run all exports defined in the config
RUST_LOG=info rivet run --config rivet.yaml

# 4. Inspect: view cursor state and file manifest
rivet state show --config rivet.yaml
rivet state files --config rivet.yaml

# 5. Re-run: only new/changed rows are exported (incremental mode)
RUST_LOG=info rivet run --config rivet.yaml

Useful flags:

rivet run --config rivet.yaml --export users      # run a single export
rivet run --config rivet.yaml --validate           # reconcile row counts after write
rivet run --config rivet.yaml --param env=prod     # parameterized queries
rivet state reset --config rivet.yaml --export users  # reset cursor to re-export from scratch

Logging:

Rivet uses RUST_LOG for verbosity:

RUST_LOG=debug rivet run --config rivet.yaml    # verbose (SQL, batch timings, retries)
RUST_LOG=info  rivet run --config rivet.yaml    # normal (progress, summary)
RUST_LOG=warn  rivet run --config rivet.yaml    # quiet (errors and warnings only)

Shell completions:

# Bash
rivet completions bash > ~/.local/share/bash-completion/completions/rivet

# Zsh
rivet completions zsh > ~/.zfunc/_rivet

# Fish
rivet completions fish > ~/.config/fish/completions/rivet.fish

CLI Reference

rivet run --config <path>                          # run all exports
rivet run --config <path> --export <name>          # run a specific export
rivet run --config <path> --validate               # verify row counts after write
rivet check --config <path>                        # preflight check all exports
rivet check --config <path> --export <name>        # preflight check one export
rivet doctor --config <path>                       # verify source + destination auth
rivet state show --config <path>                   # show cursor state
rivet state reset --config <path> --export <name>  # reset cursor
rivet state files --config <path>                  # show file manifest (which run created which files)
rivet metrics --config <path>                      # show export run history
rivet metrics --config <path> --export <name>      # metrics for one export
rivet metrics --config <path> --last N             # last N runs (default 20)
rivet completions <shell>                          # generate shell completions (bash|zsh|fish|powershell)

Shell completions:

# zsh (add to ~/.zshrc)
rivet completions zsh > ~/.zfunc/_rivet

# bash
rivet completions bash > /etc/bash_completion.d/rivet

# fish
rivet completions fish > ~/.config/fish/completions/rivet.fish

Set RUST_LOG=info (or debug) for detailed logging:

RUST_LOG=info rivet run --config rivet.yaml

Choosing a Mode

Mode Best for Key behavior
full Small tables, snapshots, one-off exports Exports entire query result every run
incremental Append-only or update-tracked tables Resumes from the last exported value of cursor_column
chunked Very large tables (10M+ rows) Splits into ID-range windows; supports parallel > 1 for concurrent extraction
time_window Event logs, append-mostly data with timestamps Exports only the last N days based on a time/date column

Decision rules:

  1. Table < 1M rows, full snapshot needed -- use full.
  2. Table has a monotonically increasing column (auto-increment id, updated_at with triggers) -- use incremental. This is the most efficient mode for repeated runs.
  3. Table is very large and you need parallel extraction -- use chunked with parallel > 1. Set chunk_column to the primary key. Watch out for sparse IDs (see Sparse IDs).
  4. You only need recent data (e.g. last 7 days of events) -- use time_window. Set time_column and days_window.

Can I combine modes? No. Each export uses exactly one mode. If you need both incremental tracking and chunked extraction, use incremental for ongoing syncs and chunked for backfills.

Choosing a Profile

Profile Source environment Behavior
fast Dedicated replica, data warehouse, trusted environment Large batches, no throttle, no timeouts, minimal retries
balanced General-purpose source, moderate concurrent load 10K batch, 50ms throttle, 5-min statement timeout, 3 retries
safe Production OLTP, shared resources, fragile source Small batches, 500ms throttle, 2-min timeout, 5 retries with long backoff

Decision rules:

  1. Dedicated read replica or analytics database -- fast. You own the capacity.
  2. Production database with other workloads -- balanced. Good default.
  3. Production OLTP under high load, or a database you don't fully control -- safe. Rivet backs off aggressively and retries patiently.

You can always override individual fields (e.g. profile: safe with batch_size: 5000).

Config Reference

Source

Two mutually exclusive styles for specifying database credentials:

URL-based -- set exactly one of url, url_env, or url_file:

source:
  type: postgres   # postgres | mysql
  url: "postgresql://user:pass@host:port/db"
  tuning:          # optional, defaults to balanced
    profile: safe  # safe | balanced | fast
source:
  type: mysql
  url_env: DATABASE_URL   # read full URL from this env var

Structured -- specify individual connection fields:

source:
  type: postgres
  host: db.example.com
  port: 5433              # optional; defaults to 5432 (PG) / 3306 (MySQL)
  user: admin
  password_env: DB_PASS   # reads password from env var; or use 'password: literal'
  database: mydb
Field Required Notes
host yes
user yes
database yes
port no defaults to 5432 (postgres) / 3306 (mysql)
password no plaintext; prefer password_env
password_env no env var name containing the password

URL-based and structured fields cannot be mixed. If both are present, validation rejects the config with a clear error.

Source Tuning

Controls how aggressively rivet reads from the database. Three named profiles with individual field overrides:

source:
  type: postgres
  url: "..."
  tuning:
    profile: safe              # base profile
    batch_size: 3000           # override: rows per fetch
    throttle_ms: 300           # override: sleep between fetches
    statement_timeout_s: 60    # override: per-query timeout

Profile Defaults

Parameter fast balanced (default) safe
batch_size 50,000 10,000 2,000
throttle_ms 0 50 500
statement_timeout_s 0 (none) 300 120
max_retries 1 3 5
retry_backoff_ms 1,000 2,000 5,000
lock_timeout_s 0 (none) 30 10

When to use each profile:

  • fast -- trusted environment, dedicated replica, need maximum throughput
  • balanced -- general purpose, moderate load on source
  • safe -- production OLTP database, shared resources, fragile source

If no tuning section is specified, balanced is used.

Exports

Each export defines a query, format, mode, and destination:

exports:
  - name: my_export            # unique name, used for state tracking
    query: "SELECT ..."        # SQL query to execute
    mode: full                 # full | incremental
    cursor_column: updated_at  # required for incremental mode
    format: parquet            # parquet | csv
    destination:
      type: local              # local | s3 | gcs
      path: ./output           # local: output directory

Meta Columns

Add metadata columns to every output row -- useful for deduplication and lineage on the raw/staging layer.

exports:
  - name: page_views
    query: "SELECT * FROM page_views"
    format: parquet
    meta_columns:
      exported_at: true   # adds _rivet_exported_at (UTC timestamp)
      row_hash: true      # adds _rivet_row_hash (xxh3_128 hex)
    destination:
      type: gcs
      bucket: my-bucket
Column Type Description
_rivet_exported_at Timestamp(us, UTC) When the batch was exported (same value for all rows in a batch)
_rivet_row_hash Int64 Lower 64 bits of xxHash3-128 over all column values. Integer for fast PARTITION BY / JOIN.

Dedup pattern (e.g. in BigQuery / DuckDB):

SELECT * FROM raw_page_views
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY _rivet_row_hash
  ORDER BY _rivet_exported_at DESC
) = 1

Both fields are optional and default to false. When disabled, no extra columns are added.

Compression

Parquet compression is configurable per export. Default: zstd (better compression ratio than Snappy at comparable speed).

exports:
  - name: orders
    query: "SELECT * FROM orders"
    format: parquet
    compression: zstd            # zstd | snappy | gzip | lz4 | none
    compression_level: 9         # optional; zstd 1..22 (default 3), gzip 0..10 (default 6)
    destination:
      type: local
      path: ./output
Codec Default level Notes
zstd 3 Best ratio/speed tradeoff; new default
snappy Fast, modest compression; previous default
gzip 6 Wide compatibility
lz4 Very fast decompression
none No compression; largest files

CSV exports ignore the compression setting.

Skip Empty Exports

When running scheduled/incremental exports, zero new rows often means nothing changed. Use skip_empty to avoid creating empty files:

exports:
  - name: events_inc
    query: "SELECT * FROM events"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    skip_empty: true             # no file created when 0 rows; cursor not advanced
    destination:
      type: gcs
      bucket: my-bucket

When skip_empty: true and the query returns 0 rows:

  • No output file is created or uploaded
  • Cursor state is not advanced (safe to rerun)
  • Run summary shows status: skipped

Default: false (current behavior; 0-row exports still succeed with no file output).

Destinations

Local filesystem:

destination:
  type: local
  path: ./output

Amazon S3:

destination:
  type: s3
  bucket: my-bucket
  prefix: exports/data/
  region: us-east-1
  endpoint: https://...       # optional, for S3-compatible storage

Credentials: either omit key env fields and use the default AWS chain, or set both access_key_env and secret_key_env. Details: Credential precedence.

Google Cloud Storage:

destination:
  type: gcs
  bucket: my-bucket
  prefix: exports/data/
  endpoint: https://...       # optional
  credentials_file: /path/to/sa.json   # optional; omit to use ADC / env (see below)

GCS -- credentials: see Credential precedence. For day-to-day use on a workstation with a Google Cloud project, run gcloud auth application-default login and omit credentials_file; Rivet then uses Application Default Credentials (ADC).

Stdout (pipe to another tool):

destination:
  type: stdout

Writes file contents directly to stdout. Useful for piping into gzip, aws s3 cp -, or other streaming consumers. Only practical with a single export (multiple exports would interleave output).

Credential precedence

Rivet uses one predictable model for where secrets come from. Think of four layers (highest priority first). A higher layer wins when it applies; Rivet does not merge multiple cloud credential sources for the same destination.

Priority Layer Meaning
1 Config Fields in rivet.yaml (URLs, credentials_file, names of env vars for S3 keys).
2 Environment variables Process environment (DATABASE_URL via url_env, ${VAR} expansion in url, GOOGLE_APPLICATION_CREDENTIALS, standard AWS_* variables).
3 ADC / instance identity Provider default credentials with no explicit path in Rivet config (e.g. GCE/GKE metadata; local user ADC from gcloud auth application-default login).
4 File-based material Secret content read from disk when a path is chosen by config or environment (e.g. url_file, credentials_file, or the file pointed to by GOOGLE_APPLICATION_CREDENTIALS). This is not a separate "guess"; it is always wired through layer 1 or 2.

Database (PostgreSQL / MySQL)

Two mutually exclusive styles:

URL-based -- set exactly one of source.url, source.url_env, or source.url_file. There is no fallback between them.

Mechanism Resolution
url Connection string from config. Placeholders ${VAR} are expanded from the environment when the config file is loaded (missing variables become empty).
url_env The entire URL is read from the named environment variable.
url_file The entire URL is read from the file path given in config (trimmed).

Structured -- set host, user, database (and optionally port, password / password_env). Rivet builds the connection URL internally.

Cloud "ADC" does not apply to database URLs.

Google Cloud Storage (GCS)

Step Source
1 If destination.credentials_file is set -- use only that service account JSON path (config overrides env).
2 Else -- OpenDAL uses Google's default loader: GOOGLE_APPLICATION_CREDENTIALS (if set) -- JSON file at that path.
3 Else -- user ADC file from gcloud auth application-default login (well-known path under gcloud config).
4 Else -- GCE/GKE metadata-based service account when running on Google Cloud.

If you omit credentials_file, set RUST_LOG=info and look for a log line stating that the default Google credential chain is in use.

Amazon S3

Step Source
1 If both access_key_env and secret_key_env are set -- read access key and secret only from those variable names (error if unset).
2 If neither is set -- OpenDAL's default AWS chain: environment variables, shared config files (e.g. ~/.aws/credentials), then EC2/ECS instance metadata (IAM role).

Setting only one of access_key_env or secret_key_env is invalid and rejected at config validation.

Auth Diagnostics

rivet doctor verifies that source and destination credentials are valid before you run any exports:

$ rivet doctor --config rivet.yaml

rivet doctor: verifying auth for config 'rivet.yaml'

[OK]  Config parsed successfully
[OK]  Source auth (Postgres)
[OK]  Destination S3(my-bucket)
[FAIL] Destination GCS(other-bucket) -- auth error: loading credential ...

Some checks failed. Fix the issues above before running exports.

Error categories:

Category Meaning
auth error Credentials are missing, expired, or rejected
connectivity error Cannot reach the host (DNS, firewall, timeout)
bucket not found Bucket or path does not exist
error Other / uncategorized

Preflight Check

rivet check analyzes each export before running it. It connects to the source database, runs EXPLAIN on each query, and reports strategy, row estimates, verdicts, profile recommendations, and warnings:

$ rivet check --config rivet.yaml

Export: orders_incremental
  Strategy:     incremental(updated_at)
  Mode:         incremental (cursor: updated_at)
  Row estimate: ~1M
  Cursor range: 2024-01-01 .. 2025-01-30
  Scan type:    Index Scan using idx_orders_updated_at
  Verdict:      EFFICIENT
  Recommended:  tuning.profile: fast

Export: events_full
  Strategy:     full-scan
  Mode:         full
  Row estimate: ~5M
  Scan type:    Seq Scan on events
  Verdict:      DEGRADED
  Recommended:  tuning.profile: safe
  Suggestion:   No index detected -- full table scan. Add an indexed cursor
                column and switch to incremental mode. Use 'safe' tuning
                profile to limit database impact.

Export: orders_chunked
  Strategy:     chunked-parallel(id, size=100000, p=4)
  Mode:         chunked (column: id, size: 100000)
  Row estimate: ~10M
  Cursor range: 1 .. 50000000
  Scan type:    Index Scan using orders_pkey
  Verdict:      ACCEPTABLE
  Recommended:  tuning.profile: safe
  Warning:      Sparse key range: ~99% of chunk windows will be empty ...
  Suggestion:   Large dataset (~10M rows). Add parallel > 1 to speed up ...

Strategy Names

Strategy When
full-scan mode: full, parallel=1
full-parallel(N) mode: full, parallel > 1
incremental(col) mode: incremental
chunked(col, size=N) mode: chunked, parallel=1
chunked-parallel(col, size=N, p=P) mode: chunked, parallel > 1
time-window(col, Nd) mode: time_window

Profile Recommendation

rivet check recommends a tuning profile based on row estimate and index usage:

Condition Recommendation
Indexed, < 1M rows fast
Indexed, 1M-10M rows balanced
Indexed, > 10M rows safe
No index, < 100K rows fast (or balanced with parallel)
No index, 100K-1M rows balanced
No index, > 1M rows safe

Warnings

Warning Trigger
Sparse key range Chunked mode with < 10% density (range >> row count)
Dense surrogate sort cost Query uses ROW_NUMBER() in chunked mode
Parallel memory risk parallel > 1 on > 5M rows

Verdicts

Verdict Meaning
EFFICIENT Index scan on cursor column, reasonable row count (< 10M)
ACCEPTABLE Index scan but very large dataset, or partial index coverage
DEGRADED Full table scan detected, but row count is manageable
UNSAFE Full scan on very large table (> 50M rows) without index support

Suggestions are mode-aware: full exports recommend switching to incremental, chunked exports recommend indexing the chunk column, time-window exports recommend indexing the time column.

Incremental Mode

When mode: incremental is set, rivet:

  1. Reads the last exported cursor value from its SQLite state database
  2. Appends WHERE <cursor_column> > '<last_value>' to the query
  3. After a successful export, updates the cursor to the last row's value

The state database (.rivet_state.db) is created next to your config file.

Chunked Extraction

Rivet never loads an entire table into memory with a single query. Instead:

  • PostgreSQL: Uses server-side cursors (DECLARE CURSOR / FETCH N) to read batch_size rows at a time
  • MySQL: Uses streaming result sets (query_iter()) to read rows incrementally

Between each batch, rivet sleeps for throttle_ms milliseconds, giving the database breathing room.

Sparse IDs (gaps in the key range)

Chunked mode uses MIN(chunk_column) and MAX(chunk_column) from your export query, then issues WHERE chunk_column BETWEEN start AND end for each window. If the primary key is sparse (huge spread between min and max, few rows), most windows cover no rows but the database still plans and scans for each range.

Mitigation: chunk on a dense surrogate computed in SQL, for example ROW_NUMBER() OVER (ORDER BY id) AS chunk_rownum, and set chunk_column: chunk_rownum in the export. Then min/max match the row count, not the physical id span. A commented PostgreSQL example lives at tests/fixtures/migrations/001_sparse_chunk_column_example.sql.

Cost tradeoff: ORDER BY id (and therefore that window) is not free. The planner usually needs a global ordering of the rows you export: often a sort over the whole result, or an index scan on id if the shape of the query allows it -- either way you pay once per export pass, and under concurrent writes the ordering is tied to a snapshot. You are trading many cheap-but-useless BETWEEN probes on a sparse key for fewer chunk queries that each touch real rows, at the price of establishing dense row numbers. For very large or hot tables, prefer incremental mode on an indexed cursor column, a precomputed dense key (column or side table populated by batch jobs), or a materialized view refreshed off the critical path, if that fits your workload better than a window over live data.

Run Summary

After each export, Rivet prints a structured summary to stdout:

── orders ──
  run_id:      orders_20260329T125109.336
  status:      success
  rows:        150000
  files:       1
  bytes:       12.4 MB
  duration:    3.2s
  peak RSS:    142MB
  validated:   pass
  schema:      unchanged

All summary fields are also persisted to the metrics table and visible via rivet metrics. The run_id links the summary to the corresponding rows in export_metrics and file_manifest tables.

Field Description
run_id Canonical identifier for this run (links summary, metrics, and files)
status success or failed
rows Total rows extracted
files Number of files produced (1 for single-file modes; N for chunked)
bytes Total file size before upload
duration Wall-clock time for the export
peak RSS Peak process RSS during the export (MB)
retries Number of retry attempts (0 if no retries needed)
validated pass if --validate succeeded; omitted if not requested
schema unchanged or CHANGED; omitted on first run
error Error message (only on failure)

File manifest

Every file produced by Rivet is recorded in the file_manifest table. Use rivet state files to inspect:

$ rivet state files --config rivet.yaml
RUN ID                              FILE                                         ROWS      BYTES CREATED
--------------------------------------------------------------------------------------------------------------
orders_20260329T125143.912          orders_20260329_125200_chunk3.parquet        50000    17.4 MB 2026-03-29T12:52:00+00:00
orders_20260329T125143.912          orders_20260329_125156_chunk2.parquet        50000    17.4 MB 2026-03-29T12:51:56+00:00

This enables post-run reconciliation: verify which run created which files and confirm row counts match expectations.

Execution Semantics

Export lifecycle

Every export follows a strict sequence. Steps that fail cause the entire export to fail; state is never updated on failure.

1. Config load + validation
2. State read (load cursor for incremental; load schema for tracking)
3. Source connect (new connection per attempt)
4. Query start
   - full/incremental: single query
   - chunked: detect min/max, generate range queries
   - time_window: rewrite query with WHERE clause
5. Batch loop
   a. FETCH batch_size rows → Arrow RecordBatch
   b. FormatWriter.write_batch() → temp file (flush per batch)
   c. Sleep throttle_ms
   d. Repeat until source exhausted
6. FormatWriter.finish() → finalize temp file
7. Validate (if --validate): read back temp file, compare row count
8. Destination.write() → upload temp file to local/S3/GCS
9. State update (incremental only): advance cursor to last row's value
10. Schema tracking: compare columns with stored schema, warn on change
11. Metrics: record run result (duration, rows, RSS, status)

State update point

The cursor advances only after step 8 (successful upload). If any step fails, the cursor stays at its previous value. This means:

  • A failed export can be safely re-run without skipping data.
  • A successful upload followed by a process crash before step 9 causes the next run to re-export rows already uploaded (at-least-once semantics -- see Duplicates below).

Duplicates

Rivet provides at-least-once delivery. Duplicates can occur in these scenarios:

Scenario Cause Mitigation
Crash after upload, before cursor update Cursor is not advanced; next run re-exports the same window Downstream dedup on primary key + cursor column
time_window with overlapping windows Rows near the boundary appear in consecutive windows Downstream dedup or idempotent merge
incremental with non-monotonic cursor Rows inserted with cursor values older than the last exported value are missed; rows updated after export may be re-exported Use a strictly monotonic column (e.g. auto-increment id, updated_at with triggers)
chunked with concurrent writes New rows inserted during export may land in already-processed ranges Accept overlap or run during quiescent periods

Rivet never claims exactly-once delivery. Design downstream pipelines to tolerate duplicates.

Retry semantics

On failure, Rivet classifies the error and decides whether to retry:

Category Retry? Reconnect? Extra delay Examples
Network yes yes -- connection reset, broken pipe, DNS, SSL, EOF
MySQL disconnect yes yes -- server gone away, lost connection
Timeout yes no -- statement timeout, lock wait timeout
Capacity yes yes +15s too many connections, DB starting/shutting down
Deadlock yes no +1s deadlock detected, serialization failure
Auth/permission no -- -- permission denied, access denied, invalid credentials
Permanent no -- -- syntax error, table not found, column not found

On each retry, a fresh connection is created (never reuses a failed connection). Backoff is exponential: retry_backoff_ms * 2^(attempt-1) + extra_delay. The tuning profile controls max_retries and retry_backoff_ms.

Validation semantics

--validate re-reads the temp file after writing and compares the row count against the number of rows received from the source:

  • Parquet: opens the file with the Arrow reader and reads num_rows from footer metadata.
  • CSV: counts newlines (excluding header).

What it proves: the file on disk contains the expected number of rows (catches truncated writes, corrupt footers, I/O errors during flush).

What it does not prove:

  • Cell-level correctness (no checksum on individual values).
  • Source-to-file semantic equivalence (no re-query of the database to compare).
  • Post-upload integrity (the file is validated before upload, not after).

Supported Type Mappings

PostgreSQL MySQL Arrow / Parquet
BOOL BIT Boolean
INT2 / SMALLINT TINYINT, SMALLINT Int16
INT4 / INT INT, MEDIUMINT Int32
INT8 / BIGINT BIGINT Int64
FLOAT4 FLOAT Float32
FLOAT8 DOUBLE Float64
TEXT, VARCHAR VARCHAR, TEXT Utf8 (String)
BYTEA BLOB (binary charset) Binary
DATE DATE Date32
TIMESTAMP(TZ) DATETIME, TIMESTAMP Timestamp(us)
NUMERIC DECIMAL Utf8 (stringified)
JSON / JSONB JSON Utf8
UUID -- Utf8

Guarantees and Limitations

What Rivet guarantees

  • At-least-once delivery: if an export succeeds, all rows matching the query are written to at least one output file.
  • State atomicity per export: cursor state is updated only after successful upload. A crash mid-export does not advance the cursor.
  • Schema change detection: Rivet warns when columns are added, removed, or change type between runs.
  • Validation on demand: --validate confirms row counts match between source read and file on disk.
  • Predictable auth: credentials are resolved in a documented 4-layer order; no silent fallback surprises.

What Rivet does NOT guarantee

  • No exactly-once delivery: duplicates can occur on crash recovery, overlapping windows, or non-monotonic cursors.
  • No cell-level validation: --validate checks row count, not individual cell values or checksums.
  • No CDC / real-time: Rivet runs point-in-time queries; it does not read WAL, binlog, or change streams.
  • No load / merge: Rivet produces files. Loading them into a warehouse is your responsibility.
  • No distributed execution: Rivet runs on a single machine. parallel spawns threads, not remote workers.
  • No transactional consistency across exports: each export runs its own query; there is no cross-export snapshot isolation.
  • No encryption: output files are written in plaintext. Encrypt at the destination level if needed.

See Execution Semantics for detailed lifecycle, state update, duplicate, retry, and validation rules.

Development

For pilot documentation (per-mode guides, destination setup, annotated YAML examples), see docs/.

For a step-by-step onboarding guide (from installation to production-ready exports), see USER_GUIDE.md.

For a manual user acceptance checklist (CLI, modes, destinations, compression, skip-empty), see USER_TEST_PLAN.md.

Local Setup

Start PostgreSQL and MySQL with Docker:

docker compose up -d

Seed both databases with test data (100K users, ~1M orders, ~5M events):

cargo run --release --bin seed -- --target both --users 100000

The seed tool supports flags:

--target postgres|mysql|both    # which database to seed
--users N                       # number of users (default: 100000)
--orders-per-user N             # avg orders per user (default: 10)
--events-per-user N             # avg events per user (default: 50)
--batch-size N                  # insert batch size (default: 1000)
--pg-url URL                    # PostgreSQL connection URL
--mysql-url URL                 # MySQL connection URL

Toolchain

The project pins Rust 1.94 via rust-toolchain.toml. Install with:

rustup install 1.94

Running Tests

cargo test              # 617 unit + integration tests (no database needed)
cargo test -- --nocapture  # with output
cargo clippy --all-targets -- -D warnings  # lint check
cargo fmt --all -- --check                 # format check

End-to-end scripts (Docker Compose must be up, rivet built):

bash dev/test_permissions.sh
bash dev/test_schema_evolution.sh

CI

GitHub Actions runs on every push/PR to master/main:

  • Rustfmt — formatting check
  • Clippy — lint check with -D warnings
  • Tests — full test suite
  • Release build — ensures cargo build --release succeeds
  • Security auditcargo audit via rustsec/audit-check

Roadmap

See rivet_roadmap.md for the full roadmap (strategy + execution status).

Next milestones:

Milestone Focus
v0.2.0 (stable) Cross-platform release binaries, E2E test matrix, cargo publish, Docker image
v0.3.0 Source count reconciliation, crash/recovery tests, data shape drift detection, curated example configs
Future CDC mode, Iceberg/Delta output, webhook destination, multi-source joins, plugin system