Rivet
Lightweight, source-safe data extraction from PostgreSQL and MySQL to Parquet/CSV.
Rivet is a CLI tool that exports query results from relational databases to files -- locally or in cloud storage (S3, GCS). It is extract-only: no loading, no merging, no CDC. It is designed to be gentle on production databases through tuning profiles, preflight health checks, and intelligent retry with backoff.
What Rivet does
- Extracts data from PostgreSQL and MySQL via standard SQL queries
- Writes Parquet (zstd-compressed by default; snappy, gzip, lz4, none) or CSV files
- Uploads to local disk, Amazon S3, Google Cloud Storage, or stdout (pipe workflows)
- Tracks incremental state in SQLite so the next run picks up where the last left off
- Diagnoses source health before extraction (`rivet check`)
- Verifies auth for all sources and destinations before running (`rivet doctor`)
- Prints a structured run summary after each export (run ID, rows, files, bytes, duration, RSS, retries, schema changes)
- Persists metrics history, schema tracking, and file manifest in SQLite
- Recommends parallelism level and tuning profile in preflight checks
- Parameterized queries via `--param key=value` and `${key}` placeholders
- Data quality checks -- row count bounds, null ratio thresholds, uniqueness assertions
- File size splitting -- `max_file_size: 512MB` automatically splits output into parts
- Memory-based batch sizing -- `batch_size_memory_mb: 256` auto-tunes batch size from schema width
- Slack notifications on failure, schema change, or degraded verdict
What Rivet does NOT do
- No loading/merging -- it produces files; you bring them into a warehouse yourself
- No CDC -- no WAL/binlog reading; query-based extraction only
- No orchestration -- no built-in scheduler; use cron, Airflow, or similar
- No exactly-once delivery -- at-least-once; duplicates are possible (see Execution Semantics)
- No web UI / API -- CLI and YAML config only
Documentation language: English-only. See CONTRIBUTING.md.
New to Rivet? Start with the Pilot Documentation — step-by-step guides for every export mode, destination, and YAML parameter, plus quickstart templates for your first export.
Installation
Homebrew (macOS / Linux) — recommended
Pre-built binaries
Download the latest release for your platform from GitHub Releases:
# macOS (Apple Silicon)
|
# macOS (Intel)
|
# Linux (x86_64)
|
# Linux (arm64)
|
Verify:
Docker
Try Rivet without installing anything — mount your config and output directory:
Or check the version and explore commands:
Pass environment variables for credentials:
Note: To connect to a database running on your host machine, use `host.docker.internal` instead of `localhost` in the connection URL.
Build from source
Requires Rust 1.94+:
# binary is at target/release/rivet
Quick Start
- Create a config file `rivet.yaml`:
source:
  type: postgres
  url: "postgresql://user:pass@localhost:5432/mydb"
  tuning:
    profile: safe
exports:
  - name: users
    query: "SELECT id, name, email, updated_at FROM users"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    destination:
      type: local
      path: ./output
- Run preflight check to diagnose source health:
- Verify auth for source and all destinations:
- Run the export:
RUST_LOG=info
- Check state:
Working with the binary
Once installed, rivet is a single self-contained binary with no runtime dependencies (no JVM, no Python, no Docker required).
Typical workflow:
# 1. Preflight: check that the source DB is reachable and healthy
# 2. Auth: verify credentials for source + all destinations (S3, GCS, etc.)
# 3. Export: run all exports defined in the config
RUST_LOG=info
# 4. Inspect: view cursor state and file manifest
# 5. Re-run: only new/changed rows are exported (incremental mode)
RUST_LOG=info
Useful flags:
Logging:
Rivet uses RUST_LOG for verbosity:
RUST_LOG=debug RUST_LOG=info RUST_LOG=warn
Shell completions:
# Bash
# Zsh
# Fish
CLI Reference
rivet run --config <path> # run all exports
rivet run --config <path> --export <name> # run a specific export
rivet run --config <path> --validate # verify row counts after write
rivet check --config <path> # preflight check all exports
rivet check --config <path> --export <name> # preflight check one export
rivet doctor --config <path> # verify source + destination auth
rivet state show --config <path> # show cursor state
rivet state reset --config <path> --export <name> # reset cursor
rivet state files --config <path> # show file manifest (which run created which files)
rivet metrics --config <path> # show export run history
rivet metrics --config <path> --export <name> # metrics for one export
rivet metrics --config <path> --last N # last N runs (default 20)
rivet completions <shell> # generate shell completions (bash|zsh|fish|powershell)
Shell completions:
# zsh (add to ~/.zshrc)
# bash
# fish
Set RUST_LOG=info (or debug) for detailed logging:
RUST_LOG=info
Choosing a Mode
| Mode | Best for | Key behavior |
|---|---|---|
| `full` | Small tables, snapshots, one-off exports | Exports entire query result every run |
| `incremental` | Append-only or update-tracked tables | Resumes from the last exported value of cursor_column |
| `chunked` | Very large tables (10M+ rows) | Splits into ID-range windows; supports parallel > 1 for concurrent extraction |
| `time_window` | Event logs, append-mostly data with timestamps | Exports only the last N days based on a time/date column |
Decision rules:
- Table < 1M rows, full snapshot needed -- use `full`.
- Table has a monotonically increasing column (auto-increment id, `updated_at` with triggers) -- use `incremental`. This is the most efficient mode for repeated runs.
- Table is very large and you need parallel extraction -- use `chunked` with `parallel > 1`. Set `chunk_column` to the primary key. Watch out for sparse IDs (see Sparse IDs).
- You only need recent data (e.g. last 7 days of events) -- use `time_window`. Set `time_column` and `days_window`.
Can I combine modes? No. Each export uses exactly one mode. If you need both incremental tracking and chunked extraction, use incremental for ongoing syncs and chunked for backfills.
Choosing a Profile
| Profile | Source environment | Behavior |
|---|---|---|
| `fast` | Dedicated replica, data warehouse, trusted environment | Large batches, no throttle, no timeouts, minimal retries |
| `balanced` | General-purpose source, moderate concurrent load | 10K batch, 50ms throttle, 5-min statement timeout, 3 retries |
| `safe` | Production OLTP, shared resources, fragile source | Small batches, 500ms throttle, 2-min timeout, 5 retries with long backoff |
Decision rules:
- Dedicated read replica or analytics database -- `fast`. You own the capacity.
- Production database with other workloads -- `balanced`. Good default.
- Production OLTP under high load, or a database you don't fully control -- `safe`. Rivet backs off aggressively and retries patiently.
You can always override individual fields (e.g. profile: safe with batch_size: 5000).
Config Reference
Source
Two mutually exclusive styles for specifying database credentials:
URL-based -- set exactly one of url, url_env, or url_file:
source:
  type: postgres          # postgres | mysql
  url: "postgresql://user:pass@host:port/db"
  tuning:                 # optional, defaults to balanced
    profile: safe         # safe | balanced | fast

source:
  type: mysql
  url_env: DATABASE_URL   # read full URL from this env var
Structured -- specify individual connection fields:
source:
  type: postgres
  host: db.example.com
  port: 5433               # optional; defaults to 5432 (PG) / 3306 (MySQL)
  user: admin
  password_env: DB_PASS    # reads password from env var; or use 'password: literal'
  database: mydb
| Field | Required | Notes |
|---|---|---|
| `host` | yes | |
| `user` | yes | |
| `database` | yes | |
| `port` | no | defaults to 5432 (postgres) / 3306 (mysql) |
| `password` | no | plaintext; prefer password_env |
| `password_env` | no | env var name containing the password |
URL-based and structured fields cannot be mixed. If both are present, validation rejects the config with a clear error.
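The mutual-exclusivity rule can be sketched as a small check. This is an illustrative Python sketch, not Rivet's own validation code: the function name `source_style` and the error messages are hypothetical, while the field names are the ones documented in this section.

```python
URL_FIELDS = ("url", "url_env", "url_file")
STRUCTURED_FIELDS = ("host", "user", "database", "port", "password", "password_env")
REQUIRED_STRUCTURED = ("host", "user", "database")


def source_style(source: dict) -> str:
    """Return "url" or "structured"; raise ValueError for an invalid combination.

    Hypothetical helper that mirrors the documented rules only.
    """
    url_set = [f for f in URL_FIELDS if f in source]
    structured_set = [f for f in STRUCTURED_FIELDS if f in source]

    # Rule 1: the two styles cannot be mixed.
    if url_set and structured_set:
        raise ValueError(
            f"cannot mix URL fields {url_set} with structured fields {structured_set}"
        )
    # Rule 2: URL style needs exactly one of url / url_env / url_file.
    if url_set:
        if len(url_set) != 1:
            raise ValueError(f"set exactly one of {list(URL_FIELDS)}, got {url_set}")
        return "url"
    # Rule 3: structured style needs host, user, and database.
    missing = [f for f in REQUIRED_STRUCTURED if f not in source]
    if missing:
        raise ValueError(f"structured source is missing required fields: {missing}")
    return "structured"


print(source_style({"type": "mysql", "url_env": "DATABASE_URL"}))
print(source_style({"type": "postgres", "host": "db.example.com",
                    "user": "admin", "database": "mydb"}))
```

A config that sets both `url` and `host`, or both `url` and `url_env`, raises `ValueError` in this sketch.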
Source Tuning
Controls how aggressively rivet reads from the database. Three named profiles with individual field overrides:
source:
  type: postgres
  url: "..."
  tuning:
    profile: safe             # base profile
    batch_size: 3000          # override: rows per fetch
    throttle_ms: 300          # override: sleep between fetches
    statement_timeout_s: 60   # override: per-query timeout
Profile Defaults
| Parameter | `fast` | `balanced` (default) | `safe` |
|---|---|---|---|
| `batch_size` | 50,000 | 10,000 | 2,000 |
| `throttle_ms` | 0 | 50 | 500 |
| `statement_timeout_s` | 0 (none) | 300 | 120 |
| `max_retries` | 1 | 3 | 5 |
| `retry_backoff_ms` | 1,000 | 2,000 | 5,000 |
| `lock_timeout_s` | 0 (none) | 30 | 10 |
When to use each profile:
- fast -- trusted environment, dedicated replica, need maximum throughput
- balanced -- general purpose, moderate load on source
- safe -- production OLTP database, shared resources, fragile source
If no tuning section is specified, balanced is used.
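To make the "base profile plus field overrides" behavior concrete, here is a minimal Python sketch. The default values are copied from the Profile Defaults table; the function name `resolve_tuning` and the merge logic are assumptions made for illustration, not Rivet's implementation.

```python
from typing import Optional

# Values copied from the Profile Defaults table in this section.
PROFILE_DEFAULTS = {
    "fast": {"batch_size": 50_000, "throttle_ms": 0, "statement_timeout_s": 0,
             "max_retries": 1, "retry_backoff_ms": 1_000, "lock_timeout_s": 0},
    "balanced": {"batch_size": 10_000, "throttle_ms": 50, "statement_timeout_s": 300,
                 "max_retries": 3, "retry_backoff_ms": 2_000, "lock_timeout_s": 30},
    "safe": {"batch_size": 2_000, "throttle_ms": 500, "statement_timeout_s": 120,
             "max_retries": 5, "retry_backoff_ms": 5_000, "lock_timeout_s": 10},
}


def resolve_tuning(tuning: Optional[dict] = None) -> dict:
    """Start from the named profile (balanced if absent), then apply overrides."""
    overrides = dict(tuning or {})
    profile = overrides.pop("profile", "balanced")
    if profile not in PROFILE_DEFAULTS:
        raise ValueError(f"unknown profile: {profile!r}")
    return {**PROFILE_DEFAULTS[profile], **overrides}


# The tuning block shown earlier in this section: safe profile with three overrides.
effective = resolve_tuning(
    {"profile": "safe", "batch_size": 3000, "throttle_ms": 300, "statement_timeout_s": 60}
)
print(effective)
```

In this sketch the overridden fields take the configured values, while `max_retries`, `retry_backoff_ms`, and `lock_timeout_s` keep the `safe` defaults.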
Exports
Each export defines a query, format, mode, and destination:
exports:
  - name: my_export            # unique name, used for state tracking
    query: "SELECT ..."        # SQL query to execute
    mode: full                 # full | incremental | chunked | time_window
    cursor_column: updated_at  # required for incremental mode
    format: parquet            # parquet | csv
    destination:
      type: local              # local | s3 | gcs | stdout
      path: ./output           # local: output directory
Meta Columns
Add metadata columns to every output row -- useful for deduplication and lineage on the raw/staging layer.
exports:
  - name: page_views
    query: "SELECT * FROM page_views"
    format: parquet
    meta_columns:
      exported_at: true    # adds _rivet_exported_at (UTC timestamp)
      row_hash: true       # adds _rivet_row_hash (Int64, lower 64 bits of xxHash3-128)
    destination:
      type: gcs
      bucket: my-bucket
| Column | Type | Description |
|---|---|---|
| `_rivet_exported_at` | `Timestamp(us, UTC)` | When the batch was exported (same value for all rows in a batch) |
| `_rivet_row_hash` | `Int64` | Lower 64 bits of xxHash3-128 over all column values. Integer for fast PARTITION BY / JOIN. |
Dedup pattern (e.g. in BigQuery / DuckDB):
SELECT * FROM raw_page_views
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY _rivet_row_hash
  ORDER BY _rivet_exported_at DESC
) = 1
Both fields are optional and default to false. When disabled, no extra columns are added.
Compression
Parquet compression is configurable per export. Default: zstd (better compression ratio than Snappy at comparable speed).
exports:
  - name: orders
    query: "SELECT * FROM orders"
    format: parquet
    compression: zstd        # zstd | snappy | gzip | lz4 | none
    compression_level: 9     # optional; zstd 1..22 (default 3), gzip 0..10 (default 6)
    destination:
      type: local
      path: ./output
| Codec | Default level | Notes |
|---|---|---|
| `zstd` | 3 | Best ratio/speed tradeoff; new default |
| `snappy` | — | Fast, modest compression; previous default |
| `gzip` | 6 | Wide compatibility |
| `lz4` | — | Very fast decompression |
| `none` | — | No compression; largest files |
CSV exports ignore the compression setting.
Skip Empty Exports
When running scheduled/incremental exports, zero new rows often means nothing changed. Use skip_empty to avoid creating empty files:
exports:
  - name: events_inc
    query: "SELECT * FROM events"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    skip_empty: true         # no file created when 0 rows; cursor not advanced
    destination:
      type: gcs
      bucket: my-bucket
When skip_empty: true and the query returns 0 rows:
- No output file is created or uploaded
- Cursor state is not advanced (safe to rerun)
- Run summary shows `status: skipped`
Default: false (current behavior; 0-row exports still succeed with no file output).
Destinations
Local filesystem:
destination:
  type: local
  path: ./output
Amazon S3:
destination:
  type: s3
  bucket: my-bucket
  prefix: exports/data/
  region: us-east-1
  endpoint: https://...      # optional, for S3-compatible storage
Credentials: either omit key env fields and use the default AWS chain, or set both access_key_env and secret_key_env. Details: Credential precedence.
Google Cloud Storage:
destination:
  type: gcs
  bucket: my-bucket
  prefix: exports/data/
  endpoint: https://...                # optional
  credentials_file: /path/to/sa.json   # optional; omit to use ADC / env (see below)
GCS -- credentials: see Credential precedence. For day-to-day use on a workstation with a Google Cloud project, run gcloud auth application-default login and omit credentials_file; Rivet then uses Application Default Credentials (ADC).
Stdout (pipe to another tool):
destination:
  type: stdout
Writes file contents directly to stdout. Useful for piping into gzip, aws s3 cp -, or other streaming consumers. Only practical with a single export (multiple exports would interleave output).
Credential precedence
Rivet uses one predictable model for where secrets come from. Think of four layers (highest priority first). A higher layer wins when it applies; Rivet does not merge multiple cloud credential sources for the same destination.
| Priority | Layer | Meaning |
|---|---|---|
| 1 | Config | Fields in rivet.yaml (URLs, credentials_file, names of env vars for S3 keys). |
| 2 | Environment variables | Process environment (DATABASE_URL via url_env, ${VAR} expansion in url, GOOGLE_APPLICATION_CREDENTIALS, standard AWS_* variables). |
| 3 | ADC / instance identity | Provider default credentials with no explicit path in Rivet config (e.g. GCE/GKE metadata; local user ADC from gcloud auth application-default login). |
| 4 | File-based material | Secret content read from disk when a path is chosen by config or environment (e.g. url_file, credentials_file, or the file pointed to by GOOGLE_APPLICATION_CREDENTIALS). This is not a separate "guess"; it is always wired through layer 1 or 2. |
Database (PostgreSQL / MySQL)
Two mutually exclusive styles:
URL-based -- set exactly one of source.url, source.url_env, or source.url_file. There is no fallback between them.
| Mechanism | Resolution |
|---|---|
| `url` | Connection string from config. Placeholders `${VAR}` are expanded from the environment when the config file is loaded (missing variables become empty). |
| `url_env` | The entire URL is read from the named environment variable. |
| `url_file` | The entire URL is read from the file path given in config (trimmed). |
Structured -- set host, user, database (and optionally port, password / password_env). Rivet builds the connection URL internally.
Cloud "ADC" does not apply to database URLs.
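The `${VAR}` expansion rule for `url`, including the "missing variables become empty" behavior, can be illustrated with a short Python sketch. The function name `expand_env` and the exact placeholder grammar (letters, digits, underscore) are assumptions for this example; Rivet's parser may accept a different set of names.

```python
import re

# Assumed placeholder grammar: ${NAME} with a conventional env var name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text: str, env: dict) -> str:
    """Replace each ${NAME} with env[NAME]; a missing variable becomes ""."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), ""), text)


url = "postgresql://app:${DB_PASS}@db.example.com:5432/mydb"
print(expand_env(url, {"DB_PASS": "s3cret"}))
# postgresql://app:s3cret@db.example.com:5432/mydb
print(expand_env(url, {}))
# postgresql://app:@db.example.com:5432/mydb
```

The second call shows why an unset variable is worth catching early: the URL is still well-formed, but the password is empty, so the failure surfaces later as an auth error rather than a config error.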
Google Cloud Storage (GCS)
| Step | Source |
|---|---|
| 1 | If destination.credentials_file is set -- use only that service account JSON path (config overrides env). |
| 2 | Else -- OpenDAL uses Google's default loader: GOOGLE_APPLICATION_CREDENTIALS (if set) -- JSON file at that path. |
| 3 | Else -- user ADC file from gcloud auth application-default login (well-known path under gcloud config). |
| 4 | Else -- GCE/GKE metadata-based service account when running on Google Cloud. |
If you omit credentials_file, set RUST_LOG=info and look for a log line stating that the default Google credential chain is in use.
Amazon S3
| Step | Source |
|---|---|
| 1 | If both access_key_env and secret_key_env are set -- read access key and secret only from those variable names (error if unset). |
| 2 | If neither is set -- OpenDAL's default AWS chain: environment variables, shared config files (e.g. ~/.aws/credentials), then EC2/ECS instance metadata (IAM role). |
Setting only one of access_key_env or secret_key_env is invalid and rejected at config validation.
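The both-or-neither rule for S3 keys can be sketched as follows. This is a hypothetical Python helper written for illustration (the name `s3_credential_mode` and the return values are not part of Rivet); it only encodes the two documented steps and the validation rule.

```python
def s3_credential_mode(destination: dict, env: dict) -> str:
    """Return "explicit" or "default_chain" following the documented S3 rules."""
    access = destination.get("access_key_env")
    secret = destination.get("secret_key_env")

    # Setting only one of the two fields is invalid.
    if (access is None) != (secret is None):
        raise ValueError("set both access_key_env and secret_key_env, or neither")

    # Neither set: fall through to the provider's default AWS chain.
    if access is None:
        return "default_chain"

    # Both set: the named variables must exist ("error if unset").
    missing = [name for name in (access, secret) if name not in env]
    if missing:
        raise KeyError(f"environment variables not set: {missing}")
    return "explicit"


dest = {"type": "s3", "bucket": "my-bucket",
        "access_key_env": "RIVET_S3_KEY", "secret_key_env": "RIVET_S3_SECRET"}
print(s3_credential_mode(dest, {"RIVET_S3_KEY": "k", "RIVET_S3_SECRET": "s"}))
print(s3_credential_mode({"type": "s3", "bucket": "my-bucket"}, {}))
```

The variable names `RIVET_S3_KEY` and `RIVET_S3_SECRET` are placeholders; any names can be used as long as both fields are set together.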
Auth Diagnostics
rivet doctor verifies that source and destination credentials are valid before you run any exports:
$ rivet doctor --config rivet.yaml
rivet doctor: verifying auth for config 'rivet.yaml'
[OK] Config parsed successfully
[OK] Source auth (Postgres)
[OK] Destination S3(my-bucket)
[FAIL] Destination GCS(other-bucket) -- auth error: loading credential ...
Some checks failed. Fix the issues above before running exports.
Error categories:
| Category | Meaning |
|---|---|
| `auth error` | Credentials are missing, expired, or rejected |
| `connectivity error` | Cannot reach the host (DNS, firewall, timeout) |
| `bucket not found` | Bucket or path does not exist |
| `error` | Other / uncategorized |
Preflight Check
rivet check analyzes each export before running it. It connects to the source database, runs EXPLAIN on each query, and reports strategy, row estimates, verdicts, profile recommendations, and warnings:
$ rivet check --config rivet.yaml
Export: orders_incremental
Strategy: incremental(updated_at)
Mode: incremental (cursor: updated_at)
Row estimate: ~1M
Cursor range: 2024-01-01 .. 2025-01-30
Scan type: Index Scan using idx_orders_updated_at
Verdict: EFFICIENT
Recommended: tuning.profile: fast
Export: events_full
Strategy: full-scan
Mode: full
Row estimate: ~5M
Scan type: Seq Scan on events
Verdict: DEGRADED
Recommended: tuning.profile: safe
Suggestion: No index detected -- full table scan. Add an indexed cursor
column and switch to incremental mode. Use 'safe' tuning
profile to limit database impact.
Export: orders_chunked
Strategy: chunked-parallel(id, size=100000, p=4)
Mode: chunked (column: id, size: 100000)
Row estimate: ~10M
Cursor range: 1 .. 50000000
Scan type: Index Scan using orders_pkey
Verdict: ACCEPTABLE
Recommended: tuning.profile: safe
Warning: Sparse key range: ~99% of chunk windows will be empty ...
Suggestion: Large dataset (~10M rows). Add parallel > 1 to speed up ...
Strategy Names
| Strategy | When |
|---|---|
| `full-scan` | `mode: full`, `parallel=1` |
| `full-parallel(N)` | `mode: full`, `parallel > 1` |
| `incremental(col)` | `mode: incremental` |
| `chunked(col, size=N)` | `mode: chunked`, `parallel=1` |
| `chunked-parallel(col, size=N, p=P)` | `mode: chunked`, `parallel > 1` |
| `time-window(col, Nd)` | `mode: time_window` |
Profile Recommendation
rivet check recommends a tuning profile based on row estimate and index usage:
| Condition | Recommendation |
|---|---|
| Indexed, < 1M rows | fast |
| Indexed, 1M-10M rows | balanced |
| Indexed, > 10M rows | safe |
| No index, < 100K rows | fast (or balanced with parallel) |
| No index, 100K-1M rows | balanced |
| No index, > 1M rows | safe |
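Read as a decision function, the table looks roughly like the Python sketch below. The function name `recommend_profile` is hypothetical, and the handling of the exact boundary values (1M, 10M, 100K) is an assumption, since the table does not say which bucket they fall into; here they go to the middle bucket.

```python
def recommend_profile(row_estimate: int, uses_index: bool) -> str:
    """Map (row estimate, index usage) to a tuning profile per the table above.

    Boundary values are assigned to the middle bucket (an assumption).
    """
    if uses_index:
        if row_estimate < 1_000_000:
            return "fast"
        if row_estimate <= 10_000_000:
            return "balanced"
        return "safe"
    # No index: thresholds are an order of magnitude lower.
    if row_estimate < 100_000:
        return "fast"
    if row_estimate <= 1_000_000:
        return "balanced"
    return "safe"


for rows, indexed in [(500_000, True), (5_000_000, True), (50_000_000, True),
                      (50_000, False), (500_000, False), (5_000_000, False)]:
    print(f"{rows:>11,} rows, indexed={indexed}: {recommend_profile(rows, indexed)}")
```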
Warnings
| Warning | Trigger |
|---|---|
| Sparse key range | Chunked mode with < 10% density (range >> row count) |
| Dense surrogate sort cost | Query uses ROW_NUMBER() in chunked mode |
| Parallel memory risk | parallel > 1 on > 5M rows |
Verdicts
| Verdict | Meaning |
|---|---|
| `EFFICIENT` | Index scan on cursor column, reasonable row count (< 10M) |
| `ACCEPTABLE` | Index scan but very large dataset, or partial index coverage |
| `DEGRADED` | Full table scan detected, but row count is manageable |
| `UNSAFE` | Full scan on very large table (> 50M rows) without index support |
Suggestions are mode-aware: full exports recommend switching to incremental, chunked exports recommend indexing the chunk column, time-window exports recommend indexing the time column.
Incremental Mode
When mode: incremental is set, rivet:
- Reads the last exported cursor value from its SQLite state database
- Appends `WHERE <cursor_column> > '<last_value>'` to the query
- After a successful export, updates the cursor to the last row's value
The state database (.rivet_state.db) is created next to your config file.
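The three steps above can be sketched end to end in Python. This is an illustration under stated assumptions, not Rivet's code: an in-memory SQLite database stands in for the source, a plain dict stands in for the state database, and the sketch wraps the export query in a subquery with a bound parameter and an `ORDER BY` (Rivet's actual query rewriting may differ). The helper names `incremental_query` and `run_once` are hypothetical.

```python
import sqlite3


def incremental_query(base_query, cursor_column, last_value):
    """Build (sql, params) that selects only rows past the stored cursor."""
    sql = f"SELECT * FROM ({base_query}) AS src"
    params = ()
    if last_value is not None:
        sql += f" WHERE {cursor_column} > ?"
        params = (last_value,)
    # Ordering by the cursor makes "the last row's value" the maximum seen.
    return sql + f" ORDER BY {cursor_column}", params


def run_once(db, state, export_name, base_query, cursor_column):
    """Read new rows, then advance the cursor only after they were handled."""
    sql, params = incremental_query(base_query, cursor_column, state.get(export_name))
    cur = db.execute(sql, params)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    if rows:
        state[export_name] = rows[-1][columns.index(cursor_column)]
    return rows


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?, ?)",
               [(1, "ada", "2025-01-01T00:00:00"), (2, "bob", "2025-01-02T00:00:00")])

state = {}
query = "SELECT id, name, updated_at FROM users"
first = run_once(db, state, "users", query, "updated_at")    # both existing rows
db.execute("INSERT INTO users VALUES (3, 'cy', '2025-01-03T00:00:00')")
second = run_once(db, state, "users", query, "updated_at")   # only the new row
third = run_once(db, state, "users", query, "updated_at")    # nothing new
print(len(first), len(second), len(third), state)
```

Note the strict `>` comparison: a row written later with a cursor value equal to the stored one would not be picked up, which is one reason the cursor column should be strictly monotonic.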
Chunked Extraction
Rivet never loads an entire table into memory with a single query. Instead:
- PostgreSQL: Uses server-side cursors (`DECLARE CURSOR` / `FETCH N`) to read `batch_size` rows at a time
- MySQL: Uses streaming result sets (`query_iter()`) to read rows incrementally
Between each batch, rivet sleeps for throttle_ms milliseconds, giving the database breathing room.
Sparse IDs (gaps in the key range)
Chunked mode uses MIN(chunk_column) and MAX(chunk_column) from your export query, then issues WHERE chunk_column BETWEEN start AND end for each window. If the primary key is sparse (huge spread between min and max, few rows), most windows cover no rows but the database still plans and scans for each range.
Mitigation: chunk on a dense surrogate computed in SQL, for example ROW_NUMBER() OVER (ORDER BY id) AS chunk_rownum, and set chunk_column: chunk_rownum in the export. Then min/max match the row count, not the physical id span. A commented PostgreSQL example lives at tests/fixtures/migrations/001_sparse_chunk_column_example.sql.
Cost tradeoff: ORDER BY id (and therefore that window) is not free. The planner usually needs a global ordering of the rows you export: often a sort over the whole result, or an index scan on id if the shape of the query allows it -- either way you pay once per export pass, and under concurrent writes the ordering is tied to a snapshot. You are trading many cheap-but-useless BETWEEN probes on a sparse key for fewer chunk queries that each touch real rows, at the price of establishing dense row numbers. For very large or hot tables, prefer incremental mode on an indexed cursor column, a precomputed dense key (column or side table populated by batch jobs), or a materialized view refreshed off the critical path, if that fits your workload better than a window over live data.
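The sparse-key problem described in this section is easy to see with numbers. The Python sketch below generates inclusive `BETWEEN` windows from a min/max pair and counts how many of them contain any row. It is a simplified model: the helper names are hypothetical, and the assumption that windows are contiguous and non-overlapping is mine, not confirmed by the source.

```python
def chunk_windows(min_key, max_key, chunk_size):
    """Yield inclusive (start, end) ranges that cover [min_key, max_key]."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = min_key
    while start <= max_key:
        end = min(start + chunk_size - 1, max_key)
        yield start, end
        start = end + 1


def key_density(row_count, min_key, max_key):
    """Fraction of the key range that is actually populated."""
    return row_count / (max_key - min_key + 1)


def non_empty_windows(keys, windows):
    """Count windows that contain at least one key."""
    return sum(1 for start, end in windows if any(start <= k <= end for k in keys))


# 100 rows clustered at both ends of a 1..1,000,000 key range.
keys = list(range(1, 51)) + list(range(999_951, 1_000_001))
windows = list(chunk_windows(min(keys), max(keys), 100_000))

print(len(windows))                                   # 10
print(non_empty_windows(keys, windows))               # 2
print(key_density(len(keys), min(keys), max(keys)))   # 0.0001
for start, end in windows[:2]:
    print(f"WHERE id BETWEEN {start} AND {end}")
```

Eight of the ten range queries return nothing. With a dense surrogate such as `ROW_NUMBER()`, the same 100 rows would span keys 1..100 and fit in a single window.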
Run Summary
After each export, Rivet prints a structured summary to stdout:
── orders ──
run_id: orders_20260329T125109.336
status: success
rows: 150000
files: 1
bytes: 12.4 MB
duration: 3.2s
peak RSS: 142MB
validated: pass
schema: unchanged
All summary fields are also persisted to the metrics table and visible via rivet metrics. The run_id links the summary to the corresponding rows in export_metrics and file_manifest tables.
| Field | Description |
|---|---|
| `run_id` | Canonical identifier for this run (links summary, metrics, and files) |
| `status` | `success` or `failed`; `skipped` when skip_empty is enabled and the query returns 0 rows |
| `rows` | Total rows extracted |
| `files` | Number of files produced (1 for single-file modes; N for chunked) |
| `bytes` | Total file size before upload |
| `duration` | Wall-clock time for the export |
| `peak RSS` | Peak process RSS during the export (MB) |
| `retries` | Number of retry attempts (0 if no retries needed) |
| `validated` | `pass` if `--validate` succeeded; omitted if not requested |
| `schema` | `unchanged` or `CHANGED`; omitted on first run |
| `error` | Error message (only on failure) |
File manifest
Every file produced by Rivet is recorded in the file_manifest table. Use rivet state files to inspect:
$ rivet state files --config rivet.yaml
RUN ID FILE ROWS BYTES CREATED
--------------------------------------------------------------------------------------------------------------
orders_20260329T125143.912 orders_20260329_125200_chunk3.parquet 50000 17.4 MB 2026-03-29T12:52:00+00:00
orders_20260329T125143.912 orders_20260329_125156_chunk2.parquet 50000 17.4 MB 2026-03-29T12:51:56+00:00
This enables post-run reconciliation: verify which run created which files and confirm row counts match expectations.
Execution Semantics
Export lifecycle
Every export follows a strict sequence. Steps that fail cause the entire export to fail; state is never updated on failure.
1. Config load + validation
2. State read (load cursor for incremental; load schema for tracking)
3. Source connect (new connection per attempt)
4. Query start
   - full/incremental: single query
   - chunked: detect min/max, generate range queries
   - time_window: rewrite query with WHERE clause
5. Batch loop
   a. FETCH batch_size rows → Arrow RecordBatch
   b. FormatWriter.write_batch() → temp file (flush per batch)
   c. Sleep throttle_ms
   d. Repeat until source exhausted
6. FormatWriter.finish() → finalize temp file
7. Validate (if --validate): read back temp file, compare row count
8. Destination.write() → upload temp file to local/S3/GCS
9. State update (incremental only): advance cursor to last row's value
10. Schema tracking: compare columns with stored schema, warn on change
11. Metrics: record run result (duration, rows, RSS, status)
State update point
The cursor advances only after step 8 (successful upload). If any step fails, the cursor stays at its previous value. This means:
- A failed export can be safely re-run without skipping data.
- A successful upload followed by a process crash before step 9 causes the next run to re-export rows already uploaded (at-least-once semantics -- see Duplicates below).
Duplicates
Rivet provides at-least-once delivery. Duplicates can occur in these scenarios:
| Scenario | Cause | Mitigation |
|---|---|---|
| Crash after upload, before cursor update | Cursor is not advanced; next run re-exports the same window | Downstream dedup on primary key + cursor column |
| `time_window` with overlapping windows | Rows near the boundary appear in consecutive windows | Downstream dedup or idempotent merge |
| `incremental` with non-monotonic cursor | Rows inserted with cursor values older than the last exported value are missed; rows updated after export may be re-exported | Use a strictly monotonic column (e.g. auto-increment id, updated_at with triggers) |
| `chunked` with concurrent writes | New rows inserted during export may land in already-processed ranges | Accept overlap or run during quiescent periods |
Rivet never claims exactly-once delivery. Design downstream pipelines to tolerate duplicates.
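As one way to tolerate duplicates downstream, the "dedup on primary key + cursor column" mitigation can be sketched in Python: keep, for each primary key, the row with the greatest cursor value. The function name `dedup_latest` and the column names `id` and `updated_at` are assumptions chosen to match the examples in this document; in a warehouse the same idea is usually expressed with a window function.

```python
def dedup_latest(rows, key="id", cursor="updated_at"):
    """Keep one row per key: the one with the greatest cursor value.

    Rows are dicts; ISO-8601 timestamp strings compare correctly as text.
    """
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[cursor] >= current[cursor]:
            latest[row[key]] = row
    return list(latest.values())


# Run 1 uploaded rows 1 and 2, then crashed before the cursor update.
# Run 2 re-exported both and also picked up an update to row 1.
exported = [
    {"id": 1, "updated_at": "2025-01-01T00:00:00", "status": "new"},
    {"id": 2, "updated_at": "2025-01-02T00:00:00", "status": "new"},
    {"id": 1, "updated_at": "2025-01-01T00:00:00", "status": "new"},
    {"id": 2, "updated_at": "2025-01-02T00:00:00", "status": "new"},
    {"id": 1, "updated_at": "2025-01-05T00:00:00", "status": "paid"},
]
for row in dedup_latest(exported):
    print(row)
```

Five exported rows collapse to two: the latest version of row 1 and the single surviving copy of row 2.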
Retry semantics
On failure, Rivet classifies the error and decides whether to retry:
| Category | Retry? | Reconnect? | Extra delay | Examples |
|---|---|---|---|---|
| Network | yes | yes | -- | connection reset, broken pipe, DNS, SSL, EOF |
| MySQL disconnect | yes | yes | -- | server gone away, lost connection |
| Timeout | yes | no | -- | statement timeout, lock wait timeout |
| Capacity | yes | yes | +15s | too many connections, DB starting/shutting down |
| Deadlock | yes | no | +1s | deadlock detected, serialization failure |
| Auth/permission | no | -- | -- | permission denied, access denied, invalid credentials |
| Permanent | no | -- | -- | syntax error, table not found, column not found |
On each retry, a fresh connection is created (never reuses a failed connection). Backoff is exponential: retry_backoff_ms * 2^(attempt-1) + extra_delay. The tuning profile controls max_retries and retry_backoff_ms.
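The backoff formula can be worked through with a few lines of Python. The extra delays come from the retry table above; the function name `retry_delays_ms` and the assumption that `attempt` starts at 1 for the first retry are mine.

```python
# Extra delay per error category, from the retry table (milliseconds).
EXTRA_DELAY_MS = {"capacity": 15_000, "deadlock": 1_000}


def retry_delays_ms(retry_backoff_ms, max_retries, category="network"):
    """Delay before each retry: retry_backoff_ms * 2^(attempt-1) + extra_delay."""
    extra = EXTRA_DELAY_MS.get(category, 0)
    return [retry_backoff_ms * 2 ** (attempt - 1) + extra
            for attempt in range(1, max_retries + 1)]


# balanced profile: retry_backoff_ms=2000, max_retries=3
print(retry_delays_ms(2_000, 3))               # [2000, 4000, 8000]
# safe profile hitting "too many connections": 5000 ms base, 5 retries, +15 s each
print(retry_delays_ms(5_000, 5, "capacity"))   # [20000, 25000, 35000, 55000, 95000]
```

Under these assumptions, the `safe` profile can wait close to four minutes in total across its five retries of a capacity error, which is the "retries patiently" behavior described under Choosing a Profile.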
Validation semantics
--validate re-reads the temp file after writing and compares the row count against the number of rows received from the source:
- Parquet: opens the file with the Arrow reader and reads `num_rows` from footer metadata.
- CSV: counts newlines (excluding header).
What it proves: the file on disk contains the expected number of rows (catches truncated writes, corrupt footers, I/O errors during flush).
What it does not prove:
- Cell-level correctness (no checksum on individual values).
- Source-to-file semantic equivalence (no re-query of the database to compare).
- Post-upload integrity (the file is validated before upload, not after).
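For the CSV case, the row-count check amounts to something like the Python sketch below. It is an approximation of the documented behavior (count newlines, exclude the header), with hypothetical function names; it also assumes no quoted field contains an embedded newline, which would inflate the count.

```python
import os
import tempfile


def csv_data_rows(path, has_header=True):
    """Count data rows by counting newline bytes, minus one for the header."""
    newlines = 0
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so large files are never loaded whole.
        for block in iter(lambda: f.read(1 << 20), b""):
            newlines += block.count(b"\n")
    return max(newlines - (1 if has_header else 0), 0)


def validate_row_count(path, rows_from_source):
    """True when the file on disk holds as many rows as were read from the source."""
    return csv_data_rows(path) == rows_from_source


# Demo: a header plus three data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,name\n1,ada\n2,bob\n3,cy\n")

print(validate_row_count(tmp.name, 3))  # True
print(validate_row_count(tmp.name, 4))  # False: e.g. a truncated write
os.remove(tmp.name)
```

As the list above notes, a matching count says nothing about the values inside the rows; it only catches files that are shorter or longer than expected.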
Supported Type Mappings
| PostgreSQL | MySQL | Arrow / Parquet |
|---|---|---|
| `BOOL` | `BIT` | `Boolean` |
| `INT2` / `SMALLINT` | `TINYINT`, `SMALLINT` | `Int16` |
| `INT4` / `INT` | `INT`, `MEDIUMINT` | `Int32` |
| `INT8` / `BIGINT` | `BIGINT` | `Int64` |
| `FLOAT4` | `FLOAT` | `Float32` |
| `FLOAT8` | `DOUBLE` | `Float64` |
| `TEXT`, `VARCHAR` | `VARCHAR`, `TEXT` | `Utf8` (String) |
| `BYTEA` | `BLOB` (binary charset) | `Binary` |
| `DATE` | `DATE` | `Date32` |
| `TIMESTAMP(TZ)` | `DATETIME`, `TIMESTAMP` | `Timestamp(us)` |
| `NUMERIC` | `DECIMAL` | `Utf8` (stringified) |
| `JSON` / `JSONB` | `JSON` | `Utf8` |
| `UUID` | -- | `Utf8` |
Guarantees and Limitations
What Rivet guarantees
- At-least-once delivery: if an export succeeds, all rows matching the query are written to at least one output file.
- State atomicity per export: cursor state is updated only after successful upload. A crash mid-export does not advance the cursor.
- Schema change detection: Rivet warns when columns are added, removed, or change type between runs.
- Validation on demand: `--validate` confirms row counts match between source read and file on disk.
- Predictable auth: credentials are resolved in a documented 4-layer order; no silent fallback surprises.
What Rivet does NOT guarantee
- No exactly-once delivery: duplicates can occur on crash recovery, overlapping windows, or non-monotonic cursors.
- No cell-level validation: `--validate` checks row count, not individual cell values or checksums.
- No CDC / real-time: Rivet runs point-in-time queries; it does not read WAL, binlog, or change streams.
- No load / merge: Rivet produces files. Loading them into a warehouse is your responsibility.
- No distributed execution: Rivet runs on a single machine. `parallel` spawns threads, not remote workers.
- No transactional consistency across exports: each export runs its own query; there is no cross-export snapshot isolation.
- No encryption: output files are written in plaintext. Encrypt at the destination level if needed.
See Execution Semantics for detailed lifecycle, state update, duplicate, retry, and validation rules.
Development
For pilot documentation (per-mode guides, destination setup, annotated YAML examples), see docs/.
For a step-by-step onboarding guide (from installation to production-ready exports), see USER_GUIDE.md.
For a manual user acceptance checklist (CLI, modes, destinations, compression, skip-empty), see USER_TEST_PLAN.md.
Local Setup
Start PostgreSQL and MySQL with Docker:
Seed both databases with test data (100K users, ~1M orders, ~5M events):
The seed tool supports flags:
--target postgres|mysql|both # which database to seed
--users N # number of users (default: 100000)
--orders-per-user N # avg orders per user (default: 10)
--events-per-user N # avg events per user (default: 50)
--batch-size N # insert batch size (default: 1000)
--pg-url URL # PostgreSQL connection URL
--mysql-url URL # MySQL connection URL
Toolchain
The project pins Rust 1.94 via rust-toolchain.toml. Install with:
Running Tests
End-to-end scripts (Docker Compose must be up, rivet built):
CI
GitHub Actions runs on every push/PR to master/main:
- Rustfmt -- formatting check
- Clippy -- lint check with `-D warnings`
- Tests -- full test suite
- Release build -- ensures `cargo build --release` succeeds
- Security audit -- `cargo audit` via `rustsec/audit-check`
Roadmap
See rivet_roadmap.md for the full roadmap (strategy + execution status).
Next milestones:
| Milestone | Focus |
|---|---|
| v0.2.0 (stable) | Cross-platform release binaries, E2E test matrix, cargo publish, Docker image |
| v0.3.0 | Source count reconciliation, crash/recovery tests, data shape drift detection, curated example configs |
| Future | CDC mode, Iceberg/Delta output, webhook destination, multi-source joins, plugin system |