rivet-cli 0.2.0-beta.2

# Rivet

**Lightweight, source-safe data extraction from PostgreSQL and MySQL to Parquet/CSV.**

Rivet is a CLI tool that exports query results from relational databases to files -- locally or in cloud storage (S3, GCS). It is **extract-only**: no loading, no merging, no CDC. It is designed to be gentle on production databases through tuning profiles, preflight health checks, and intelligent retry with backoff.

### What Rivet does

- Extracts data from **PostgreSQL** and **MySQL** via standard SQL queries
- Writes **Parquet** (zstd-compressed by default; snappy, gzip, lz4, none) or **CSV** files
- Uploads to **local disk**, **Amazon S3**, **Google Cloud Storage**, or **stdout** (pipe workflows)
- Tracks incremental state in **SQLite** so the next run picks up where the last left off
- Diagnoses source health **before** extraction (`rivet check`)
- Verifies auth for all sources and destinations **before** running (`rivet doctor`)
- Prints a structured **run summary** after each export (run ID, rows, files, bytes, duration, RSS, retries, schema changes)
- Persists **metrics history**, **schema tracking**, and **file manifest** in SQLite
- Recommends **parallelism level** and **tuning profile** in preflight checks
- **Parameterized queries** via `--param key=value` and `${key}` placeholders
- **Data quality checks** — row count bounds, null ratio thresholds, uniqueness assertions
- **File size splitting** — `max_file_size: 512MB` automatically splits output into parts
- **Memory-based batch sizing** — `batch_size_memory_mb: 256` auto-tunes batch size from schema width
- **Slack notifications** on failure, schema change, or degraded verdict

### What Rivet does NOT do

- **No loading/merging** -- it produces files; you bring them into a warehouse yourself
- **No CDC** -- no WAL/binlog reading; query-based extraction only
- **No orchestration** -- no built-in scheduler; use cron, Airflow, or similar
- **No exactly-once delivery** -- at-least-once; duplicates are possible (see [Execution Semantics](#execution-semantics))
- **No web UI / API** -- CLI and YAML config only

**Documentation language:** English-only. See [CONTRIBUTING.md](CONTRIBUTING.md).

> **New to Rivet?** Start with the [Pilot Documentation](docs/) — step-by-step guides for every export mode, destination, and YAML parameter, plus quickstart templates for your first export.

## Installation

### Homebrew (macOS / Linux) — recommended

```bash
brew tap panchenkoai/rivet
brew update
brew install rivet
rivet --version
```

### Pre-built binaries

Download the latest release for your platform from [GitHub Releases](https://github.com/panchenkoai/rivet/releases):

```bash
# macOS (Apple Silicon)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (arm64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/
```

Verify:

```bash
rivet --version
```

### Docker

Try Rivet without installing anything — mount your config and output directory:

```bash
docker run --rm \
  -v $(pwd)/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml
```

Or check the version and explore commands:

```bash
docker run --rm ghcr.io/panchenkoai/rivet:latest --version
docker run --rm ghcr.io/panchenkoai/rivet:latest --help
```

Pass environment variables for credentials:

```bash
docker run --rm \
  -e DATABASE_URL="postgresql://user:pass@host:5432/db" \
  -v $(pwd)/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml
```

> **Note:** To connect to a database running on your host machine, use `host.docker.internal` instead of `localhost` in the connection URL.

### Build from source

Requires Rust 1.94+:

```bash
git clone https://github.com/panchenkoai/rivet.git
cd rivet
cargo build --release
# binary is at target/release/rivet
```

## Quick Start

1. Create a config file `rivet.yaml`:

```yaml
source:
  type: postgres
  url: "postgresql://user:pass@localhost:5432/mydb"
  tuning:
    profile: safe

exports:
  - name: users
    query: "SELECT id, name, email, updated_at FROM users"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    destination:
      type: local
      path: ./output
```

2. Run preflight check to diagnose source health:

```bash
rivet check --config rivet.yaml
```

3. Verify auth for source and all destinations:

```bash
rivet doctor --config rivet.yaml
```

4. Run the export:

```bash
RUST_LOG=info rivet run --config rivet.yaml
```

5. Check state:

```bash
rivet state show --config rivet.yaml
```

## Working with the binary

Once installed, `rivet` is a single self-contained binary with no runtime dependencies (no JVM, no Python, no Docker required).

**Typical workflow:**

```bash
# 1. Preflight: check that the source DB is reachable and healthy
rivet check --config rivet.yaml

# 2. Auth: verify credentials for source + all destinations (S3, GCS, etc.)
rivet doctor --config rivet.yaml

# 3. Export: run all exports defined in the config
RUST_LOG=info rivet run --config rivet.yaml

# 4. Inspect: view cursor state and file manifest
rivet state show --config rivet.yaml
rivet state files --config rivet.yaml

# 5. Re-run: only new/changed rows are exported (incremental mode)
RUST_LOG=info rivet run --config rivet.yaml
```

**Useful flags:**

```bash
rivet run --config rivet.yaml --export users      # run a single export
rivet run --config rivet.yaml --validate           # reconcile row counts after write
rivet run --config rivet.yaml --param env=prod     # parameterized queries
rivet state reset --config rivet.yaml --export users  # reset cursor to re-export from scratch
```

**Logging:**

Rivet uses `RUST_LOG` for verbosity:

```bash
RUST_LOG=debug rivet run --config rivet.yaml    # verbose (SQL, batch timings, retries)
RUST_LOG=info  rivet run --config rivet.yaml    # normal (progress, summary)
RUST_LOG=warn  rivet run --config rivet.yaml    # quiet (errors and warnings only)
```

**Shell completions:**

```bash
# Bash
rivet completions bash > ~/.local/share/bash-completion/completions/rivet

# Zsh
rivet completions zsh > ~/.zfunc/_rivet

# Fish
rivet completions fish > ~/.config/fish/completions/rivet.fish
```

## CLI Reference

```
rivet run --config <path>                          # run all exports
rivet run --config <path> --export <name>          # run a specific export
rivet run --config <path> --validate               # verify row counts after write
rivet check --config <path>                        # preflight check all exports
rivet check --config <path> --export <name>        # preflight check one export
rivet doctor --config <path>                       # verify source + destination auth
rivet state show --config <path>                   # show cursor state
rivet state reset --config <path> --export <name>  # reset cursor
rivet state files --config <path>                  # show file manifest (which run created which files)
rivet metrics --config <path>                      # show export run history
rivet metrics --config <path> --export <name>      # metrics for one export
rivet metrics --config <path> --last N             # last N runs (default 20)
rivet completions <shell>                          # generate shell completions (bash|zsh|fish|powershell)
```

**Shell completions:**

```bash
# zsh (add to ~/.zshrc)
rivet completions zsh > ~/.zfunc/_rivet

# bash
rivet completions bash > /etc/bash_completion.d/rivet

# fish
rivet completions fish > ~/.config/fish/completions/rivet.fish
```

Set `RUST_LOG=info` (or `debug`) for detailed logging:

```bash
RUST_LOG=info rivet run --config rivet.yaml
```

## Choosing a Mode

| Mode | Best for | Key behavior |
|------|----------|--------------|
| `full` | Small tables, snapshots, one-off exports | Exports entire query result every run |
| `incremental` | Append-only or update-tracked tables | Resumes from the last exported value of `cursor_column` |
| `chunked` | Very large tables (10M+ rows) | Splits into ID-range windows; supports `parallel > 1` for concurrent extraction |
| `time_window` | Event logs, append-mostly data with timestamps | Exports only the last N days based on a time/date column |

**Decision rules:**

1. **Table < 1M rows, full snapshot needed** -- use `full`.
2. **Table has a monotonically increasing column** (auto-increment id, `updated_at` with triggers) -- use `incremental`. This is the most efficient mode for repeated runs.
3. **Table is very large and you need parallel extraction** -- use `chunked` with `parallel > 1`. Set `chunk_column` to the primary key. Watch out for sparse IDs (see [Sparse IDs](#sparse-ids-gaps-in-the-key-range)).
4. **You only need recent data** (e.g. last 7 days of events) -- use `time_window`. Set `time_column` and `days_window`.

**Can I combine modes?** No. Each export uses exactly one mode. If you need both incremental tracking and chunked extraction, use `incremental` for ongoing syncs and `chunked` for backfills.

## Choosing a Profile

| Profile | Source environment | Behavior |
|---------|--------------------|----------|
| `fast` | Dedicated replica, data warehouse, trusted environment | Large batches, no throttle, no timeouts, minimal retries |
| `balanced` | General-purpose source, moderate concurrent load | 10K batch, 50ms throttle, 5-min statement timeout, 3 retries |
| `safe` | Production OLTP, shared resources, fragile source | Small batches, 500ms throttle, 2-min timeout, 5 retries with long backoff |

**Decision rules:**

1. **Dedicated read replica or analytics database** -- `fast`. You own the capacity.
2. **Production database with other workloads** -- `balanced`. Good default.
3. **Production OLTP under high load, or a database you don't fully control** -- `safe`. Rivet backs off aggressively and retries patiently.

You can always override individual fields (e.g. `profile: safe` with `batch_size: 5000`).

## Config Reference

### Source

Two mutually exclusive styles for specifying database credentials:

**URL-based** -- set exactly one of `url`, `url_env`, or `url_file`:

```yaml
source:
  type: postgres   # postgres | mysql
  url: "postgresql://user:pass@host:port/db"
  tuning:          # optional, defaults to balanced
    profile: safe  # safe | balanced | fast
```

```yaml
source:
  type: mysql
  url_env: DATABASE_URL   # read full URL from this env var
```

**Structured** -- specify individual connection fields:

```yaml
source:
  type: postgres
  host: db.example.com
  port: 5433              # optional; defaults to 5432 (PG) / 3306 (MySQL)
  user: admin
  password_env: DB_PASS   # reads password from env var; or use 'password: literal'
  database: mydb
```

| Field | Required | Notes |
|-------|----------|-------|
| `host` | yes | |
| `user` | yes | |
| `database` | yes | |
| `port` | no | defaults to 5432 (postgres) / 3306 (mysql) |
| `password` | no | plaintext; prefer `password_env` |
| `password_env` | no | env var name containing the password |

URL-based and structured fields **cannot** be mixed. If both are present, validation rejects the config with a clear error.

### Source Tuning

Controls how aggressively rivet reads from the database. Three named profiles with individual field overrides:

```yaml
source:
  type: postgres
  url: "..."
  tuning:
    profile: safe              # base profile
    batch_size: 3000           # override: rows per fetch
    throttle_ms: 300           # override: sleep between fetches
    statement_timeout_s: 60    # override: per-query timeout
```

#### Profile Defaults

| Parameter | `fast` | `balanced` (default) | `safe` |
|---|---|---|---|
| `batch_size` | 50,000 | 10,000 | 2,000 |
| `throttle_ms` | 0 | 50 | 500 |
| `statement_timeout_s` | 0 (none) | 300 | 120 |
| `max_retries` | 1 | 3 | 5 |
| `retry_backoff_ms` | 1,000 | 2,000 | 5,000 |
| `lock_timeout_s` | 0 (none) | 30 | 10 |

**When to use each profile:**

- **fast** -- trusted environment, dedicated replica, need maximum throughput
- **balanced** -- general purpose, moderate load on source
- **safe** -- production OLTP database, shared resources, fragile source

If no `tuning` section is specified, `balanced` is used.

### Exports

Each export defines a query, format, mode, and destination:

```yaml
exports:
  - name: my_export            # unique name, used for state tracking
    query: "SELECT ..."        # SQL query to execute
    mode: full                 # full | incremental
    cursor_column: updated_at  # required for incremental mode
    format: parquet            # parquet | csv
    destination:
      type: local              # local | s3 | gcs
      path: ./output           # local: output directory
```

### Meta Columns

Add metadata columns to every output row -- useful for deduplication and lineage on the raw/staging layer.

```yaml
exports:
  - name: page_views
    query: "SELECT * FROM page_views"
    format: parquet
    meta_columns:
      exported_at: true   # adds _rivet_exported_at (UTC timestamp)
      row_hash: true      # adds _rivet_row_hash (xxh3_128 hex)
    destination:
      type: gcs
      bucket: my-bucket
```

| Column | Type | Description |
|--------|------|-------------|
| `_rivet_exported_at` | `Timestamp(us, UTC)` | When the batch was exported (same value for all rows in a batch) |
| `_rivet_row_hash` | `Int64` | Lower 64 bits of xxHash3-128 over all column values. Integer for fast `PARTITION BY` / `JOIN`. |

**Dedup pattern** (e.g. in BigQuery / DuckDB):

```sql
SELECT * FROM raw_page_views
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY _rivet_row_hash
  ORDER BY _rivet_exported_at DESC
) = 1
```

Both fields are optional and default to `false`. When disabled, no extra columns are added.

### Compression

Parquet compression is configurable per export. Default: **zstd** (better compression ratio than Snappy at comparable speed).

```yaml
exports:
  - name: orders
    query: "SELECT * FROM orders"
    format: parquet
    compression: zstd            # zstd | snappy | gzip | lz4 | none
    compression_level: 9         # optional; zstd 1..22 (default 3), gzip 0..10 (default 6)
    destination:
      type: local
      path: ./output
```

| Codec | Default level | Notes |
|-------|--------------|-------|
| `zstd` | 3 | Best ratio/speed tradeoff; new default |
| `snappy` | — | Fast, modest compression; previous default |
| `gzip` | 6 | Wide compatibility |
| `lz4` | — | Very fast decompression |
| `none` | — | No compression; largest files |

CSV exports ignore the compression setting.

### Skip Empty Exports

When running scheduled/incremental exports, zero new rows often means nothing changed. Use `skip_empty` to avoid creating empty files:

```yaml
exports:
  - name: events_inc
    query: "SELECT * FROM events"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    skip_empty: true             # no file created when 0 rows; cursor not advanced
    destination:
      type: gcs
      bucket: my-bucket
```

When `skip_empty: true` and the query returns 0 rows:
- No output file is created or uploaded
- Cursor state is **not** advanced (safe to rerun)
- Run summary shows `status: skipped`

Default: `false` (current behavior; 0-row exports still succeed with no file output).

### Destinations

**Local filesystem:**

```yaml
destination:
  type: local
  path: ./output
```

**Amazon S3:**

```yaml
destination:
  type: s3
  bucket: my-bucket
  prefix: exports/data/
  region: us-east-1
  endpoint: https://...       # optional, for S3-compatible storage
```

Credentials: either omit key env fields and use the default AWS chain, or set **both** `access_key_env` and `secret_key_env`. Details: [Credential precedence](#credential-precedence).

**Google Cloud Storage:**

```yaml
destination:
  type: gcs
  bucket: my-bucket
  prefix: exports/data/
  endpoint: https://...       # optional
  credentials_file: /path/to/sa.json   # optional; omit to use ADC / env (see below)
```

**GCS -- credentials:** see [Credential precedence](#credential-precedence). For day-to-day use on a workstation with a Google Cloud project, run `gcloud auth application-default login` and omit `credentials_file`; Rivet then uses **Application Default Credentials** (ADC).

**Stdout (pipe to another tool):**

```yaml
destination:
  type: stdout
```

Writes file contents directly to stdout. Useful for piping into `gzip`, `aws s3 cp -`, or other streaming consumers. Only practical with a single export (multiple exports would interleave output).

## Credential precedence

Rivet uses one predictable model for *where* secrets come from. Think of four layers (highest priority first). A higher layer **wins** when it applies; Rivet does **not** merge multiple cloud credential sources for the same destination.

| Priority | Layer | Meaning |
|----------|--------|---------|
| 1 | **Config** | Fields in `rivet.yaml` (URLs, `credentials_file`, names of env vars for S3 keys). |
| 2 | **Environment variables** | Process environment (`DATABASE_URL` via `url_env`, `${VAR}` expansion in `url`, `GOOGLE_APPLICATION_CREDENTIALS`, standard `AWS_*` variables). |
| 3 | **ADC / instance identity** | Provider default credentials with **no** explicit path in Rivet config (e.g. GCE/GKE metadata; local user ADC from `gcloud auth application-default login`). |
| 4 | **File-based material** | Secret **content** read from disk when a path is chosen by config or environment (e.g. `url_file`, `credentials_file`, or the file pointed to by `GOOGLE_APPLICATION_CREDENTIALS`). This is not a separate "guess"; it is always wired through layer 1 or 2. |

### Database (PostgreSQL / MySQL)

Two mutually exclusive styles:

**URL-based** -- set exactly **one** of `source.url`, `source.url_env`, or `source.url_file`. There is **no** fallback between them.

| Mechanism | Resolution |
|-----------|----------------|
| `url` | Connection string from config. Placeholders `${VAR}` are expanded from the environment when the config file is loaded (missing variables become empty). |
| `url_env` | The **entire** URL is read from the named environment variable. |
| `url_file` | The **entire** URL is read from the file path given in config (trimmed). |

**Structured** -- set `host`, `user`, `database` (and optionally `port`, `password` / `password_env`). Rivet builds the connection URL internally.

Cloud "ADC" does not apply to database URLs.

### Google Cloud Storage (GCS)

| Step | Source |
|------|--------|
| 1 | If `destination.credentials_file` is set -- use **only** that service account JSON path (config overrides env). |
| 2 | Else -- OpenDAL uses Google's default loader: `GOOGLE_APPLICATION_CREDENTIALS` (if set) -- JSON file at that path. |
| 3 | Else -- user ADC file from `gcloud auth application-default login` (well-known path under gcloud config). |
| 4 | Else -- GCE/GKE metadata-based service account when running on Google Cloud. |

If you omit `credentials_file`, set `RUST_LOG=info` and look for a log line stating that the default Google credential chain is in use.

### Amazon S3

| Step | Source |
|------|--------|
| 1 | If **both** `access_key_env` and `secret_key_env` are set -- read access key and secret **only** from those variable names (error if unset). |
| 2 | If **neither** is set -- OpenDAL's default AWS chain: environment variables, shared config files (e.g. `~/.aws/credentials`), then EC2/ECS instance metadata (IAM role). |

Setting **only one** of `access_key_env` or `secret_key_env` is invalid and rejected at config validation.

## Auth Diagnostics

`rivet doctor` verifies that source and destination credentials are valid before you run any exports:

```
$ rivet doctor --config rivet.yaml

rivet doctor: verifying auth for config 'rivet.yaml'

[OK]  Config parsed successfully
[OK]  Source auth (Postgres)
[OK]  Destination S3(my-bucket)
[FAIL] Destination GCS(other-bucket) -- auth error: loading credential ...

Some checks failed. Fix the issues above before running exports.
```

Error categories:

| Category | Meaning |
|----------|---------|
| `auth error` | Credentials are missing, expired, or rejected |
| `connectivity error` | Cannot reach the host (DNS, firewall, timeout) |
| `bucket not found` | Bucket or path does not exist |
| `error` | Other / uncategorized |

## Preflight Check

`rivet check` analyzes each export before running it. It connects to the source database, runs `EXPLAIN` on each query, and reports strategy, row estimates, verdicts, profile recommendations, and warnings:

```
$ rivet check --config rivet.yaml

Export: orders_incremental
  Strategy:     incremental(updated_at)
  Mode:         incremental (cursor: updated_at)
  Row estimate: ~1M
  Cursor range: 2024-01-01 .. 2025-01-30
  Scan type:    Index Scan using idx_orders_updated_at
  Verdict:      EFFICIENT
  Recommended:  tuning.profile: fast

Export: events_full
  Strategy:     full-scan
  Mode:         full
  Row estimate: ~5M
  Scan type:    Seq Scan on events
  Verdict:      DEGRADED
  Recommended:  tuning.profile: safe
  Suggestion:   No index detected -- full table scan. Add an indexed cursor
                column and switch to incremental mode. Use 'safe' tuning
                profile to limit database impact.

Export: orders_chunked
  Strategy:     chunked-parallel(id, size=100000, p=4)
  Mode:         chunked (column: id, size: 100000)
  Row estimate: ~10M
  Cursor range: 1 .. 50000000
  Scan type:    Index Scan using orders_pkey
  Verdict:      ACCEPTABLE
  Recommended:  tuning.profile: safe
  Warning:      Sparse key range: ~99% of chunk windows will be empty ...
  Suggestion:   Large dataset (~10M rows). Add parallel > 1 to speed up ...
```

### Strategy Names

| Strategy | When |
|---|---|
| `full-scan` | `mode: full`, parallel=1 |
| `full-parallel(N)` | `mode: full`, parallel > 1 |
| `incremental(col)` | `mode: incremental` |
| `chunked(col, size=N)` | `mode: chunked`, parallel=1 |
| `chunked-parallel(col, size=N, p=P)` | `mode: chunked`, parallel > 1 |
| `time-window(col, Nd)` | `mode: time_window` |

### Profile Recommendation

`rivet check` recommends a tuning profile based on row estimate and index usage:

| Condition | Recommendation |
|---|---|
| Indexed, < 1M rows | `fast` |
| Indexed, 1M-10M rows | `balanced` |
| Indexed, > 10M rows | `safe` |
| No index, < 100K rows | `fast` (or `balanced` with parallel) |
| No index, 100K-1M rows | `balanced` |
| No index, > 1M rows | `safe` |

### Warnings

| Warning | Trigger |
|---|---|
| **Sparse key range** | Chunked mode with < 10% density (range >> row count) |
| **Dense surrogate sort cost** | Query uses `ROW_NUMBER()` in chunked mode |
| **Parallel memory risk** | `parallel > 1` on > 5M rows |

### Verdicts

| Verdict | Meaning |
|---|---|
| `EFFICIENT` | Index scan on cursor column, reasonable row count (< 10M) |
| `ACCEPTABLE` | Index scan but very large dataset, or partial index coverage |
| `DEGRADED` | Full table scan detected, but row count is manageable |
| `UNSAFE` | Full scan on very large table (> 50M rows) without index support |

Suggestions are mode-aware: full exports recommend switching to incremental, chunked exports recommend indexing the chunk column, time-window exports recommend indexing the time column.

## Incremental Mode

When `mode: incremental` is set, rivet:

1. Reads the last exported cursor value from its SQLite state database
2. Appends `WHERE <cursor_column> > '<last_value>'` to the query
3. After a successful export, updates the cursor to the last row's value

The state database (`.rivet_state.db`) is created next to your config file.

## Chunked Extraction

Rivet never loads an entire table into memory with a single query. Instead:

- **PostgreSQL:** Uses server-side cursors (`DECLARE CURSOR` / `FETCH N`) to read `batch_size` rows at a time
- **MySQL:** Uses streaming result sets (`query_iter()`) to read rows incrementally

Between each batch, rivet sleeps for `throttle_ms` milliseconds, giving the database breathing room.

### Sparse IDs (gaps in the key range)

Chunked mode uses `MIN(chunk_column)` and `MAX(chunk_column)` from your export query, then issues `WHERE chunk_column BETWEEN start AND end` for each window. If the primary key is sparse (huge spread between min and max, few rows), most windows cover **no rows** but the database still plans and scans for each range.

**Mitigation:** chunk on a dense surrogate computed in SQL, for example `ROW_NUMBER() OVER (ORDER BY id) AS chunk_rownum`, and set `chunk_column: chunk_rownum` in the export. Then min/max match the row count, not the physical id span. A commented PostgreSQL example lives at [tests/fixtures/migrations/001_sparse_chunk_column_example.sql](tests/fixtures/migrations/001_sparse_chunk_column_example.sql).

**Cost tradeoff:** `ORDER BY id` (and therefore that window) is not free. The planner usually needs a global ordering of the rows you export: often a **sort** over the whole result, or an **index scan on `id`** if the shape of the query allows it -- either way you pay once per export pass, and under concurrent writes the ordering is tied to a snapshot. You are trading many cheap-but-useless `BETWEEN` probes on a sparse key for fewer chunk queries that each touch real rows, at the price of establishing dense row numbers. For very large or hot tables, prefer **`incremental`** mode on an indexed cursor column, a **precomputed dense key** (column or side table populated by batch jobs), or a **materialized view** refreshed off the critical path, if that fits your workload better than a window over live data.

## Run Summary

After each export, Rivet prints a structured summary to stdout:

```
── orders ──
  run_id:      orders_20260329T125109.336
  status:      success
  rows:        150000
  files:       1
  bytes:       12.4 MB
  duration:    3.2s
  peak RSS:    142MB
  validated:   pass
  schema:      unchanged
```

All summary fields are also persisted to the metrics table and visible via `rivet metrics`. The `run_id` links the summary to the corresponding rows in `export_metrics` and `file_manifest` tables.

| Field | Description |
|-------|-------------|
| `run_id` | Canonical identifier for this run (links summary, metrics, and files) |
| `status` | `success` or `failed` |
| `rows` | Total rows extracted |
| `files` | Number of files produced (1 for single-file modes; N for chunked) |
| `bytes` | Total file size before upload |
| `duration` | Wall-clock time for the export |
| `peak RSS` | Peak process RSS during the export (MB) |
| `retries` | Number of retry attempts (0 if no retries needed) |
| `validated` | `pass` if `--validate` succeeded; omitted if not requested |
| `schema` | `unchanged` or `CHANGED`; omitted on first run |
| `error` | Error message (only on failure) |

### File manifest

Every file produced by Rivet is recorded in the `file_manifest` table. Use `rivet state files` to inspect:

```
$ rivet state files --config rivet.yaml
RUN ID                              FILE                                         ROWS      BYTES CREATED
--------------------------------------------------------------------------------------------------------------
orders_20260329T125143.912          orders_20260329_125200_chunk3.parquet        50000    17.4 MB 2026-03-29T12:52:00+00:00
orders_20260329T125143.912          orders_20260329_125156_chunk2.parquet        50000    17.4 MB 2026-03-29T12:51:56+00:00
```

This enables post-run reconciliation: verify which run created which files and confirm row counts match expectations.

## Execution Semantics

### Export lifecycle

Every export follows a strict sequence. Steps that fail cause the entire export to fail; state is never updated on failure.

```
1. Config load + validation
2. State read (load cursor for incremental; load schema for tracking)
3. Source connect (new connection per attempt)
4. Query start
   - full/incremental: single query
   - chunked: detect min/max, generate range queries
   - time_window: rewrite query with WHERE clause
5. Batch loop
   a. FETCH batch_size rows → Arrow RecordBatch
   b. FormatWriter.write_batch() → temp file (flush per batch)
   c. Sleep throttle_ms
   d. Repeat until source exhausted
6. FormatWriter.finish() → finalize temp file
7. Validate (if --validate): read back temp file, compare row count
8. Destination.write() → upload temp file to local/S3/GCS
9. State update (incremental only): advance cursor to last row's value
10. Schema tracking: compare columns with stored schema, warn on change
11. Metrics: record run result (duration, rows, RSS, status)
```

### State update point

The cursor advances **only after** step 8 (successful upload). If any step fails, the cursor stays at its previous value. This means:

- A failed export can be safely re-run without skipping data.
- A successful upload followed by a process crash before step 9 causes the **next** run to re-export rows already uploaded (at-least-once semantics -- see Duplicates below).

### Duplicates

Rivet provides **at-least-once** delivery. Duplicates can occur in these scenarios:

| Scenario | Cause | Mitigation |
|----------|-------|------------|
| Crash after upload, before cursor update | Cursor is not advanced; next run re-exports the same window | Downstream dedup on primary key + cursor column |
| `time_window` with overlapping windows | Rows near the boundary appear in consecutive windows | Downstream dedup or idempotent merge |
| `incremental` with non-monotonic cursor | Rows inserted with cursor values older than the last exported value are missed; rows updated after export may be re-exported | Use a strictly monotonic column (e.g. auto-increment id, `updated_at` with triggers) |
| `chunked` with concurrent writes | New rows inserted during export may land in already-processed ranges | Accept overlap or run during quiescent periods |

Rivet **never** claims exactly-once delivery. Design downstream pipelines to tolerate duplicates.

### Retry semantics

On failure, Rivet classifies the error and decides whether to retry:

| Category | Retry? | Reconnect? | Extra delay | Examples |
|----------|--------|------------|-------------|----------|
| Network | yes | yes | -- | connection reset, broken pipe, DNS, SSL, EOF |
| MySQL disconnect | yes | yes | -- | server gone away, lost connection |
| Timeout | yes | no | -- | statement timeout, lock wait timeout |
| Capacity | yes | yes | +15s | too many connections, DB starting/shutting down |
| Deadlock | yes | no | +1s | deadlock detected, serialization failure |
| Auth/permission | **no** | -- | -- | permission denied, access denied, invalid credentials |
| Permanent | **no** | -- | -- | syntax error, table not found, column not found |

On each retry, a **fresh connection** is created (never reuses a failed connection). Backoff is exponential: `retry_backoff_ms * 2^(attempt-1) + extra_delay`. The tuning profile controls `max_retries` and `retry_backoff_ms`.

### Validation semantics

`--validate` re-reads the **temp file** after writing and compares the row count against the number of rows received from the source:

- **Parquet**: opens the file with the Arrow reader and reads `num_rows` from footer metadata.
- **CSV**: counts newlines (excluding header).

**What it proves:** the file on disk contains the expected number of rows (catches truncated writes, corrupt footers, I/O errors during flush).

**What it does not prove:**
- Cell-level correctness (no checksum on individual values).
- Source-to-file semantic equivalence (no re-query of the database to compare).
- Post-upload integrity (the file is validated before upload, not after).

## Supported Type Mappings

| PostgreSQL | MySQL | Arrow / Parquet |
|---|---|---|
| `BOOL` | `BIT` | Boolean |
| `INT2` / `SMALLINT` | `TINYINT`, `SMALLINT` | Int16 |
| `INT4` / `INT` | `INT`, `MEDIUMINT` | Int32 |
| `INT8` / `BIGINT` | `BIGINT` | Int64 |
| `FLOAT4` | `FLOAT` | Float32 |
| `FLOAT8` | `DOUBLE` | Float64 |
| `TEXT`, `VARCHAR` | `VARCHAR`, `TEXT` | Utf8 (String) |
| `BYTEA` | `BLOB` (binary charset) | Binary |
| `DATE` | `DATE` | Date32 |
| `TIMESTAMP(TZ)` | `DATETIME`, `TIMESTAMP` | Timestamp(us) |
| `NUMERIC` | `DECIMAL` | Utf8 (stringified) |
| `JSON` / `JSONB` | `JSON` | Utf8 |
| `UUID` | -- | Utf8 |

## Guarantees and Limitations

### What Rivet guarantees

- **At-least-once delivery**: if an export succeeds, all rows matching the query are written to at least one output file.
- **State atomicity per export**: cursor state is updated only after successful upload. A crash mid-export does not advance the cursor.
- **Schema change detection**: Rivet warns when columns are added, removed, or change type between runs.
- **Validation on demand**: `--validate` confirms row counts match between source read and file on disk.
- **Predictable auth**: credentials are resolved in a documented 4-layer order; no silent fallback surprises.

### What Rivet does NOT guarantee

- **No exactly-once delivery**: duplicates can occur on crash recovery, overlapping windows, or non-monotonic cursors.
- **No cell-level validation**: `--validate` checks row count, not individual cell values or checksums.
- **No CDC / real-time**: Rivet runs point-in-time queries; it does not read WAL, binlog, or change streams.
- **No load / merge**: Rivet produces files. Loading them into a warehouse is your responsibility.
- **No distributed execution**: Rivet runs on a single machine. `parallel` spawns threads, not remote workers.
- **No transactional consistency across exports**: each export runs its own query; there is no cross-export snapshot isolation.
- **No encryption**: output files are written in plaintext. Encrypt at the destination level if needed.

See [Execution Semantics](#execution-semantics) for detailed lifecycle, state update, duplicate, retry, and validation rules.

## Development

For **pilot documentation** (per-mode guides, destination setup, annotated YAML examples), see [docs/](docs/).

For a **step-by-step onboarding guide** (from installation to production-ready exports), see [USER_GUIDE.md](USER_GUIDE.md).

For a **manual user acceptance checklist** (CLI, modes, destinations, compression, skip-empty), see [USER_TEST_PLAN.md](USER_TEST_PLAN.md).

### Local Setup

Start PostgreSQL and MySQL with Docker:

```bash
docker compose up -d
```

Seed both databases with test data (100K users, ~1M orders, ~5M events):

```bash
cargo run --release --bin seed -- --target both --users 100000
```

The seed tool supports flags:

```
--target postgres|mysql|both    # which database to seed
--users N                       # number of users (default: 100000)
--orders-per-user N             # avg orders per user (default: 10)
--events-per-user N             # avg events per user (default: 50)
--batch-size N                  # insert batch size (default: 1000)
--pg-url URL                    # PostgreSQL connection URL
--mysql-url URL                 # MySQL connection URL
```

### Toolchain

The project pins Rust **1.94** via `rust-toolchain.toml`. Install with:

```bash
rustup install 1.94
```

### Running Tests

```bash
cargo test              # 617 unit + integration tests (no database needed)
cargo test -- --nocapture  # with output
cargo clippy --all-targets -- -D warnings  # lint check
cargo fmt --all -- --check                 # format check
```

End-to-end scripts (Docker Compose must be up, `rivet` built):

```bash
bash dev/test_permissions.sh
bash dev/test_schema_evolution.sh
```

### CI

GitHub Actions runs on every push/PR to `master`/`main`:

- **Rustfmt** — formatting check
- **Clippy** — lint check with `-D warnings`
- **Tests** — full test suite
- **Release build** — ensures `cargo build --release` succeeds
- **Security audit** — `cargo audit` via `rustsec/audit-check`

---

## Roadmap

See [rivet_roadmap.md](rivet_roadmap.md) for the full roadmap (strategy + execution status).

**Next milestones:**

| Milestone | Focus |
|-----------|-------|
| **v0.2.0** (stable) | Cross-platform release binaries, E2E test matrix, `cargo publish`, Docker image |
| **v0.3.0** | Source count reconciliation, crash/recovery tests, data shape drift detection, curated example configs |
| **Future** | CDC mode, Iceberg/Delta output, webhook destination, multi-source joins, plugin system |