rivet-cli 0.2.0-beta.2

CLI tool to export PostgreSQL and MySQL to Parquet/CSV (local, S3, GCS) with tuning, preflight checks, and SQLite-backed state.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
# Rivet

**Lightweight, source-safe data extraction from PostgreSQL and MySQL to Parquet/CSV.**

Rivet is a CLI tool that exports query results from relational databases to files -- locally or in cloud storage (S3, GCS). It is **extract-only**: no loading, no merging, no CDC. It is designed to be gentle on production databases through tuning profiles, preflight health checks, and intelligent retry with backoff.

### What Rivet does

- Extracts data from **PostgreSQL** and **MySQL** via standard SQL queries
- Writes **Parquet** (zstd-compressed by default; snappy, gzip, lz4, none) or **CSV** files
- Uploads to **local disk**, **Amazon S3**, **Google Cloud Storage**, or **stdout** (pipe workflows)
- Tracks incremental state in **SQLite** so the next run picks up where the last left off
- Diagnoses source health **before** extraction (`rivet check`)
- Verifies auth for all sources and destinations **before** running (`rivet doctor`)
- Prints a structured **run summary** after each export (run ID, rows, files, bytes, duration, RSS, retries, schema changes)
- Persists **metrics history**, **schema tracking**, and **file manifest** in SQLite
- Recommends **parallelism level** and **tuning profile** in preflight checks
- **Parameterized queries** via `--param key=value` and `${key}` placeholders
- **Data quality checks** — row count bounds, null ratio thresholds, uniqueness assertions
- **File size splitting** — `max_file_size: 512MB` automatically splits output into parts
- **Memory-based batch sizing** — `batch_size_memory_mb: 256` auto-tunes batch size from schema width
- **Slack notifications** on failure, schema change, or degraded verdict

### What Rivet does NOT do

- **No loading/merging** -- it produces files; you bring them into a warehouse yourself
- **No CDC** -- no WAL/binlog reading; query-based extraction only
- **No orchestration** -- no built-in scheduler; use cron, Airflow, or similar
- **No exactly-once delivery** -- at-least-once; duplicates are possible (see [Execution Semantics](#execution-semantics))
- **No web UI / API** -- CLI and YAML config only

**Documentation language:** English-only. See [CONTRIBUTING.md](CONTRIBUTING.md).

> **New to Rivet?** Start with the [Pilot Documentation](docs/) — step-by-step guides for every export mode, destination, and YAML parameter, plus quickstart templates for your first export.

## Installation

### Homebrew (macOS / Linux) — recommended

```bash
brew tap panchenkoai/rivet
brew update
brew install rivet
rivet --version
```

### Pre-built binaries

Download the latest release for your platform from [GitHub Releases](https://github.com/panchenkoai/rivet/releases):

```bash
# macOS (Apple Silicon)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-apple-darwin.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/

# Linux (arm64)
curl -L https://github.com/panchenkoai/rivet/releases/latest/download/rivet-aarch64-unknown-linux-gnu.tar.gz | tar xz
sudo mv rivet-*/rivet /usr/local/bin/
```

Verify:

```bash
rivet --version
```

### Docker

Try Rivet without installing anything — mount your config and output directory:

```bash
docker run --rm \
  -v $(pwd)/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml
```

Or check the version and explore commands:

```bash
docker run --rm ghcr.io/panchenkoai/rivet:latest --version
docker run --rm ghcr.io/panchenkoai/rivet:latest --help
```

Pass environment variables for credentials:

```bash
docker run --rm \
  -e DATABASE_URL="postgresql://user:pass@host:5432/db" \
  -v $(pwd)/rivet.yaml:/config/rivet.yaml \
  -v $(pwd)/output:/output \
  ghcr.io/panchenkoai/rivet:latest \
  run --config /config/rivet.yaml
```

> **Note:** To connect to a database running on your host machine, use `host.docker.internal` instead of `localhost` in the connection URL.

### Build from source

Requires Rust 1.94+:

```bash
git clone https://github.com/panchenkoai/rivet.git
cd rivet
cargo build --release
# binary is at target/release/rivet
```

## Quick Start

1. Create a config file `rivet.yaml`:

```yaml
source:
  type: postgres
  url: "postgresql://user:pass@localhost:5432/mydb"
  tuning:
    profile: safe

exports:
  - name: users
    query: "SELECT id, name, email, updated_at FROM users"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    destination:
      type: local
      path: ./output
```

2. Run preflight check to diagnose source health:

```bash
rivet check --config rivet.yaml
```

3. Verify auth for source and all destinations:

```bash
rivet doctor --config rivet.yaml
```

4. Run the export:

```bash
RUST_LOG=info rivet run --config rivet.yaml
```

5. Check state:

```bash
rivet state show --config rivet.yaml
```

## Working with the binary

Once installed, `rivet` is a single self-contained binary with no runtime dependencies (no JVM, no Python, no Docker required).

**Typical workflow:**

```bash
# 1. Preflight: check that the source DB is reachable and healthy
rivet check --config rivet.yaml

# 2. Auth: verify credentials for source + all destinations (S3, GCS, etc.)
rivet doctor --config rivet.yaml

# 3. Export: run all exports defined in the config
RUST_LOG=info rivet run --config rivet.yaml

# 4. Inspect: view cursor state and file manifest
rivet state show --config rivet.yaml
rivet state files --config rivet.yaml

# 5. Re-run: only new/changed rows are exported (incremental mode)
RUST_LOG=info rivet run --config rivet.yaml
```

**Useful flags:**

```bash
rivet run --config rivet.yaml --export users      # run a single export
rivet run --config rivet.yaml --validate           # reconcile row counts after write
rivet run --config rivet.yaml --param env=prod     # parameterized queries
rivet state reset --config rivet.yaml --export users  # reset cursor to re-export from scratch
```

**Logging:**

Rivet uses `RUST_LOG` for verbosity:

```bash
RUST_LOG=debug rivet run --config rivet.yaml    # verbose (SQL, batch timings, retries)
RUST_LOG=info  rivet run --config rivet.yaml    # normal (progress, summary)
RUST_LOG=warn  rivet run --config rivet.yaml    # quiet (errors and warnings only)
```

**Shell completions:**

```bash
# Bash
rivet completions bash > ~/.local/share/bash-completion/completions/rivet

# Zsh
rivet completions zsh > ~/.zfunc/_rivet

# Fish
rivet completions fish > ~/.config/fish/completions/rivet.fish
```

## CLI Reference

```
rivet run --config <path>                          # run all exports
rivet run --config <path> --export <name>          # run a specific export
rivet run --config <path> --validate               # verify row counts after write
rivet check --config <path>                        # preflight check all exports
rivet check --config <path> --export <name>        # preflight check one export
rivet doctor --config <path>                       # verify source + destination auth
rivet state show --config <path>                   # show cursor state
rivet state reset --config <path> --export <name>  # reset cursor
rivet state files --config <path>                  # show file manifest (which run created which files)
rivet metrics --config <path>                      # show export run history
rivet metrics --config <path> --export <name>      # metrics for one export
rivet metrics --config <path> --last N             # last N runs (default 20)
rivet completions <shell>                          # generate shell completions (bash|zsh|fish|powershell)
```

**Shell completions:**

```bash
# zsh (add to ~/.zshrc)
rivet completions zsh > ~/.zfunc/_rivet

# bash
rivet completions bash > /etc/bash_completion.d/rivet

# fish
rivet completions fish > ~/.config/fish/completions/rivet.fish
```

Set `RUST_LOG=info` (or `debug`) for detailed logging:

```bash
RUST_LOG=info rivet run --config rivet.yaml
```

## Choosing a Mode

| Mode | Best for | Key behavior |
|------|----------|--------------|
| `full` | Small tables, snapshots, one-off exports | Exports entire query result every run |
| `incremental` | Append-only or update-tracked tables | Resumes from the last exported value of `cursor_column` |
| `chunked` | Very large tables (10M+ rows) | Splits into ID-range windows; supports `parallel > 1` for concurrent extraction |
| `time_window` | Event logs, append-mostly data with timestamps | Exports only the last N days based on a time/date column |

**Decision rules:**

1. **Table < 1M rows, full snapshot needed** -- use `full`.
2. **Table has a monotonically increasing column** (auto-increment id, `updated_at` with triggers) -- use `incremental`. This is the most efficient mode for repeated runs.
3. **Table is very large and you need parallel extraction** -- use `chunked` with `parallel > 1`. Set `chunk_column` to the primary key. Watch out for sparse IDs (see [Sparse IDs](#sparse-ids-gaps-in-the-key-range)).
4. **You only need recent data** (e.g. last 7 days of events) -- use `time_window`. Set `time_column` and `days_window`.

**Can I combine modes?** No. Each export uses exactly one mode. If you need both incremental tracking and chunked extraction, use `incremental` for ongoing syncs and `chunked` for backfills.

## Choosing a Profile

| Profile | Source environment | Behavior |
|---------|--------------------|----------|
| `fast` | Dedicated replica, data warehouse, trusted environment | Large batches, no throttle, no timeouts, minimal retries |
| `balanced` | General-purpose source, moderate concurrent load | 10K batch, 50ms throttle, 5-min statement timeout, 3 retries |
| `safe` | Production OLTP, shared resources, fragile source | Small batches, 500ms throttle, 2-min timeout, 5 retries with long backoff |

**Decision rules:**

1. **Dedicated read replica or analytics database** -- `fast`. You own the capacity.
2. **Production database with other workloads** -- `balanced`. Good default.
3. **Production OLTP under high load, or a database you don't fully control** -- `safe`. Rivet backs off aggressively and retries patiently.

You can always override individual fields (e.g. `profile: safe` with `batch_size: 5000`).

## Config Reference

### Source

Two mutually exclusive styles for specifying database credentials:

**URL-based** -- set exactly one of `url`, `url_env`, or `url_file`:

```yaml
source:
  type: postgres   # postgres | mysql
  url: "postgresql://user:pass@host:port/db"
  tuning:          # optional, defaults to balanced
    profile: safe  # safe | balanced | fast
```

```yaml
source:
  type: mysql
  url_env: DATABASE_URL   # read full URL from this env var
```

**Structured** -- specify individual connection fields:

```yaml
source:
  type: postgres
  host: db.example.com
  port: 5433              # optional; defaults to 5432 (PG) / 3306 (MySQL)
  user: admin
  password_env: DB_PASS   # reads password from env var; or use 'password: literal'
  database: mydb
```

| Field | Required | Notes |
|-------|----------|-------|
| `host` | yes | |
| `user` | yes | |
| `database` | yes | |
| `port` | no | defaults to 5432 (postgres) / 3306 (mysql) |
| `password` | no | plaintext; prefer `password_env` |
| `password_env` | no | env var name containing the password |

URL-based and structured fields **cannot** be mixed. If both are present, validation rejects the config with a clear error.

### Source Tuning

Controls how aggressively rivet reads from the database. Three named profiles with individual field overrides:

```yaml
source:
  type: postgres
  url: "..."
  tuning:
    profile: safe              # base profile
    batch_size: 3000           # override: rows per fetch
    throttle_ms: 300           # override: sleep between fetches
    statement_timeout_s: 60    # override: per-query timeout
```

#### Profile Defaults

| Parameter | `fast` | `balanced` (default) | `safe` |
|---|---|---|---|
| `batch_size` | 50,000 | 10,000 | 2,000 |
| `throttle_ms` | 0 | 50 | 500 |
| `statement_timeout_s` | 0 (none) | 300 | 120 |
| `max_retries` | 1 | 3 | 5 |
| `retry_backoff_ms` | 1,000 | 2,000 | 5,000 |
| `lock_timeout_s` | 0 (none) | 30 | 10 |

**When to use each profile:**

- **fast** -- trusted environment, dedicated replica, need maximum throughput
- **balanced** -- general purpose, moderate load on source
- **safe** -- production OLTP database, shared resources, fragile source

If no `tuning` section is specified, `balanced` is used.

### Exports

Each export defines a query, format, mode, and destination:

```yaml
exports:
  - name: my_export            # unique name, used for state tracking
    query: "SELECT ..."        # SQL query to execute
    mode: full                 # full | incremental
    cursor_column: updated_at  # required for incremental mode
    format: parquet            # parquet | csv
    destination:
      type: local              # local | s3 | gcs
      path: ./output           # local: output directory
```

### Meta Columns

Add metadata columns to every output row -- useful for deduplication and lineage on the raw/staging layer.

```yaml
exports:
  - name: page_views
    query: "SELECT * FROM page_views"
    format: parquet
    meta_columns:
      exported_at: true   # adds _rivet_exported_at (UTC timestamp)
      row_hash: true      # adds _rivet_row_hash (xxh3_128 hex)
    destination:
      type: gcs
      bucket: my-bucket
```

| Column | Type | Description |
|--------|------|-------------|
| `_rivet_exported_at` | `Timestamp(us, UTC)` | When the batch was exported (same value for all rows in a batch) |
| `_rivet_row_hash` | `Int64` | Lower 64 bits of xxHash3-128 over all column values. Integer for fast `PARTITION BY` / `JOIN`. |

**Dedup pattern** (e.g. in BigQuery / DuckDB):

```sql
SELECT * FROM raw_page_views
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY _rivet_row_hash
  ORDER BY _rivet_exported_at DESC
) = 1
```

Both fields are optional and default to `false`. When disabled, no extra columns are added.

### Compression

Parquet compression is configurable per export. Default: **zstd** (better compression ratio than Snappy at comparable speed).

```yaml
exports:
  - name: orders
    query: "SELECT * FROM orders"
    format: parquet
    compression: zstd            # zstd | snappy | gzip | lz4 | none
    compression_level: 9         # optional; zstd 1..22 (default 3), gzip 0..10 (default 6)
    destination:
      type: local
      path: ./output
```

| Codec | Default level | Notes |
|-------|--------------|-------|
| `zstd` | 3 | Best ratio/speed tradeoff; new default |
| `snappy` | — | Fast, modest compression; previous default |
| `gzip` | 6 | Wide compatibility |
| `lz4` | — | Very fast decompression |
| `none` | — | No compression; largest files |

CSV exports ignore the compression setting.

### Skip Empty Exports

When running scheduled/incremental exports, zero new rows often means nothing changed. Use `skip_empty` to avoid creating empty files:

```yaml
exports:
  - name: events_inc
    query: "SELECT * FROM events"
    mode: incremental
    cursor_column: updated_at
    format: parquet
    skip_empty: true             # no file created when 0 rows; cursor not advanced
    destination:
      type: gcs
      bucket: my-bucket
```

When `skip_empty: true` and the query returns 0 rows:
- No output file is created or uploaded
- Cursor state is **not** advanced (safe to rerun)
- Run summary shows `status: skipped`

Default: `false` (current behavior; 0-row exports still succeed with no file output).

### Destinations

**Local filesystem:**

```yaml
destination:
  type: local
  path: ./output
```

**Amazon S3:**

```yaml
destination:
  type: s3
  bucket: my-bucket
  prefix: exports/data/
  region: us-east-1
  endpoint: https://...       # optional, for S3-compatible storage
```

Credentials: either omit key env fields and use the default AWS chain, or set **both** `access_key_env` and `secret_key_env`. Details: [Credential precedence](#credential-precedence).

**Google Cloud Storage:**

```yaml
destination:
  type: gcs
  bucket: my-bucket
  prefix: exports/data/
  endpoint: https://...       # optional
  credentials_file: /path/to/sa.json   # optional; omit to use ADC / env (see below)
```

**GCS -- credentials:** see [Credential precedence](#credential-precedence). For day-to-day use on a workstation with a Google Cloud project, run `gcloud auth application-default login` and omit `credentials_file`; Rivet then uses **Application Default Credentials** (ADC).

**Stdout (pipe to another tool):**

```yaml
destination:
  type: stdout
```

Writes file contents directly to stdout. Useful for piping into `gzip`, `aws s3 cp -`, or other streaming consumers. Only practical with a single export (multiple exports would interleave output).

## Credential precedence

Rivet uses one predictable model for *where* secrets come from. Think of four layers (highest priority first). A higher layer **wins** when it applies; Rivet does **not** merge multiple cloud credential sources for the same destination.

| Priority | Layer | Meaning |
|----------|--------|---------|
| 1 | **Config** | Fields in `rivet.yaml` (URLs, `credentials_file`, names of env vars for S3 keys). |
| 2 | **Environment variables** | Process environment (`DATABASE_URL` via `url_env`, `${VAR}` expansion in `url`, `GOOGLE_APPLICATION_CREDENTIALS`, standard `AWS_*` variables). |
| 3 | **ADC / instance identity** | Provider default credentials with **no** explicit path in Rivet config (e.g. GCE/GKE metadata; local user ADC from `gcloud auth application-default login`). |
| 4 | **File-based material** | Secret **content** read from disk when a path is chosen by config or environment (e.g. `url_file`, `credentials_file`, or the file pointed to by `GOOGLE_APPLICATION_CREDENTIALS`). This is not a separate "guess"; it is always wired through layer 1 or 2. |

### Database (PostgreSQL / MySQL)

Two mutually exclusive styles:

**URL-based** -- set exactly **one** of `source.url`, `source.url_env`, or `source.url_file`. There is **no** fallback between them.

| Mechanism | Resolution |
|-----------|----------------|
| `url` | Connection string from config. Placeholders `${VAR}` are expanded from the environment when the config file is loaded (missing variables become empty). |
| `url_env` | The **entire** URL is read from the named environment variable. |
| `url_file` | The **entire** URL is read from the file path given in config (trimmed). |

**Structured** -- set `host`, `user`, `database` (and optionally `port`, `password` / `password_env`). Rivet builds the connection URL internally.

Cloud "ADC" does not apply to database URLs.

### Google Cloud Storage (GCS)

| Step | Source |
|------|--------|
| 1 | If `destination.credentials_file` is set -- use **only** that service account JSON path (config overrides env). |
| 2 | Else -- OpenDAL uses Google's default loader: `GOOGLE_APPLICATION_CREDENTIALS` (if set) -- JSON file at that path. |
| 3 | Else -- user ADC file from `gcloud auth application-default login` (well-known path under gcloud config). |
| 4 | Else -- GCE/GKE metadata-based service account when running on Google Cloud. |

If you omit `credentials_file`, set `RUST_LOG=info` and look for a log line stating that the default Google credential chain is in use.

### Amazon S3

| Step | Source |
|------|--------|
| 1 | If **both** `access_key_env` and `secret_key_env` are set -- read access key and secret **only** from those variable names (error if unset). |
| 2 | If **neither** is set -- OpenDAL's default AWS chain: environment variables, shared config files (e.g. `~/.aws/credentials`), then EC2/ECS instance metadata (IAM role). |

Setting **only one** of `access_key_env` or `secret_key_env` is invalid and rejected at config validation.

## Auth Diagnostics

`rivet doctor` verifies that source and destination credentials are valid before you run any exports:

```
$ rivet doctor --config rivet.yaml

rivet doctor: verifying auth for config 'rivet.yaml'

[OK]  Config parsed successfully
[OK]  Source auth (Postgres)
[OK]  Destination S3(my-bucket)
[FAIL] Destination GCS(other-bucket) -- auth error: loading credential ...

Some checks failed. Fix the issues above before running exports.
```

Error categories:

| Category | Meaning |
|----------|---------|
| `auth error` | Credentials are missing, expired, or rejected |
| `connectivity error` | Cannot reach the host (DNS, firewall, timeout) |
| `bucket not found` | Bucket or path does not exist |
| `error` | Other / uncategorized |

## Preflight Check

`rivet check` analyzes each export before running it. It connects to the source database, runs `EXPLAIN` on each query, and reports strategy, row estimates, verdicts, profile recommendations, and warnings:

```
$ rivet check --config rivet.yaml

Export: orders_incremental
  Strategy:     incremental(updated_at)
  Mode:         incremental (cursor: updated_at)
  Row estimate: ~1M
  Cursor range: 2024-01-01 .. 2025-01-30
  Scan type:    Index Scan using idx_orders_updated_at
  Verdict:      EFFICIENT
  Recommended:  tuning.profile: fast

Export: events_full
  Strategy:     full-scan
  Mode:         full
  Row estimate: ~5M
  Scan type:    Seq Scan on events
  Verdict:      DEGRADED
  Recommended:  tuning.profile: safe
  Suggestion:   No index detected -- full table scan. Add an indexed cursor
                column and switch to incremental mode. Use 'safe' tuning
                profile to limit database impact.

Export: orders_chunked
  Strategy:     chunked-parallel(id, size=100000, p=4)
  Mode:         chunked (column: id, size: 100000)
  Row estimate: ~10M
  Cursor range: 1 .. 50000000
  Scan type:    Index Scan using orders_pkey
  Verdict:      ACCEPTABLE
  Recommended:  tuning.profile: safe
  Warning:      Sparse key range: ~99% of chunk windows will be empty ...
  Suggestion:   Large dataset (~10M rows). Add parallel > 1 to speed up ...
```

### Strategy Names

| Strategy | When |
|---|---|
| `full-scan` | `mode: full`, parallel=1 |
| `full-parallel(N)` | `mode: full`, parallel > 1 |
| `incremental(col)` | `mode: incremental` |
| `chunked(col, size=N)` | `mode: chunked`, parallel=1 |
| `chunked-parallel(col, size=N, p=P)` | `mode: chunked`, parallel > 1 |
| `time-window(col, Nd)` | `mode: time_window` |

### Profile Recommendation

`rivet check` recommends a tuning profile based on row estimate and index usage:

| Condition | Recommendation |
|---|---|
| Indexed, < 1M rows | `fast` |
| Indexed, 1M-10M rows | `balanced` |
| Indexed, > 10M rows | `safe` |
| No index, < 100K rows | `fast` (or `balanced` with parallel) |
| No index, 100K-1M rows | `balanced` |
| No index, > 1M rows | `safe` |

### Warnings

| Warning | Trigger |
|---|---|
| **Sparse key range** | Chunked mode with < 10% density (range >> row count) |
| **Dense surrogate sort cost** | Query uses `ROW_NUMBER()` in chunked mode |
| **Parallel memory risk** | `parallel > 1` on > 5M rows |

### Verdicts

| Verdict | Meaning |
|---|---|
| `EFFICIENT` | Index scan on cursor column, reasonable row count (< 10M) |
| `ACCEPTABLE` | Index scan but very large dataset, or partial index coverage |
| `DEGRADED` | Full table scan detected, but row count is manageable |
| `UNSAFE` | Full scan on very large table (> 50M rows) without index support |

Suggestions are mode-aware: full exports recommend switching to incremental, chunked exports recommend indexing the chunk column, time-window exports recommend indexing the time column.

## Incremental Mode

When `mode: incremental` is set, rivet:

1. Reads the last exported cursor value from its SQLite state database
2. Appends `WHERE <cursor_column> > '<last_value>'` to the query
3. After a successful export, updates the cursor to the last row's value

The state database (`.rivet_state.db`) is created next to your config file.

## Chunked Extraction

Rivet never loads an entire table into memory with a single query. Instead:

- **PostgreSQL:** Uses server-side cursors (`DECLARE CURSOR` / `FETCH N`) to read `batch_size` rows at a time
- **MySQL:** Uses streaming result sets (`query_iter()`) to read rows incrementally

Between each batch, rivet sleeps for `throttle_ms` milliseconds, giving the database breathing room.

### Sparse IDs (gaps in the key range)

Chunked mode uses `MIN(chunk_column)` and `MAX(chunk_column)` from your export query, then issues `WHERE chunk_column BETWEEN start AND end` for each window. If the primary key is sparse (huge spread between min and max, few rows), most windows cover **no rows** but the database still plans and scans for each range.

**Mitigation:** chunk on a dense surrogate computed in SQL, for example `ROW_NUMBER() OVER (ORDER BY id) AS chunk_rownum`, and set `chunk_column: chunk_rownum` in the export. Then min/max match the row count, not the physical id span. A commented PostgreSQL example lives at [tests/fixtures/migrations/001_sparse_chunk_column_example.sql](tests/fixtures/migrations/001_sparse_chunk_column_example.sql).

**Cost tradeoff:** `ORDER BY id` (and therefore that window) is not free. The planner usually needs a global ordering of the rows you export: often a **sort** over the whole result, or an **index scan on `id`** if the shape of the query allows it -- either way you pay once per export pass, and under concurrent writes the ordering is tied to a snapshot. You are trading many cheap-but-useless `BETWEEN` probes on a sparse key for fewer chunk queries that each touch real rows, at the price of establishing dense row numbers. For very large or hot tables, prefer **`incremental`** mode on an indexed cursor column, a **precomputed dense key** (column or side table populated by batch jobs), or a **materialized view** refreshed off the critical path, if that fits your workload better than a window over live data.

## Run Summary

After each export, Rivet prints a structured summary to stdout:

```
── orders ──
  run_id:      orders_20260329T125109.336
  status:      success
  rows:        150000
  files:       1
  bytes:       12.4 MB
  duration:    3.2s
  peak RSS:    142MB
  validated:   pass
  schema:      unchanged
```

All summary fields are also persisted to the metrics table and visible via `rivet metrics`. The `run_id` links the summary to the corresponding rows in `export_metrics` and `file_manifest` tables.

| Field | Description |
|-------|-------------|
| `run_id` | Canonical identifier for this run (links summary, metrics, and files) |
| `status` | `success` or `failed` |
| `rows` | Total rows extracted |
| `files` | Number of files produced (1 for single-file modes; N for chunked) |
| `bytes` | Total file size before upload |
| `duration` | Wall-clock time for the export |
| `peak RSS` | Peak process RSS during the export (MB) |
| `retries` | Number of retry attempts (0 if no retries needed) |
| `validated` | `pass` if `--validate` succeeded; omitted if not requested |
| `schema` | `unchanged` or `CHANGED`; omitted on first run |
| `error` | Error message (only on failure) |

### File manifest

Every file produced by Rivet is recorded in the `file_manifest` table. Use `rivet state files` to inspect:

```
$ rivet state files --config rivet.yaml
RUN ID                              FILE                                         ROWS      BYTES CREATED
--------------------------------------------------------------------------------------------------------------
orders_20260329T125143.912          orders_20260329_125200_chunk3.parquet        50000    17.4 MB 2026-03-29T12:52:00+00:00
orders_20260329T125143.912          orders_20260329_125156_chunk2.parquet        50000    17.4 MB 2026-03-29T12:51:56+00:00
```

This enables post-run reconciliation: verify which run created which files and confirm row counts match expectations.

## Execution Semantics

### Export lifecycle

Every export follows a strict sequence. Steps that fail cause the entire export to fail; state is never updated on failure.

```
1. Config load + validation
2. State read (load cursor for incremental; load schema for tracking)
3. Source connect (new connection per attempt)
4. Query start
   - full/incremental: single query
   - chunked: detect min/max, generate range queries
   - time_window: rewrite query with WHERE clause
5. Batch loop
   a. FETCH batch_size rows → Arrow RecordBatch
   b. FormatWriter.write_batch() → temp file (flush per batch)
   c. Sleep throttle_ms
   d. Repeat until source exhausted
6. FormatWriter.finish() → finalize temp file
7. Validate (if --validate): read back temp file, compare row count
8. Destination.write() → upload temp file to local/S3/GCS
9. State update (incremental only): advance cursor to last row's value
10. Schema tracking: compare columns with stored schema, warn on change
11. Metrics: record run result (duration, rows, RSS, status)
```

### State update point

The cursor advances **only after** step 8 (successful upload). If any step fails, the cursor stays at its previous value. This means:

- A failed export can be safely re-run without skipping data.
- A successful upload followed by a process crash before step 9 causes the **next** run to re-export rows already uploaded (at-least-once semantics -- see Duplicates below).

### Duplicates

Rivet provides **at-least-once** delivery. Duplicates can occur in these scenarios:

| Scenario | Cause | Mitigation |
|----------|-------|------------|
| Crash after upload, before cursor update | Cursor is not advanced; next run re-exports the same window | Downstream dedup on primary key + cursor column |
| `time_window` with overlapping windows | Rows near the boundary appear in consecutive windows | Downstream dedup or idempotent merge |
| `incremental` with non-monotonic cursor | Rows inserted with cursor values older than the last exported value are missed; rows updated after export may be re-exported | Use a strictly monotonic column (e.g. auto-increment id, `updated_at` with triggers) |
| `chunked` with concurrent writes | New rows inserted during export may land in already-processed ranges | Accept overlap or run during quiescent periods |

Rivet **never** claims exactly-once delivery. Design downstream pipelines to tolerate duplicates.

### Retry semantics

On failure, Rivet classifies the error and decides whether to retry:

| Category | Retry? | Reconnect? | Extra delay | Examples |
|----------|--------|------------|-------------|----------|
| Network | yes | yes | -- | connection reset, broken pipe, DNS, SSL, EOF |
| MySQL disconnect | yes | yes | -- | server gone away, lost connection |
| Timeout | yes | no | -- | statement timeout, lock wait timeout |
| Capacity | yes | yes | +15s | too many connections, DB starting/shutting down |
| Deadlock | yes | no | +1s | deadlock detected, serialization failure |
| Auth/permission | **no** | -- | -- | permission denied, access denied, invalid credentials |
| Permanent | **no** | -- | -- | syntax error, table not found, column not found |

On each retry, a **fresh connection** is created (never reuses a failed connection). Backoff is exponential: `retry_backoff_ms * 2^(attempt-1) + extra_delay`. The tuning profile controls `max_retries` and `retry_backoff_ms`.

### Validation semantics

`--validate` re-reads the **temp file** after writing and compares the row count against the number of rows received from the source:

- **Parquet**: opens the file with the Arrow reader and reads `num_rows` from footer metadata.
- **CSV**: counts newlines (excluding header).

**What it proves:** the file on disk contains the expected number of rows (catches truncated writes, corrupt footers, I/O errors during flush).

**What it does not prove:**
- Cell-level correctness (no checksum on individual values).
- Source-to-file semantic equivalence (no re-query of the database to compare).
- Post-upload integrity (the file is validated before upload, not after).

## Supported Type Mappings

| PostgreSQL | MySQL | Arrow / Parquet |
|---|---|---|
| `BOOL` | `BIT` | Boolean |
| `INT2` / `SMALLINT` | `TINYINT`, `SMALLINT` | Int16 |
| `INT4` / `INT` | `INT`, `MEDIUMINT` | Int32 |
| `INT8` / `BIGINT` | `BIGINT` | Int64 |
| `FLOAT4` | `FLOAT` | Float32 |
| `FLOAT8` | `DOUBLE` | Float64 |
| `TEXT`, `VARCHAR` | `VARCHAR`, `TEXT` | Utf8 (String) |
| `BYTEA` | `BLOB` (binary charset) | Binary |
| `DATE` | `DATE` | Date32 |
| `TIMESTAMP(TZ)` | `DATETIME`, `TIMESTAMP` | Timestamp(us) |
| `NUMERIC` | `DECIMAL` | Utf8 (stringified) |
| `JSON` / `JSONB` | `JSON` | Utf8 |
| `UUID` | -- | Utf8 |

## Guarantees and Limitations

### What Rivet guarantees

- **At-least-once delivery**: if an export succeeds, all rows matching the query are written to at least one output file.
- **State atomicity per export**: cursor state is updated only after successful upload. A crash mid-export does not advance the cursor.
- **Schema change detection**: Rivet warns when columns are added, removed, or change type between runs.
- **Validation on demand**: `--validate` confirms row counts match between source read and file on disk.
- **Predictable auth**: credentials are resolved in a documented 4-layer order; no silent fallback surprises.

### What Rivet does NOT guarantee

- **No exactly-once delivery**: duplicates can occur on crash recovery, overlapping windows, or non-monotonic cursors.
- **No cell-level validation**: `--validate` checks row count, not individual cell values or checksums.
- **No CDC / real-time**: Rivet runs point-in-time queries; it does not read WAL, binlog, or change streams.
- **No load / merge**: Rivet produces files. Loading them into a warehouse is your responsibility.
- **No distributed execution**: Rivet runs on a single machine. `parallel` spawns threads, not remote workers.
- **No transactional consistency across exports**: each export runs its own query; there is no cross-export snapshot isolation.
- **No encryption**: output files are written in plaintext. Encrypt at the destination level if needed.

See [Execution Semantics](#execution-semantics) for detailed lifecycle, state update, duplicate, retry, and validation rules.

## Development

For **pilot documentation** (per-mode guides, destination setup, annotated YAML examples), see [docs/](docs/).

For a **step-by-step onboarding guide** (from installation to production-ready exports), see [USER_GUIDE.md](USER_GUIDE.md).

For a **manual user acceptance checklist** (CLI, modes, destinations, compression, skip-empty), see [USER_TEST_PLAN.md](USER_TEST_PLAN.md).

### Local Setup

Start PostgreSQL and MySQL with Docker:

```bash
docker compose up -d
```

Seed both databases with test data (100K users, ~1M orders, ~5M events):

```bash
cargo run --release --bin seed -- --target both --users 100000
```

The seed tool supports flags:

```
--target postgres|mysql|both    # which database to seed
--users N                       # number of users (default: 100000)
--orders-per-user N             # avg orders per user (default: 10)
--events-per-user N             # avg events per user (default: 50)
--batch-size N                  # insert batch size (default: 1000)
--pg-url URL                    # PostgreSQL connection URL
--mysql-url URL                 # MySQL connection URL
```

### Toolchain

The project pins Rust **1.94** via `rust-toolchain.toml`. Install with:

```bash
rustup install 1.94
```

### Running Tests

```bash
cargo test              # 617 unit + integration tests (no database needed)
cargo test -- --nocapture  # with output
cargo clippy --all-targets -- -D warnings  # lint check
cargo fmt --all -- --check                 # format check
```

End-to-end scripts (Docker Compose must be up, `rivet` built):

```bash
bash dev/test_permissions.sh
bash dev/test_schema_evolution.sh
```

### CI

GitHub Actions runs on every push/PR to `master`/`main`:

- **Rustfmt** — formatting check
- **Clippy** — lint check with `-D warnings`
- **Tests** — full test suite
- **Release build** — ensures `cargo build --release` succeeds
- **Security audit** — `cargo audit` via `rustsec/audit-check`

---

## Roadmap

See [rivet_roadmap.md](rivet_roadmap.md) for the full roadmap (strategy + execution status).

**Next milestones:**

| Milestone | Focus |
|-----------|-------|
| **v0.2.0** (stable) | Cross-platform release binaries, E2E test matrix, `cargo publish`, Docker image |
| **v0.3.0** | Source count reconciliation, crash/recovery tests, data shape drift detection, curated example configs |
| **Future** | CDC mode, Iceberg/Delta output, webhook destination, multi-source joins, plugin system |