# Chunked Export Mode
## When to use
Use `mode: chunked` when the table is too large for a single `full` export. Rivet splits the data into ranges by a numeric ID column and can process multiple chunks in parallel. Best for:
- Tables with millions or billions of rows
- Tables with a numeric primary key (`BIGINT`, `SERIAL`)
- When you need parallel extraction to save time
- Initial loads of large tables
## Required fields
- `chunk_column` -- a numeric, date, or timestamp column to partition by (typically the primary key). **Auto-resolved from the primary key** if you use the `table: schema.name` shortcut on a Postgres source — log line: `auto-resolved chunk_column = 'id' from primary key on public.orders`.
## Chunking strategies — pick one
Four ways to slice the table. They differ in how chunk boundaries are computed; everything below the strategy line (parallel, checkpoint, retry) is orthogonal and combines with any of them.
| **Fixed size** (default) | `chunk_size: 100000` | `SELECT MIN, MAX` → `[min..min+N)`, `[min+N..min+2N)`, ... — `N` rows of range per chunk | Dense numeric PK, predictable size budget per chunk | `chunk_size_memory_mb` |
| **Fixed count** | `chunk_count: 16` | Range divided into exactly `N` equal slices; per-chunk size derived dynamically | You want exactly *N* workers / files (e.g. = CPU cores) | `chunk_dense`, `chunk_by_days` |
| **Dense** | `chunk_dense: true` | `ROW_NUMBER() OVER (ORDER BY chunk_column)` instead of range; guarantees equal **row count** per chunk regardless of gaps | Sparse IDs — UUIDs as `BIGINT`, deleted rows, hashed keys | `chunk_by_days` |
| **Date-native** | `chunk_by_days: 365` | `chunk_column` must be `DATE` / `TIMESTAMP` / `TIMESTAMPTZ`; windows of N days with `>= AND <` (open-end) semantics | Time-series, event logs, historical backfills by period | `chunk_dense` |
| **Memory-target** (PG only) | `chunk_size_memory_mb: 256` | Auto-computes `chunk_size` from a `pg_class` / `reltuples` row-size estimate; clamped to `[10_000, 5_000_000]` rows. Requires `table:` shortcut | You want to budget by megabytes, not rows; wide tables where row-width is hard to guess | explicit `chunk_size` |
**Orthogonal options that combine with any strategy:**
| `parallel: N` | Up to `N` chunks execute concurrently (separate DB connections) |
| `chunk_checkpoint: true` | Per-chunk row in state DB → `rivet run --resume` skips completed chunks after a crash |
| `chunk_max_attempts: 3` | Retry failed chunks up to N times before bailing the run |
**Each chunk runs a query of the form:** `SELECT … FROM (<base_query>) WHERE <chunk_column> >= $lo AND <chunk_column> < $hi`. The exact rendering for dense and date-native variants is documented further down.
## Minimal config
```yaml
source:
type: postgres
url: "postgresql://user:pass@host:5432/dbname"
exports:
- name: orders_chunked
query: "SELECT id, user_id, product, price, status, ordered_at FROM orders"
mode: chunked
chunk_column: id # numeric column to split ranges on
chunk_size: 100000 # rows per chunk (default: 100,000)
parallel: 4 # concurrent chunk workers
format: parquet
destination:
type: local
path: ./output
```
Output files: one per chunk, e.g. `orders_chunked_20260406_120000_chunk0.parquet`
## Run it
```bash
# Preflight — shows chunk plan (how many chunks, range distribution)
rivet check --config large_table.yaml
# Run with validation and reconciliation
rivet run --config large_table.yaml --validate --reconcile
```
## Progress bar (chunked exports)
In **`mode: chunked`**, Rivet shows a **terminal progress bar** while chunks run: export name, `current/total` chunks, running row count, elapsed time, and ETA. It appears when stderr is an **interactive TTY** (a normal terminal window). The bar does **not** depend on **`RUST_LOG`** (that variable only controls `env_logger` text lines). In CI, or when you **pipe or redirect** stderr, the bar is usually suppressed — then set **`RUST_LOG=info`** (or `debug`) to follow progress in the log instead.

The GIF above was recorded with `RUST_LOG=info` on a 50,000-row fixture (10 chunks of 5,000) so the per-chunk log line `export 'events': chunk N/10 (...)` and the final summary are both visible. On a real interactive terminal you would see the progress bar instead; the log lines appear when stderr is captured.
Use a **small `chunk_size`** relative to your table if you want many steps on the bar (each finished chunk advances it once). **`parallel: 1`** still updates the bar after each sequential chunk.
**Ready-made example in this repo:** [`dev/scenarios/chunked_postgres_bench.yaml`](../../dev/scenarios/chunked_postgres_bench.yaml) includes **`bench_content_p4_safe`**: PostgreSQL `content_items` with **`parallel: 4`** and **`tuning.profile: safe`** (good for trying the bar on a wide table without hammering the source). Other exports in the same file cover serial / highly parallel / fatchunk / balanced profiles.
```bash
# From repo root; Postgres up + seeded (e.g. docker compose + cargo run --bin seed ...)
mkdir -p dev/output/bench
rivet check --config dev/scenarios/chunked_postgres_bench.yaml
rivet run --config dev/scenarios/chunked_postgres_bench.yaml --export bench_content_p4_safe
# Optional: RUST_LOG=info for more log detail; RUST_LOG=warn to reduce log noise (bar unchanged in a TTY)
```
## What happens
1. Rivet queries `SELECT MIN(id), MAX(id) FROM orders` to determine the range
2. Splits into chunks: `[min..min+chunk_size)`, `[min+chunk_size..min+2*chunk_size)`, ...
3. Each chunk runs independently: `SELECT ... WHERE id >= $lo AND id < $hi`
4. With `parallel: 4`, up to 4 chunks execute concurrently
5. Each chunk writes a separate output file
## Chunk checkpoint (resume after crash)
For very large exports, enable checkpointing so you can resume from where you left off:
```yaml
exports:
- name: orders_chunked
query: "SELECT id, user_id, product, price, ordered_at FROM orders"
mode: chunked
chunk_column: id
chunk_size: 100000
parallel: 4
chunk_checkpoint: true # persist progress per chunk
chunk_max_attempts: 3 # retry failed chunks up to 3 times
format: parquet
destination:
type: local
path: ./output
```
Resume after a crash:
```bash
# Resume only processes incomplete chunks
rivet run --config large_table.yaml --resume
# View checkpoint status
rivet state chunks --config large_table.yaml --export orders_chunked
# Clear checkpoint (to re-export from scratch)
rivet state reset-chunks --config large_table.yaml --export orders_chunked
```
## Chunk sizing guidance
| 1M rows | 100,000 | 2 |
| 10M rows | 100,000 | 4 |
| 100M+ rows | 200,000-500,000 | 4-8 |
Larger chunks = fewer queries but more memory per batch. Smaller chunks = more queries but lower peak RSS.
## Date-based chunking
When your table's natural partition boundary is **time** rather than a numeric ID, use `chunk_by_days` instead of relying on integer ranges.
```yaml
exports:
- name: orders_by_year
query: "SELECT id, user_id, product, price, ordered_at FROM orders"
mode: chunked
chunk_column: ordered_at # DATE or TIMESTAMP column
chunk_by_days: 365 # one chunk per ~year
format: parquet
destination:
type: local
path: ./output
```
Rivet fetches `MIN` / `MAX` of the column as text, parses the dates, then generates non-overlapping windows:
```sql
-- each chunk window (open-end exclusive):
WHERE ordered_at >= '2023-01-01' AND ordered_at < '2024-01-01'
WHERE ordered_at >= '2024-01-01' AND ordered_at < '2025-01-01'
...
```
The open-end `< end_date` bound is intentional: it correctly captures all `TIMESTAMP` values within the day, including `23:59:59.999…`.
**When to use date chunking over numeric chunking:**
- The table has no dense numeric PK (UUIDs, composite keys)
- You want even partitions by time, not by row count
- The source DB has better statistics / indexes on the timestamp column
- You want to avoid unix-epoch arithmetic that JDBC tools often get wrong
`chunk_by_days` can be combined with `parallel` for concurrent date windows, and supports `chunk_checkpoint` / `--resume` like numeric chunked mode.
`chunk_dense: true` is incompatible with `chunk_by_days` and will be rejected at config validation.
`rivet check` will report the strategy as `date-chunked(ordered_at, 365d)`.
## Sparse ID ranges
If IDs have large gaps (e.g. UUIDs cast to BIGINT, or deleted rows), many chunks may be empty. Use `chunk_dense: true` to use `ROW_NUMBER()` ordering instead:
```yaml
exports:
- name: sparse_table
query: "SELECT id, payload FROM orders_sparse"
mode: chunked
chunk_column: id
chunk_size: 50000
chunk_dense: true # uses ROW_NUMBER() instead of range splitting
format: parquet
destination:
type: local
path: ./output
```
`rivet check` will warn you about sparse ranges.
## Troubleshooting
**Many empty chunks** -- Your ID column has gaps. Add `chunk_dense: true`.
**High memory usage with parallel > 1** -- Reduce `chunk_size` or add `tuning.profile: safe`.
**Export fails midway through 1000 chunks** -- Enable `chunk_checkpoint: true` and re-run with `--resume`.