# Tuning Reference
Tuning controls how Rivet queries the source database: batch sizes, timeouts, throttling, and retries.
## Where to place tuning
Tuning can be set at two levels:
1. **Global** (`source.tuning`) -- applies to all exports
2. **Per-export** (`exports[].tuning`) -- overrides global for that export
Per-export values take precedence. Unset per-export fields fall back to the global value.
```yaml
source:
type: postgres
url_env: DATABASE_URL
tuning:
profile: balanced # global default
batch_size: 10000
exports:
- name: small_table
query: "SELECT * FROM users"
format: parquet
destination: { type: local, path: ./out }
# inherits global tuning (balanced, batch_size=10000)
- name: huge_table
query: "SELECT * FROM events"
format: parquet
destination: { type: local, path: ./out }
tuning:
profile: safe # override for this export only
batch_size: 2000
```
> **Common mistake:** placing `batch_size` directly under `source:` or in the export root instead of under `tuning:`. Rivet will reject such configs with a clear error message.
## Profiles
A profile sets sensible defaults for all tuning parameters. Individual fields override the profile.
| Parameter | `fast` | `balanced` (default) | `safe` |
|-----------|--------|---------------------|--------|
| `batch_size` | 50,000 | 10,000 | 2,000 |
| `throttle_ms` | 0 | 50 | 500 |
| `statement_timeout_s` | 0 (none) | 300 | 120 |
| `max_retries` | 1 | 3 | 5 |
| `retry_backoff_ms` | 1,000 | 2,000 | 5,000 |
| `lock_timeout_s` | 0 (none) | 30 | 10 |
| `memory_threshold_mb` | 0 (none) | 4,096 | 2,048 |
### When to use each profile
| Profile | Use case |
|---------|----------|
| **fast** | Dedicated read replica, off-peak hours, small tables |
| **balanced** | General purpose, shared database, production reads |
| **safe** | Busy production database, OLTP systems, wide tables with large rows |
## All tuning parameters
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `profile` | `fast` \| `balanced` \| `safe` | `balanced` | Base profile (sets defaults for all other fields) |
| `batch_size` | integer | profile default | Number of rows fetched per query batch |
| `batch_size_memory_mb` | integer | — | Target memory per batch in MB (adaptive sizing; mutually exclusive with `batch_size`) |
| `throttle_ms` | integer | profile default | Delay in ms between batches (reduces source load) |
| `statement_timeout_s` | integer | profile default | Database statement timeout in seconds (0 = no timeout) |
| `max_retries` | integer | profile default | Max retry attempts for transient errors |
| `retry_backoff_ms` | integer | profile default | Base delay between retries in ms (exponential backoff) |
| `lock_timeout_s` | integer | profile default | Database lock timeout in seconds (0 = no timeout) |
| `memory_threshold_mb` | integer | profile default | RSS threshold in MB; pauses fetching if exceeded (0 = disabled). `balanced` defaults to 4096, `safe` to 2048, `fast` to 0 (no limit). |
| `max_batch_memory_mb` | integer | — | Hard cap on a single Arrow batch in MB. When exceeded, `on_batch_memory_exceeded` determines the response. |
| `on_batch_memory_exceeded` | `warn` \| `fail` \| `auto_shrink` | `warn` | Policy applied when a batch exceeds `max_batch_memory_mb`. |
### Batch memory cap (`max_batch_memory_mb`)
`memory_threshold_mb` is a process-level RSS guard — it fires after the OS has already committed memory. `max_batch_memory_mb` is an earlier, batch-level guard: it measures the actual Arrow buffer footprint of each batch before it is written.
```yaml
tuning:
max_batch_memory_mb: 128
on_batch_memory_exceeded: warn # warn | fail | auto_shrink
```
| `warn` | **(default)** Log a warning with the actual size, the limit, and a suggested `batch_size`. Continue the export. |
| `fail` | Return an error immediately. The export stops. Use in strict pipelines where oversized batches indicate a configuration problem. |
| `auto_shrink` | Split the oversized batch in half recursively until each sub-batch fits within the limit, then write the sub-batches individually. Transparent to the rest of the pipeline — total row count and output are identical. |
The warning and error messages include a suggested `batch_size`:
```
batch memory 184 MB exceeds max_batch_memory_mb=128 MB (5000 rows).
Consider lowering batch_size to ~3478.
```
Use `auto_shrink` when you want protection against accidental wide-table OOM without needing to tune `batch_size` manually. Use `fail` in CI pipelines where any oversized batch should block the run.
## Choosing `batch_size`
`batch_size` is the most impactful parameter for both performance and memory usage.
| batch_size | Memory per batch (narrow table) | Memory per batch (wide table) | Best for |
| 5,000 | ~5-25 MB | ~100-500 MB | Medium tables, shared databases |
| 10,000 | ~10-50 MB | ~200 MB - 1 GB | General purpose (default balanced) |
| 50,000 | ~50-250 MB | ~1-5 GB | Read replicas, fast profile |
For wide tables (50+ columns, TEXT/JSONB fields), start with `batch_size: 1000-2000`.
## Adaptive batch sizing
Instead of a fixed row count, let Rivet adjust batch size based on memory:
```yaml
tuning:
batch_size_memory_mb: 64 # target ~64 MB per batch
```
Rivet samples the first batch to estimate row size, then adjusts subsequent batches. Cannot be used together with `batch_size`.
## Memory optimization tips
1. **Reduce `batch_size`** -- the single most effective knob
2. **Use `safe` profile** for wide tables on production databases
3. **Enable jemalloc** -- build with `--features jemalloc` for 20-40% lower RSS
4. **Set `memory_threshold_mb`** -- Rivet pauses fetching when RSS exceeds this
## Examples
### Minimal (use defaults)
```yaml
source:
type: postgres
url_env: DATABASE_URL
# No tuning block → balanced profile with all defaults
```
### Aggressive (read replica)
```yaml
source:
type: postgres
url_env: REPLICA_URL
tuning:
profile: fast
batch_size: 100000
throttle_ms: 0
```
### Conservative (production OLTP)
```yaml
source:
type: postgres
url_env: DATABASE_URL
tuning:
profile: safe
batch_size: 1000
throttle_ms: 1000
statement_timeout_s: 60
memory_threshold_mb: 512
```
---
## Capacity and memory planning
### Peak RSS formula
```
peak_rss ≈ batch_size × avg_row_bytes × parallel_workers
```
Add ~50–150 MB overhead for the Tokio runtime, the source connection pool, jemalloc bookkeeping, and the OS page cache on the temp file.
### Rule of thumb by table width
| Narrow (IDs, timestamps, small text) | ~100 B | 50 000–100 000 | ~50–200 MB |
| Medium (mixed text, JSON) | ~1 KB | 10 000–25 000 | ~50–250 MB |
| Wide (TEXT/JSONB payloads ≥ 10 KB avg) | ~10 KB | 500–2 000 | ~50–200 MB |
Use the `safe` profile for wide tables — it caps `batch_size` at 500 automatically.
### How `memory_threshold_mb` works
When `tuning.memory_threshold_mb` is set, Rivet samples RSS after each batch (via `mach_task_basic_info` on macOS, `/proc/self/statm` on Linux). If RSS exceeds the threshold, fetching pauses until RSS drops below 80 % of the limit. This prevents OOM on tables with highly variable row widths.
```yaml
source:
tuning:
memory_threshold_mb: 1024 # pause fetching above 1 GB RSS
```
The guard adds ~1–2 ms of overhead per batch from the RSS syscall. It is enabled by default on `balanced` (4096 MB) and `safe` (2048 MB) profiles; set `memory_threshold_mb: 0` to disable it.
### Parallelism and source capacity
Each parallel chunk worker opens its own source connection. Postgres `max_connections` is typically 100–200 for shared instances and 20–50 for read replicas. `rivet check` warns when `parallel >= max_connections`.
Safe upper bound: `parallel ≤ max_connections / 4` to leave headroom for application traffic.
### Per-export memory isolation
`--parallel-export-processes` spawns one OS process per export — each export has its own allocator and heap, so peak RSS is per-export rather than aggregate. Use this mode when running many wide-table exports at once on memory-constrained hosts.