rivet-cli 0.9.2

# Complete YAML Config Reference

Every field Rivet accepts in a config YAML, grouped by section.

---

## Root

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `source` | object | **yes** | Database connection and global tuning |
| `exports` | list | **yes** | One or more export definitions |
| `notifications` | object | no | Slack / webhook notification settings |
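
As an orientation aid, here is a minimal sketch of how the three root sections fit together. The table name, env var names, and output path are illustrative assumptions, not defaults shipped by Rivet.

```yaml
source:
  type: postgres
  url_env: DATABASE_URL            # assumed env var name
  tls:
    mode: verify-full

exports:
  - name: orders
    query: "SELECT id, amount FROM orders"   # hypothetical table
    format: parquet
    destination:
      type: local
      path: ./out

notifications:                     # optional section
  slack:
    webhook_url_env: SLACK_WEBHOOK # assumed env var name
    on: [failure]
```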

---

## `source`

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `type` | `postgres` \| `mysql` \| `mssql` | **yes** | — | Database type. `mssql` = SQL Server (URL scheme `sqlserver://`). |
| `url` | string | one of url/url_env/url_file or structured | — | Full connection URL (`postgresql://` / `mysql://` / `sqlserver://`) |
| `url_env` | string | | — | Env var name containing the URL |
| `url_file` | string | | — | Path to file containing the URL |
| `host` | string | for structured | — | Database hostname |
| `port` | integer | no | `5432` (PG) / `3306` (MySQL) / `1433` (MSSQL) | Database port |
| `user` | string | for structured | — | Database user |
| `password` | string | no | — | **Not recommended** — plaintext; see [Credentials & plan artifacts](#credentials--plan-artifacts) below |
| `password_env` | string | no | — | Env var name containing the password (recommended) |
| `database` | string | for structured | — | Database name |
| `tuning` | object | no | — | Global tuning (see [tuning.md](tuning.md)) |
| `tls` | object | no | — | Transport security (see [TLS](#tls) below). Omit → plaintext + WARN log. |

**Connection approaches** (mutually exclusive):

1. **URL-based**: provide exactly one of `url`, `url_env`, or `url_file`
2. **Structured**: provide `host`, `user`, `database` (+ optional `port`, `password`/`password_env`)
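
To illustrate the URL-based approach with a file reference, the sketch below assumes the connection URL is mounted as a secret at a hypothetical path. Adjust the path to your environment.

```yaml
source:
  type: postgres
  url_file: /run/secrets/rivet_db_url   # hypothetical path; the file holds a postgresql:// URL
  tls:
    mode: verify-full
```

Because `url_file` is set, `url`, `url_env`, and the structured fields (`host`, `user`, `database`) are left out.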

### TLS

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | `disable` \| `require` \| `verify-ca` \| `verify-full` | `verify-full` | Enforcement level (mirrors libpq `sslmode` semantics) |
| `ca_file` | string | — | PEM-encoded CA certificate for private trust stores; required for `verify-ca`/`verify-full` against custom CAs |
| `accept_invalid_certs` | boolean | `false` | Dangerous — disables certificate verification. Only honored when explicitly `true`. |
| `accept_invalid_hostnames` | boolean | `false` | Dangerous — disables hostname (SAN/CN) verification. Only honored when explicitly `true`. |

Example (production):

```yaml
source:
  type: postgres
  url_env: DATABASE_URL
  tls:
    mode: verify-full
    ca_file: /etc/ssl/certs/rds-ca-2019-root.pem
```

Example (local dev only — no TLS):

```yaml
source:
  type: mysql
  host: 127.0.0.1
  port: 3306
  user: dev
  password_env: DEV_PWD
  database: rivet
  tls: { mode: disable }       # explicit opt-out — silences the plaintext WARN
```

Example (SQL Server — `sqlserver://` scheme, port 1433):

```yaml
source:
  type: mssql
  url_env: MSSQL_URL           # sqlserver://user:pass@host:1433/database
  tls:
    ca_file: /etc/ssl/certs/your-sql-server-ca.pem   # private CA, or:
    # accept_invalid_certs: true                      # self-signed dev cert
```

SQL Server always encrypts the login handshake, so TLS is on regardless; the
`tls:` block only controls how the server certificate is trusted. Supported
export modes and types are listed in [compatibility.md](compatibility.md#sql-server-mssql--current-scope).

When `tls:` is omitted entirely, Rivet connects without TLS and emits a WARN so you notice. See [compatibility.md](compatibility.md) for which servers ship with TLS enabled and for Rivet's dev-environment defaults.

### Credentials & plan artifacts

A `PlanArtifact` (produced by `rivet plan`) is designed to be committed / reviewed; it **must not** carry plaintext credentials. Rivet enforces ADR-0005 **PA9** (`SourceConfig::redact_for_artifact`):

- `password:` field → always stripped from the artifact (set to `None`).
- `url:` containing `scheme://user:pass@…` → userinfo rewritten to `REDACTED`.
- `password_env` / `url_env` / `url_file` → preserved as **references** so apply-time can re-resolve against the apply-environment.

When redaction runs, `rivet plan` logs:

```
WARN plan 'orders': plaintext credentials stripped from artifact —
     apply time must have equivalent env/file-based auth available
```

**Recommendation**: use `password_env` (or `url_env`) everywhere; only use plaintext `password:` for one-off local scripts. See [ADR-0005 PA9](../adr/0005-plan-apply-contracts.md#pa9--artifact-credential-redaction-acr).
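
A sketch of a structured source that follows this recommendation. The host, user, database, and env var names are assumptions chosen for illustration.

```yaml
source:
  type: postgres
  host: db.internal                 # hypothetical host
  user: exporter                    # hypothetical user
  password_env: RIVET_DB_PASSWORD   # assumed env var name; kept in the artifact as a reference
  database: shop                    # hypothetical database
```

Since only the env var name is stored, the same variable has to be set in the environment where the plan is applied.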

---

## `exports[]`

Each entry in the `exports` list defines one export job.

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | **yes** | — | Unique identifier for this export |
| `query` | string | one of query/query_file | — | SQL SELECT query |
| `query_file` | string | | — | Path to `.sql` file (relative to config dir) |
| `mode` | `full` \| `incremental` \| `chunked` \| `time_window` | no | `full` | Export mode |
| `format` | `parquet` \| `csv` | **yes** | — | Output format |
| `compression` | `zstd` \| `snappy` \| `gzip` \| `lz4` \| `none` | no | `zstd` | Compression codec (low-level; prefer `compression_profile`) |
| `compression_level` | integer | no | codec default | Compression level (low-level; prefer `compression_profile`) |
| `compression_profile` | `none` \| `fast` \| `balanced` \| `compact` | no | — | High-level preset — overrides `compression` and `compression_level`. See [Compression profiles](#compression-profiles) below. |
| `destination` | object | **yes** | — | Where to write output (see below) |
| `verify` | `size` \| `content` | no | `size` | Integrity depth required of `--validate`. `content` checks every part's MD5 against the store's listing (no download) and **fails** validation for any part only size-verified — e.g. a part too large to upload as a single PUT (lower `max_file_size` so it fits) or a backend that exposes no checksum (local FS, streamed multipart). See [Verification depth](#verification-depth) below. |
| `skip_empty` | boolean | no | `false` | Skip file creation if 0 rows |
| `max_file_size` | string | no | — | Split output: `"256MB"`, `"1GB"`, etc. |
| `meta_columns` | object | no | — | Extra columns added to output |
| `quality` | object | no | — | Data quality checks |
| `tuning` | object | no | — | Per-export tuning overrides |
| `source_group` | string | no | — | Logical group for shared source capacity (replica, host). Drives campaign-level warnings in `rivet plan` (advisory only — ADR-0006) |
| `reconcile_required` | boolean | no | `false` | Advisory hint: treat this export as reconcile-sensitive in planning, independent of the `--reconcile` CLI flag (ADR-0006, Epic C) |
| `columns` | map | no | — | Per-column type overrides (see below) |
| `on_schema_drift` | `warn`\|`continue`\|`fail` | no | `warn` | Policy when structural schema drift is detected (see below) |
| `shape_drift_warn_factor` | float | no | `2.0` | Warn when a string/binary column's max byte length grows beyond `N × stored_max`. Set to `0` to disable shape tracking. |
| `parquet` | object | no | — | Parquet row group tuning (Parquet format only). See [Parquet row group tuning](#parquet-row-group-tuning) below. |

### Compression profiles

`compression_profile` is the recommended way to pick a codec. It maps to a `(codec, level)` pair and takes precedence over any `compression` / `compression_level` fields.

| Profile | Codec | Level | Best for |
|---|---|---|---|
| `none` | no compression | — | Debug, local scratch, fast iteration |
| `fast` | snappy | — | Backfills, pilot runs, low-CPU environments |
| `balanced` | zstd | 3 | **Default for production** — good ratio, moderate CPU |
| `compact` | zstd | 9 | Storage- or network-cost-sensitive pipelines |

```yaml
exports:
  - name: events
    format: parquet
    compression_profile: balanced    # zstd level 3
    destination: { type: local, path: ./out }
```

If you need a specific codec that is not covered by the presets, use `compression` + `compression_level` directly and omit `compression_profile`.
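
For example, `gzip` has no matching preset, so a config that needs it could look like the sketch below. The export name and query are hypothetical, and the level shown is assumed to be one the codec accepts.

```yaml
exports:
  - name: events_archive           # hypothetical export name
    query: "SELECT * FROM events"  # hypothetical table
    format: parquet
    compression: gzip              # codec not covered by any profile
    compression_level: 6           # assumed valid for gzip
    destination: { type: local, path: ./out }
```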

---

### Verification depth

`verify` controls how thoroughly `--validate` (and `rivet validate`) checks each
part at the destination:

- **`size`** (default) — confirm each part exists at its recorded `size_bytes`,
  plus manifest self-consistency and `_SUCCESS`. Content is also MD5-checked for
  free whenever the store surfaces a checksum in its listing, but a part without
  one is accepted as size-only.
- **`content`** — require every part's content MD5 to match the store's listing
  checksum (no download). Any part that could only be size-verified **fails**
  validation with an actionable message.

How content verification works: Rivet computes each part's MD5 before upload and
records it in the manifest; the store computes its own (GCS `md5Hash`, S3/Azure
single-PUT ETag) and returns it in object listings. `--validate` compares the two
with **no download**. Parts that upload as a single PUT get a checksum; parts
large enough to stream as multipart / block-list do not — so under
`verify: content`, lower `max_file_size` until parts fit a single PUT (≤ ~64 MiB
by default), or use a backend that exposes a checksum (local FS never does).

The run report and `rivet validate` show coverage explicitly, e.g.
`3 verified (2 md5, 1 size-only)`.
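
A sketch of an export configured for content-level verification. The table and bucket names are hypothetical, and the `48MB` value is an assumption picked to stay under the roughly 64 MiB single-PUT threshold mentioned above.

```yaml
exports:
  - name: orders
    query: "SELECT * FROM orders"  # hypothetical table
    format: parquet
    verify: content                # every part must be MD5-verified
    max_file_size: "48MB"          # assumed small enough for a single PUT
    destination:
      type: s3
      bucket: my-data
      prefix: exports/orders/
      region: us-east-1
```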

---

### Parquet row group tuning

Parquet row groups affect memory usage during write, compression ratio, and downstream query performance (predicate pushdown, column skipping). When `parquet:` is omitted, Rivet uses the library default of 1,048,576 rows per group, which works well for narrow tables but can produce oversized groups for wide ones.

```yaml
exports:
  - name: events
    format: parquet
    parquet:
      row_group_strategy: auto          # auto | fixed_rows | fixed_memory
      target_row_group_mb: 128          # target Arrow buffer size per group (auto + fixed_memory)
      max_row_group_mb: 256             # optional upper bound (all strategies)
```

| Field | Type | Default | Description |
|---|---|---|---|
| `row_group_strategy` | `auto` \| `fixed_rows` \| `fixed_memory` | `auto` | How to determine row group size |
| `row_group_rows` | integer | — | Exact rows per group; used with `fixed_rows` only |
| `target_row_group_mb` | integer | `128` | Target Arrow buffer per group in MB; used with `auto` and `fixed_memory` |
| `max_row_group_mb` | integer | — | Hard upper bound on group memory in MB (all strategies) |

| Strategy | Behavior |
|---|---|
| `auto` | Estimates row width from schema column types, computes rows-per-group to hit `target_row_group_mb`. Narrow tables get large groups; wide tables get smaller groups. |
| `fixed_rows` | Use `row_group_rows` exactly. Simple and deterministic, but does not adapt to row width. |
| `fixed_memory` | Same math as `auto` (target / estimated row bytes), but the strategy name is explicit in logs. |

**Examples:**

```yaml
# Auto-tune for a wide JSON table — groups sized to ~64 MB
parquet:
  row_group_strategy: auto
  target_row_group_mb: 64
  max_row_group_mb: 128
---
# Fixed row count: useful when downstream tooling requires exact group sizes
parquet:
  row_group_strategy: fixed_rows
  row_group_rows: 500000
```
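
The `fixed_memory` strategy can be sketched the same way. The row-width figure in the comment is an assumed estimate used only to show the arithmetic (target / estimated row bytes), and it treats 1 MB as 1,048,576 bytes.

```yaml
parquet:
  row_group_strategy: fixed_memory
  target_row_group_mb: 128         # at ~1 KiB per row: 128 MB / 1 KiB ≈ 131,072 rows per group
  max_row_group_mb: 256            # optional upper bound
```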

> **Note:** `rivet plan` shows the selected strategy and target in the Format section when `parquet:` is configured.
>
> **`rivet init` auto-generates this block** for chunked exports and large full-mode tables, pre-selecting `target_row_group_mb: 64` for wide schemas (≥ 5 text/JSON/bytea columns) and `128` for narrow ones.

---

### `exports[].on_schema_drift` — schema drift policy

Controls what Rivet does when it detects a structural change in the output schema (column added, removed, or retyped) compared to the snapshot stored from the previous run.

| Value | Behavior |
|---|---|
| `warn` | **(default)** Log a warning, store the new schema fingerprint, and continue the run. |
| `continue` | Silently accept — store the new schema, no log output. |
| `fail` | Abort the run with exit code 1. The schema store is **not** updated, so the next run will detect the same change again. |

`fail` is useful in CI pipelines where schema changes must be reviewed before the new shape is exported downstream.

```yaml
exports:
  - name: orders
    on_schema_drift: fail
```

When `fail` triggers, the output file has already been written to the destination (schema check happens post-extraction), but no cursor advance or manifest commit occurs. Re-run after confirming the schema change is intentional, or switch to `warn` to accept it.
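
Structural drift and shape drift are configured side by side on the export. The sketch below uses an assumed factor of `3.0`; the byte figures in the comment are illustrative, not measured.

```yaml
exports:
  - name: orders
    on_schema_drift: warn          # structural change: log, store new schema, continue
    shape_drift_warn_factor: 3.0   # e.g. stored max of 100 bytes → warn once a value exceeds 300 bytes
```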

---

### `exports[].columns` — per-column type overrides

Override the Arrow type Rivet infers for a specific column. Useful when:
- a `NUMERIC` / `DECIMAL` column has no explicit precision/scale in the source schema and `rivet init`'s default `decimal(38,18)` placeholder is not what you want, or
- you need a narrower precision for BigQuery NUMERIC compatibility.

```yaml
columns:
  <column_name>: decimal(<precision>,<scale>)
```

`decimal(p,s)` is the only supported override today. Both precision and scale are required.

Example:

```yaml
exports:
  - name: orders
    query: "SELECT id, amount, fee FROM orders"
    format: parquet
    destination:
      type: local
      path: ./out
    columns:
      amount: decimal(18,2)
      fee: decimal(18,6)
```

**`rivet init` generates these automatically.** When introspecting a table, `rivet init` reads `numeric_precision` and `numeric_scale` from `information_schema.columns`. If both are present, it emits a concrete override (`decimal(p,s)`). If the column is unbounded (`NUMERIC` without explicit precision), `rivet init` emits a **working default** `decimal(38,18)` plus a **`# REVIEW:`** YAML comment. The config header also gets a **`# NOTE:`** line, and `rivet init -o …` prints a stderr reminder to tighten the precision once you know the real domain rules:

```yaml
    columns:
      price: decimal(38,18)  # REVIEW: DDL has no numeric(p,s); edit to the real decimal(p,s) …
```

Type overrides are applied at export time and are reflected in `rivet check --type-report` output.

---

### Unsupported PostgreSQL types

Some PostgreSQL types have no Arrow representation and cannot be exported directly. Rivet will report an error listing all unmappable columns before the run starts.

| PostgreSQL type | Reason | Workaround |
|---|---|---|
| `geometry` (PostGIS) | No Arrow equivalent | Cast to text: `ST_AsText(col) AS col` in your query |
| `geography` (PostGIS) | No Arrow equivalent | Cast to text: `ST_AsText(col) AS col` |
| `hstore` | No Arrow equivalent | Cast to JSON text: `hstore_to_json(col)::text AS col` |
| `tsvector`, `tsquery` | No Arrow equivalent | Cast to text: `col::text AS col` |
| `point`, `line`, `polygon`, etc. | No Arrow equivalent | Cast to text: `col::text AS col` |

Use a SQL expression in your `query` field to work around any unsupported type:

```yaml
exports:
  - name: locations
    query: >
      SELECT id, name, ST_AsText(geom) AS geom_wkt
      FROM locations
    format: parquet
    destination:
      type: local
      path: ./out
```

Rivet exports the WKT text as a `Utf8` (string) column. Downstream tools (DuckDB, GeoPandas, QGIS) can reconstruct geometry from WKT.
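
The same pattern applies to the other rows in the table. A sketch with `hstore` and `tsvector` columns follows; the table and column names are hypothetical.

```yaml
exports:
  - name: products                 # hypothetical table and columns
    query: >
      SELECT id,
             hstore_to_json(attrs)::text AS attrs_json,
             search_vec::text AS search_vec_text
      FROM products
    format: parquet
    destination:
      type: local
      path: ./out
```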

---

### Mode-specific fields

**Incremental** (`mode: incremental`):

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `cursor_column` | string | **yes** | — | Primary progression column (should be monotonically increasing) |
| `cursor_fallback_column` | string | when `coalesce` | — | Fallback column used when primary is `NULL`. Only valid with `incremental_cursor_mode: coalesce` |
| `incremental_cursor_mode` | `single_column` \| `coalesce` | no | `single_column` | `coalesce` progresses on `COALESCE(primary, fallback)`. See [modes/incremental-coalesce.md](../modes/incremental-coalesce.md) and [ADR-0007](../adr/0007-cursor-policy-contracts.md). |
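
A sketch of an incremental export using a coalesced cursor. The table and column names are assumptions for illustration.

```yaml
exports:
  - name: orders_incremental
    query: "SELECT * FROM orders"       # hypothetical table
    mode: incremental
    cursor_column: updated_at           # primary cursor
    cursor_fallback_column: created_at  # used when updated_at is NULL
    incremental_cursor_mode: coalesce   # progresses on COALESCE(updated_at, created_at)
    format: parquet
    destination: { type: local, path: ./out }
```

With the default `single_column` mode, `cursor_fallback_column` would be omitted.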

**Chunked** (`mode: chunked`):

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `chunk_column` | string | yes* | — | Numeric or date/timestamp column to partition by. *Required unless `chunk_by_key` is set (mutually exclusive). |
| `chunk_by_key` | string | yes* | — | Single index-backed UNIQUE NOT NULL column for **keyset (seek)** pagination — the source-safe shape for tables with no single-integer PK (UUID / string / composite). Requires the `table:` shortcut; mutually exclusive with `chunk_column`. See [chunked modes](../modes/chunked.md) and [ADR-0020](../adr/0020-pg-uuid-pk-chunking-asymmetry.md). |
| `chunk_size` | integer | no | `100000` | Rows per chunk (numeric mode), or page size for keyset. Ignored when `chunk_count` is set. |
| `chunk_size_memory_mb` | integer | no | — | Target memory budget per chunk in MB; `chunk_size` is derived from a `pg_class` row-size estimate (`pg_relation_size / reltuples`), clamped to `[10000, 5000000]` rows. **PostgreSQL only**, requires the `table:` shortcut, mutually exclusive with an explicit non-default `chunk_size:`. |
| `chunk_count` | integer | no | — | Divide the column range into exactly this many equal chunks. `chunk_size` is computed dynamically from `min`/`max`. Must be ≥ 1. Mutually exclusive with `chunk_dense` and `chunk_by_days`. |
| `chunk_by_days` | integer | no | — | Enable date chunking: window size in days. Mutually exclusive with `chunk_dense` and `chunk_count`. |
| `parallel` | integer | no | `1` | Concurrent chunk workers |
| `chunk_dense` | boolean | no | `false` | Use `ROW_NUMBER()` for sparse numeric IDs. Mutually exclusive with `chunk_by_days` and `chunk_count`. |
| `chunk_checkpoint` | boolean | no | `false` | Persist per-chunk progress for resume |
| `chunk_max_attempts` | integer | no | — | Max retry attempts per chunk |
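
The sketch below shows three ways the chunking fields might be combined while respecting the mutual-exclusion rules in the table. Table names, column names, and the numeric values are illustrative assumptions.

```yaml
exports:
  # Numeric range chunking with parallel workers and resumable progress
  - name: events_by_id
    query: "SELECT * FROM events"
    mode: chunked
    chunk_column: id
    chunk_size: 50000
    parallel: 4
    chunk_checkpoint: true
    chunk_max_attempts: 3
    format: parquet
    destination: { type: local, path: ./out/events_by_id }

  # Fixed number of chunks: the id range is split into 20 equal parts
  - name: events_20_chunks
    query: "SELECT * FROM events"
    mode: chunked
    chunk_column: id
    chunk_count: 20                # chunk_size is ignored when this is set
    format: parquet
    destination: { type: local, path: ./out/events_20_chunks }

  # Date chunking: one chunk per 7-day window
  - name: events_by_week
    query: "SELECT * FROM events"
    mode: chunked
    chunk_column: created_at
    chunk_by_days: 7
    format: parquet
    destination: { type: local, path: ./out/events_by_week }
```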

**Time-window** (`mode: time_window`):

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `time_column` | string | **yes** | — | Timestamp column to filter on |
| `time_column_type` | `timestamp` \| `unix` | no | `timestamp` | How `time_column` is stored: a SQL timestamp or a Unix epoch value |
| `days_window` | integer | **yes** | — | Rolling window size in days |
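
A sketch of a rolling seven-day export. The table and column names are hypothetical.

```yaml
exports:
  - name: recent_events
    query: "SELECT * FROM events"  # hypothetical table
    mode: time_window
    time_column: created_at
    time_column_type: timestamp    # default, shown for clarity
    days_window: 7                 # rolling 7-day window
    format: parquet
    destination: { type: local, path: ./out }
```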

---

## `exports[]` — value-based partitioning

Orthogonal to `mode`: splits rows into Hive-style `col=value/` destination
sub-folders by a date column. See [partitioning.md](../partitioning.md).

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `partition_by` | string | no | — | Date/timestamp column to bucket rows by. Requires a `{partition}` token in `destination.path`/`prefix`. NULLs → `col=__HIVE_DEFAULT_PARTITION__/`. Not compatible with `mode: time_window`. |
| `partition_granularity` | `day` \| `month` \| `year` | no | `day` | Bucket width. |
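
A sketch of monthly partitioning on an S3 destination. The table, column, and bucket names are assumptions; the `{partition}` token is the one the `partition_by` row requires.

```yaml
exports:
  - name: orders
    query: "SELECT * FROM orders"  # hypothetical table
    format: parquet
    partition_by: created_at
    partition_granularity: month
    destination:
      type: s3
      bucket: my-data
      prefix: exports/orders/{partition}/   # token required when partition_by is set
      region: us-east-1
```

Rows then land in Hive-style `col=value/` sub-folders under the prefix, with NULLs going to `created_at=__HIVE_DEFAULT_PARTITION__/`.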

---

## `exports[].meta_columns`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `exported_at` | boolean | `false` | Add `_rivet_exported_at` column (Timestamp UTC; one value per batch) |
| `row_hash` | boolean | `false` | Add `_rivet_row_hash` column — lower 64 bits of `xxHash3-128` over the row, written as `Int64` for fast `PARTITION BY` / `JOIN`. Deterministic across runs; distinguishes NULL from empty string. |
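
A sketch enabling both meta columns on one export. The table name is hypothetical.

```yaml
exports:
  - name: orders
    query: "SELECT * FROM orders"  # hypothetical table
    format: parquet
    meta_columns:
      exported_at: true            # adds _rivet_exported_at
      row_hash: true               # adds _rivet_row_hash (Int64)
    destination: { type: local, path: ./out }
```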

---

## `exports[].quality`

| Field | Type | Description |
|-------|------|-------------|
| `row_count_min` | integer | Fail if fewer than this many rows are exported |
| `row_count_max` | integer | Fail if more than this many rows are exported |
| `null_ratio_max` | map (column → float) | Fail if a column's null ratio exceeds its threshold |
| `unique_columns` | list of strings | Fail if values in any listed column are not unique |
| `unique_max_entries` | integer | Cap on distinct values tracked per column during uniqueness checks. When reached, a `Warn` is emitted and checking stops for that column. Prevents unbounded memory growth on high-cardinality columns (UUIDs, email addresses, event IDs). |

Uniqueness tracking uses typed xxHash3-64 internally — numeric and binary columns are hashed directly from raw bytes without string formatting. `unique_max_entries` is the primary knob to control memory on very large tables.

Example:

```yaml
quality:
  row_count_min: 100
  null_ratio_max:
    email: 0.05          # fail if more than 5% of email values are NULL
  unique_columns:
    - id
    - email
  unique_max_entries: 1000000   # stop after 1M unique values; warn if limit hit
```

**Without `unique_max_entries`** — tracking is unbounded. Safe for tables with hundreds of thousands of rows; may use significant RAM on tables with tens or hundreds of millions of distinct values.

**With `unique_max_entries`** — tracking stops at the limit. The export succeeds but the run summary shows a warning. Use when you want a best-effort uniqueness check without memory risk.

---

## `exports[].destination`

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | `local` \| `s3` \| `gcs` \| `stdout` | **yes** | Destination type |

### Local

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `path` | string | **yes** | Output directory |

### S3

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `bucket` | string | **yes** | S3 bucket name |
| `prefix` | string | no | Key prefix |
| `region` | string | **yes** | AWS region |
| `endpoint` | string | no | Custom S3 endpoint (MinIO, R2) |
| `access_key_env` | string | no | Env var for access key |
| `secret_key_env` | string | no | Env var for secret key |
| `aws_profile` | string | no | AWS credentials profile name |
| `allow_anonymous` | boolean | no | Skip authentication |
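
A sketch of an S3-compatible destination pointing at a local MinIO instance. The bucket name, endpoint address, and env var names are assumptions for illustration.

```yaml
destination:
  type: s3
  bucket: rivet-dev                  # hypothetical bucket
  prefix: exports/
  region: us-east-1
  endpoint: http://localhost:9000    # assumed local MinIO address
  access_key_env: MINIO_ACCESS_KEY   # assumed env var names
  secret_key_env: MINIO_SECRET_KEY
```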

### GCS

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `bucket` | string | **yes** | GCS bucket name |
| `prefix` | string | no | Object prefix |
| `credentials_file` | string | no | Path to service account JSON (otherwise ADC / `GOOGLE_APPLICATION_CREDENTIALS`) |
| `endpoint` | string | no | Custom GCS endpoint (fake-gcs-server, test doubles) |
| `allow_anonymous` | boolean | no | Skip authentication (public bucket / emulator) |
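
A sketch of a GCS destination with an explicit service account file. The bucket name and file path are hypothetical; omit `credentials_file` to fall back to ADC / `GOOGLE_APPLICATION_CREDENTIALS`.

```yaml
destination:
  type: gcs
  bucket: my-data                    # hypothetical bucket
  prefix: exports/
  credentials_file: /etc/rivet/gcs-service-account.json   # hypothetical path
```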

### Stdout

No additional fields. Only `type: stdout` is needed.

### Path and prefix placeholders

The `path` (local) and `prefix` (S3 / GCS) fields support template placeholders, substituted at plan-build time:

| Placeholder | Value |
|---|---|
| `{date}` | UTC date as `YYYY-MM-DD` |
| `{export}` | Export name from config |
| `{table}` | Alias for `{export}` |

```yaml
destination:
  type: s3
  bucket: my-data
  prefix: exports/{date}/{export}/
  region: us-east-1
```

With an export named `orders` running on 2026-05-14, this resolves to `exports/2026-05-14/orders/`.

---

## `notifications`

| Field | Type | Description |
|-------|------|-------------|
| `slack` | object | Slack notification config |

### `notifications.slack`

| Field | Type | Description |
|-------|------|-------------|
| `webhook_url` | string | Slack incoming webhook URL |
| `webhook_url_env` | string | Env var containing webhook URL |
| `on` | list | Events to notify on: `failure`, `schema_change`, `degraded` |

Example:

```yaml
notifications:
  slack:
    webhook_url_env: SLACK_WEBHOOK
    on: [failure, schema_change]
```

---

## Environment variable interpolation

Any string value can reference environment variables using `${VAR}` syntax:

```yaml
source:
  url: "postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:5432/mydb"
```

## Query parameters

Queries can use `${key}` placeholders filled by `--param key=value`:

```yaml
exports:
  - name: filtered
    query: "SELECT * FROM orders WHERE region = '${region}'"
```

```bash
rivet run --config export.yaml --param region=us-east
```