rivet-cli 0.16.3

# `rivet init` — config scaffolding

`rivet init` connects to PostgreSQL or MySQL, introspects tables, and prints a **YAML scaffold** you can save and edit before running `rivet check` / `rivet run`.

Generated configs use `url_env: DATABASE_URL` so secrets are not embedded in the file. Set `DATABASE_URL` (or switch to `url:` / structured credentials) before running exports.

---

## Modes

### Single table

![rivet init generating a YAML for a single table](../gifs/init-scaffold.gif)

Provide `--table` (optionally `schema.table` on PostgreSQL).

```bash
export DATABASE_URL='postgresql://user:pass@localhost:5432/mydb'
rivet init --source "$DATABASE_URL" --table orders -o rivet.yaml

# Qualified name (PostgreSQL)
rivet init --source "$DATABASE_URL" --table analytics.facts -o rivet.yaml
```

```bash
export DATABASE_URL='mysql://user:pass@localhost:3306/mydb'
rivet init --source "$DATABASE_URL" --table orders -o rivet.yaml
```

Rivet emits **one** export block: `SELECT` of all columns, a suggested `mode` (`full`, `incremental`, or `chunked`) from row estimates and column types, plus `chunk_*` or `cursor_column` when applicable. Every scaffold uses **`format: parquet`** and, by default, **`meta_columns` with `exported_at: true` and `row_hash: true`** (lineage and row fingerprinting in the output — see [`exports[].meta_columns` in config](config.md)). When the heuristic picks **`chunked`**, the scaffold also includes **`chunk_checkpoint: true`** (resumable runs, `rivet run --resume`, and reconcile/repair — see [chunked mode](../modes/chunked.md)).

### Whole PostgreSQL schema

Omit `--table`. All **base tables and views** in the target schema are introspected; the file contains **one export per object**, sorted by name.

- **`--schema`** — PostgreSQL schema name (default: `public`).

```bash
export DATABASE_URL='postgresql://user:pass@localhost:5432/mydb'
rivet init --source "$DATABASE_URL" --schema public -o rivet_all_public.yaml

# Non-default schema
rivet init --source "$DATABASE_URL" --schema analytics -o rivet_analytics.yaml
```

The database itself comes from the connection URL path (`/mydb`).

### Whole MySQL database

Omit `--table`. All **base tables and views** in the database are listed from `information_schema`.

- If the URL already includes the database (`mysql://.../mydb`), that database is used.
- If the URL has **no** database path, pass **`--schema <database>`** (same flag name as for Postgres; on MySQL it selects the database name for listing).

```bash
rivet init --source 'mysql://user:pass@localhost:3306/rivet' -o rivet_mysql.yaml

# URL without database — name it explicitly
rivet init --source 'mysql://user:pass@localhost:3306/' --schema rivet -o rivet_mysql.yaml
```

---

## Heuristics (suggested `mode`)

| Condition | Suggested mode |
|-----------|----------------|
| Estimated rows ≤ 100k | `full` |
| Rows > 100k and integer PK (or first integer column) | `chunked` with `chunk_column` / `chunk_size`, **`chunk_checkpoint: true` by default**, and sometimes `parallel` |
| Rows > 100k, no suitable integer chunk key, but a timestamp column | `incremental` with `cursor_column` (`updated_at` / `created_at` preferred) |

### Chunked exports and checkpointing

When the suggested mode is **`chunked`**, the scaffold always includes **`chunk_checkpoint: true`**. That enables resumable chunk runs after crashes or transient errors (`rivet run --resume`), chunk state in `rivet state chunks`, and reconcile/repair workflows. Set it to `false` only if you intentionally do not want checkpoint state on disk.

### Meta columns (defaults)

The YAML scaffold enables **`exported_at`** and **`row_hash`** for every export. Set either to `false`, or remove the `meta_columns` block entirely, to turn them off.

Row estimates are cheap metadata (`pg_class.reltuples` on PostgreSQL, `information_schema.TABLES.TABLE_ROWS` on MySQL), not exact `COUNT(*)`.

Always run **`rivet check --config <file>`** and adjust modes, destinations, and tuning before production runs.

### DECIMAL / NUMERIC column overrides

Rivet reads `numeric_precision` and `numeric_scale` from `information_schema.columns` during introspection. When a `NUMERIC` or `DECIMAL` column has explicit precision and scale, the scaffold automatically emits a `columns:` block with the correct `decimal(p,s)` override — so exports don't fail at runtime with an "unsupported type" error:

```yaml
exports:
  - name: payments
    query: >
      SELECT id, amount, fee
      FROM payments
    mode: chunked
    chunk_column: id
    chunk_size: 100000
    chunk_checkpoint: true
    format: parquet
    columns:
      amount: decimal(18,2)
      fee: decimal(18,6)
    destination:
      type: local
      path: ./output
```

If the column is declared as plain `NUMERIC` (no precision / scale in the DDL), `rivet init` still emits **`columns:`** so exports run: it uses **`decimal(38,18)`** as a wide default (`Decimal128` in Arrow), prefixes the YAML header with a **`# NOTE:`** pointing at these lines, and adds **`# REVIEW:`** inline on each such column — plus a **`rivet: note`** line on stderr when you write **`rivet init -o <file>`**. Replace the defaults with precision/scale from your domain (or constrain the DDL) before trusting the export:

```yaml
    columns:
      price: decimal(38,18)  # REVIEW: DDL has no numeric(p,s); edit when you know the real decimal(p,s); …
```

---

## Flags (summary)

| Flag | Required | Description |
|------|----------|-------------|
| `--source` | one-of `--source*` | `postgresql://` or `mysql://` URL — **visible in shell history / `ps` output; avoid in production** |
| `--source-env` | one-of `--source*` | Name of an env var holding the URL (e.g. `DATABASE_URL`). URL never hits the command line. **Recommended.** |
| `--source-file` | one-of `--source*` | Path to a file containing just the URL on one line. Credentials stay on disk. |
| `--table` | no | Single table; omit for schema-wide / database-wide scaffold |
| `--schema` | no | **PostgreSQL:** schema to scan (default `public`). **MySQL:** database name if missing from URL or to override URL database |
| `-o` / `--output` | no | Write output to file; default is stdout |
| `--discover` | no | Emit a **JSON discovery artifact** (Epic B) instead of a YAML scaffold — see below |

### Avoiding credentials on the command line

Shell history, process listings (`ps`, `/proc/<pid>/cmdline`), and container inspect logs all capture `--source "postgresql://user:pass@host/db"` verbatim. For anything beyond local dev, use `--source-env` or `--source-file`:

```bash
# Recommended — env var resolved inside the process only.
export DATABASE_URL='postgresql://user:pass@host:5432/db'
rivet init --source-env DATABASE_URL --schema public -o cfg.yaml

# File-based — useful when the URL is managed by your secrets mount.
rivet init --source-file /run/secrets/database_url --table orders -o cfg.yaml
```

Exactly one of `--source`, `--source-env`, `--source-file` must be provided (enforced by clap's ArgGroup).

## Discovery artifact (`--discover`)

![rivet init --discover + jq: ranked cursor + chunk candidates per table](../gifs/discover-artifact.gif)

`rivet init --discover` runs the same introspection but emits a machine-readable JSON document (schema described in [`src/init/artifact.rs`](../../src/init/artifact.rs)). Intended consumers: external orchestration tools, code review, and automated config generators.

```bash
rivet init --source "$PG_URL" --schema public --discover -o discovery.json
rivet init --source "$MY_URL" --table orders   --discover    # pipes JSON to stdout
```

Per-table fields (`tables[]`):

| Field | Description |
|---|---|
| `schema`, `table`, `row_estimate` | Table identity and cheap row metadata |
| `total_bytes` | Physical size (`pg_total_relation_size`; `DATA_LENGTH + INDEX_LENGTH`) when available |
| `suggested_mode` | `full` / `incremental` / `chunked` — same heuristic as the YAML scaffold |
| `cursor_candidates[]` | Ranked list with `{column, data_type, is_nullable, is_primary_key, score, reasons[]}`. Reasons use a stable snake_case vocabulary: `name_suggests_updated`, `name_suggests_created`, `timestamp_type`, `integer_monotonic`, `primary_key`, `nullable` |
| `suggested_cursor_fallback_column` | Set when the top cursor is nullable **and** a NOT-NULL timestamp sibling exists — hint to enable [`incremental_cursor_mode: coalesce`](../modes/incremental-coalesce.md) (ADR-0007) |
| `chunk_candidates[]` | Ranked integer columns for chunked mode |
| `notes[]` | Advisory strings surfaced to operators reviewing the artifact |

The artifact is advisory — same policy as plan prioritization (ADR-0006): no runtime effect, no auto-application.

---

## Docker Compose in this repository

The repo root [`docker-compose.yaml`](../../docker-compose.yaml) defines **Postgres** and **MySQL** (`rivet` / `rivet` users, database `rivet`) with the same schema as [`dev/postgres/init.sql`](../../dev/postgres/init.sql) and [`dev/mysql/init.sql`](../../dev/mysql/init.sql).

```bash
docker compose up -d postgres mysql
export PG_URL='postgresql://rivet:rivet@localhost:5432/rivet?sslmode=disable'
export MY_URL='mysql://rivet:rivet@localhost:3306/rivet'

# One table
rivet init --source "$PG_URL" --table orders -o rivet_orders.yaml

# Whole PostgreSQL schema public
rivet init --source "$PG_URL" --schema public -o rivet_public.yaml

# Whole MySQL database from URL
rivet init --source "$MY_URL" -o rivet_mysql.yaml
```

To refresh many files at once (per-table YAMLs plus combined schema snapshots), run [`dev/scripts/regenerate_docker_init_configs.sh`](../../dev/scripts/regenerate_docker_init_configs.sh) from the repo root after the DBs are up (and optionally seeded).

---

## Limitations

- **Not** a migration or DDL tool — only read-only introspection and YAML output.
- Views are included in schema-wide / database-wide runs; ensure each view is selectable for your user.
- Suggested modes are heuristics; large or sparse tables may need manual `chunked` / `chunk_dense` / `chunk_by_days` tuning (see [chunked mode](../modes/chunked.md)).