rivet-cli 0.16.3

Rivet: PostgreSQL/MySQL/SQL Server → Parquet/CSV (local, S3, GCS, Azure). Crate name rivet-cli; binary rivet.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
# Rivet type mapping contract

This document describes how Rivet maps source column types to logical
[`RivetType`](../src/types/rivet_type.rs) values and Arrow/Parquet/CSV
representations. It is aligned with the automated suite in
[`tests/type_roundtrip/`](../tests/type_roundtrip/) and
[`tests/live_type_golden.rs`](../tests/live_type_golden.rs).

## The short version

Wondering "will my decimals / UUIDs / JSON / timestamps silently break on the way
out?" — the short answer is no:

- **Decimals never become floats**`DECIMAL`/`NUMERIC` export as exact Arrow
  `Decimal128`/`Decimal256` when precision and scale are known.
- **UUID and JSON keep native types** — Parquet gets native `LogicalType::Uuid` /
  `LogicalType::Json`, not opaque strings.
- **Timestamps preserve the instant** — the point in time round-trips (Parquet
  keeps the zone; CSV deliberately emits the instant without an offset).
- **Rivet fails loud, not silent** — a lossy or unmapped type is named by
  `rivet check` and aborts the run, never quietly truncated.

The per-engine tables below are the precise contract; this is just the gist.

## Guarantees (v0.7.8)

- **DECIMAL / NUMERIC** are never silently converted to float. They export as
  Arrow `Decimal128` / `Decimal256` in Parquet when precision and scale are known
  (column override, catalog hint, or PostgreSQL wire metadata).
- **Binary** (`BYTEA`, `BLOB`, `BINARY`/`VARBINARY` with charset 63) stays
  Arrow `Binary` in Parquet.
- **Float `NaN` / `±Infinity`** are preserved. Parquet stores them natively
  (IEEE-754); CSV emits the literal `NaN` / `inf` / `-inf` (and `-0` keeps its
  sign) rather than an empty cell — an empty cell would silently conflate a real
  special value with `NULL`. Note these are *float* values: `decimal` / `NUMERIC`
  has no NaN representation and a NaN/Infinity payload there is rejected at
  extract time, not coerced. Strict CSV loaders that expect `Infinity` over `inf`
  should configure their float parser accordingly; Parquet needs no such care.
- **JSON / JSONB** is valid JSON text in the file, paired with the Arrow
  `arrow.json` canonical extension type so parquet-rs emits native
  `LogicalType::Json` in the Parquet footer. The Rivet field metadata
  (`rivet.logical_type=json`) stays for Rivet-aware consumers.
- **UUID** exports as canonical 16-byte `FixedSizeBinary(16)` paired with the
  Arrow `arrow.uuid` canonical extension; parquet-rs emits native
  `LogicalType::Uuid`. Downstream Parquet readers (DuckDB, ClickHouse 25.x+,
  pyarrow, BigQuery autodetect) recover the UUID type without a cast.
- **Timestamps**: `TIMESTAMPTZ` and MySQL `TIMESTAMP` use UTC semantics
  (`timezone: Some("UTC")`); naive `TIMESTAMP` / `DATETIME` have no timezone.
- **Nullability** is preserved in Arrow schema and export.

### What `rivet init` auto-detects (and what needs an override)

`rivet init` reads the source database's catalog / wire-protocol metadata to
build the initial `rivet.yaml`. Whether a column lands as a native logical
type in the resulting Parquet depends on whether the source server
*advertises* the semantic — Rivet never guesses from column names or sample
values, by design (a wrong guess is silent corruption; an honest "I don't
know" is a config knob).

| Semantic | PostgreSQL | MySQL |
|----------|------------|-------|
| `JSON` / `JSONB` | **auto**`Type::JSON` (OID 114) and `Type::JSONB` (OID 3802) are native PG wire types. | **auto**`MYSQL_TYPE_JSON` is native since MySQL 5.7. |
| `UUID` | **auto**`Type::UUID` (OID 2950) is a native PG type. | **manual override required.** MySQL has no native UUID; they are stored in `VARCHAR(36)` or `BINARY(16)`. The catalog reports only `varchar`/`binary` — semantic UUID information is gone before the driver ever sees the column. Operators add an explicit override (see below). |
| `DECIMAL(p,s)` | **auto when declared** — PG's catalog returns `numeric_precision`/`numeric_scale` for table-qualified queries; ad-hoc `numeric` expressions need an override. | **manual override required** for ad-hoc queries — the mysql wire protocol does not expose precision/scale on the column descriptor. Required for any column the user wants exact-decimal for. |

Adding overrides looks like:

```yaml
exports:
  - name: users
    query: "SELECT id, uid, amount FROM users"
    columns:
      uid: uuid              # MySQL VARCHAR(36) → UUID semantic
      amount: decimal(18,2)  # also valid for MySQL DECIMAL without catalog
      event_ts: timestamp_ns # keep SQL Server datetime2(7)'s 100 ns tick (see gap 4)
```

Supported override types: `bool`, `int2/int4/int8`, `float4/float8`,
`decimal(p,s)`, `date`, `timestamp`, `timestamp_ns`, `timestamp_tz`,
`timestamp_tz_ns`, `text`, `binary`, `json`, `uuid`. The `_ns` timestamp
variants preserve sub-microsecond precision (range 1677–2262; see gap 4) — the
plain `timestamp` is microsecond with full date range.

The override path is safe: a `uid: uuid` declaration tries to parse each
cell — either 16 raw bytes (BINARY(16) layout) or a 36-char canonical text
form. **Parse failures emit `NULL`, never silent garbage.** Same convention
as `decimal` overrides whose values do not parse cleanly.

`rivet init` does **not** apply heuristics ("column is 36 chars wide and
named `uid_*` → probably UUID") — guessing risks misclassifying a non-UUID
varchar and silently producing wrong-shape Parquet. Operators who want UUID
semantics on a MySQL column add the override explicitly.

## Test commands

```bash
make test-types              # offline mapping contracts (PR-fast)
make test-types-live         # full matrix; requires docker compose
make test-types-validators   # PG/MySQL/SQL Server → Parquet → {DuckDB, ClickHouse} round-trip
```

`test-types-validators` re-runs the same canonical PG / MySQL / SQL Server type matrix used
by `test-types-live`, but writes the Parquet into the shared bind-mount under
`tests/.live-tmp/` and feeds it through **three independent readers**: DuckDB,
ClickHouse (both from `docker-compose.yaml`), and pyarrow (installed alongside
duckdb in the same container). Each reader catches what the others cannot:

| Reader | What it pins |
|--------|--------------|
| DuckDB | Autoload physical types (`DECIMAL`, `TIMESTAMPTZ`, `INTEGER[]`, `BLOB`, …), decimal sums, JSON validity, UUID parseability, byte-exact BLOB, list lengths (including empty), null-bitmap propagation |
| ClickHouse | Independent confirmation of the above through a second decoder; native `UInt64` round-trip for `BIGINT UNSIGNED`; tz-aware timestamps as `DateTime64(6, 'UTC')` |
| pyarrow | Arrow field metadata (`rivet.*` keys) reaches the Parquet footer; row-group statistics (`min`/`max`/`null_count`) are correct; Decimal256 (precision > 38) round-trips exactly where DuckDB downgrades to DOUBLE |

A fourth reader — **BigQuery** — is exercised by a separate harness
(`docs/bench/harness/type_bench_bq.sh`) that requires a Google Cloud
project + `bq` CLI auth. It loads each Rivet-produced Parquet via
`bq load --autodetect --source_format=PARQUET`, captures the
BigQuery-inferred schema, and asserts decimal sums round-trip through
`bq query`. See [`docs/bench/reports/REPORT_types_bigquery.md`](bench/reports/REPORT_types_bigquery.md)
for the per-column table and BigQuery-specific findings (notably:
`LogicalType::Json` does **not** autoload as native BQ `JSON` — it
falls back to `BYTES`/`STRING`; values are valid JSON but operators
need an explicit `--schema='attrs:JSON,...'` to query the structure).

In addition, two structural-only tests pin the Parquet layout itself —
[`parquet_schema.rs`](../tests/type_roundtrip/parquet_schema.rs) for
physical + logical types per column, and
[`parquet_metadata.rs`](../tests/type_roundtrip/parquet_metadata.rs) for the
`rivet.native_type` / `rivet.fidelity` / `rivet.logical_type` field metadata.

Coverage extensions live in dedicated files:
[`compression_matrix.rs`](../tests/type_roundtrip/compression_matrix.rs)
re-runs the export under `zstd` / `snappy` / `gzip` / `none` and asserts
value parity across all four codecs;
[`csv_load.rs`](../tests/type_roundtrip/csv_load.rs) replays the matrix
through DuckDB `read_csv_auto`;
[`pg_edge_cases.rs`](../tests/type_roundtrip/pg_edge_cases.rs) covers
decimal precision boundaries (38, 39 / Decimal128↔Decimal256), tz timestamps
pre-epoch and far-future, JSON deep nesting + unicode keys + i64 edges,
arrays with NULL elements, large single-cell strings.

### v0.7.8 breaking change: UUID export layout

UUID columns (PG native `uuid` type and MySQL columns explicitly overridden
as `columns: { col: uuid }`) now export as `FixedSizeBinary(16)` instead of
hyphenated `Utf8`. This is what lets parquet-rs emit native
`LogicalType::Uuid` so downstream readers (DuckDB, ClickHouse 25.x+,
pyarrow, BigQuery autodetect) recover the UUID type without a cast.

What this changes:

- **Parquet schema**: `BYTE_ARRAY + LogicalType::String``FIXED_LEN_BYTE_ARRAY + LogicalType::Uuid`.
- **On-disk bytes**: 36-char canonical ASCII (`a0eebc99-9c0b-…`) → 16 raw bytes (the same UUID, compact encoding).
- **CSV**: unchanged — the CSV writer still emits hyphenated lowercase text.
- **DuckDB autoload**: `VARCHAR``UUID` native type. A view like
  `SELECT uid::VARCHAR FROM read_parquet(...)` keeps working because
  DuckDB's `UUID::VARCHAR` cast yields the canonical form.
- **ClickHouse 24.8 autoload**: `String``FixedString(16)` (the bytes,
  not the text). To get back the canonical string, use
  `lower(hex(uid))` and reformat, or upgrade to ClickHouse 25.x which
  decodes `LogicalType::Uuid` directly.

Consumers reading the old Utf8-shaped Parquet files keep working — only
files produced by Rivet ≥ v0.7.8 carry the new layout.

### Findings & fixes from triangulating against external readers

Driving the matrix through DuckDB + ClickHouse + pyarrow exposed three real
defects in the PG / MySQL drivers; all have been fixed in v0.7.8:

* **PG arrays with NULL elements were silently lost.** The driver decoded
  `ARRAY[1, NULL, 3]` via `try_get::<Vec<i32>>`, which errors on a NULL
  element; the error was swallowed and a whole-row NULL was written. Fixed
  by deserializing as `Vec<Option<T>>` and pushing nulls through
  `ListBuilder::append_null` ([`src/source/postgres/arrow_convert.rs`]../src/source/postgres/arrow_convert.rs).
* **MySQL ENUM / SET were misclassified as `String`.** They arrive on the
  wire as `MYSQL_TYPE_STRING` / `MYSQL_TYPE_VAR_STRING` with the
  `ENUM_FLAG` / `SET_FLAG` set, not as `MYSQL_TYPE_ENUM`. The mapper now
  checks the flag and emits `RivetType::Enum` so the
  `rivet.logical_type=enum` Parquet metadata is preserved.
* **MySQL `native_type` lost precision.** `tinyint unsigned`, `tinyint(1)`,
  `bit(1)`, `char` vs `varchar`, `binary` vs `varbinary` all collapsed to a
  single label. The mapper now distinguishes them via `column_type()` +
  `flags()` + `character_set`.

### Known limitations (pinned by `*currently_fails*` tests)

* PG `numeric(p, -s)` (negative scale) cannot be written to Parquet — the
  spec requires non-negative DECIMAL scale. Test
  `pg_edge_decimal_negative_scale_currently_fails_at_parquet_write`
  pins the failure with a friendly error message so any future workaround
  must update the test deliberately.
* DuckDB's `DECIMAL` is capped at precision 38 (HUGEINT-backed). For
  Parquet files with `precision > 38` DuckDB silently returns DOUBLE.
  Our file still carries `LogicalType::Decimal(p, s)` correctly — pyarrow
  decodes it as `Decimal256`. Verified in
  `pg_edge_decimal_boundaries_round_trip`.

## PostgreSQL

| Source type | Rivet logical | Parquet (Arrow) | CSV | Notes | Tested |
|-------------|---------------|-----------------|-----|-------|--------|
| `smallint` | `int16` | `Int16` | integer text | | golden |
| `integer` | `int32` | `Int32` | integer text | | golden |
| `bigint` | `int64` | `Int64` | integer text | | golden |
| `numeric(p,s)` | `decimal(p,s)` | `DECIMAL(p,s)` | exact decimal text | override if unbounded in query | contract + live matrix |
| `real` | `float32` | `Float32` | float text | | golden |
| `double precision` | `float64` | `Float64` | float text | | golden |
| `date` | `date` | `Date32` | ISO date | | golden |
| `time` | `time(microsecond)` | `Time64(µs)` | time text | | partial |
| `timestamp` | `timestamp(microsecond)` | `Timestamp(µs, None)` | datetime text | naive wall clock | live matrix |
| `timestamptz` | `timestamp_tz(µs, UTC)` | `Timestamp(µs, UTC)` | datetime text | | live matrix |
| `text` / `varchar` | `string` | `Utf8` | escaped UTF-8 | newlines/quotes escaped | live matrix |
| `bytea` | `binary` | `Binary` | hex or base64 per writer | | live matrix |
| `json` / `jsonb` | `json` | `Utf8` + Parquet `LogicalType::Json` (via `arrow.json` extension) | JSON string | | live matrix |
| `uuid` | `uuid` | `FixedSizeBinary(16)` + Parquet `LogicalType::Uuid` (via `arrow.uuid` extension) | canonical UUID text | downstream readers autoload as native UUID type | golden |
| `boolean` | `bool` | `Boolean` | true/false | | type_roundtrip |
| `numeric(10,2)` | `decimal(10,2)` | `DECIMAL(10,2)` | exact decimal text | second precision tier | type_roundtrip |
| `char` / `bpchar` | `string` | `Utf8` | escaped UTF-8 | padded char | type_roundtrip |
| `interval` | `interval` | `Utf8` (ISO 8601) | duration text | not Parquet Interval type | type_roundtrip |
| `enum` | `enum` | `Utf8` + logical=enum | label text | custom PG enum | type_roundtrip |
| `text[]` | `list<string>` | `List<Utf8>` || 1-D arrays | type_roundtrip |
| `integer[]` | `list<int32>` | `List<Int32>` || 1-D arrays | type_roundtrip |
| nullable / all-null || null bitmap preserved | empty cells | `note_nullable`, `note_all_null` | type_roundtrip |
| large `text` | `string` | `Utf8` | escaped | 2k–5k chars | type_roundtrip |

## MySQL

| Source type | Rivet logical | Parquet (Arrow) | CSV | Notes | Tested |
|-------------|---------------|-----------------|-----|-------|--------|
| `tinyint` (not width 1) | `int16` | `Int16` | integer text | widened signed | golden |
| `tinyint(1)` | `bool` | `Boolean` | 0/1 | MySQL boolean convention | golden |
| `smallint` | `int16` | `Int16` | integer text | | golden |
| `int` | `int32` | `Int32` | integer text | | golden |
| `bigint` | `int64` | `Int64` | integer text | signed | golden |
| `bigint unsigned` | `u_int64``UInt64` | `UInt64` | integer text | values &gt; i64::MAX | type_roundtrip |
| `decimal(p,s)` | `decimal(p,s)` | `DECIMAL(p,s)` | exact decimal text | requires `columns:` override | live matrix |
| `float` / `double` | `float32` / `float64` | `Float32` / `Float64` | float text | | golden |
| `date` | `date` | `Date32` | ISO date | | golden |
| `datetime` | `timestamp(µs, none)` | `Timestamp(µs, None)` | datetime text | naive | live matrix |
| `timestamp` | `timestamp_tz(µs, UTC)` | `Timestamp(µs, UTC)` | datetime text | `SET time_zone = '+00:00'` | live matrix |
| `time` | `time(µs)` | `Time64(µs)` | time text | | partial |
| `varchar` / `text` | `string` / `text` | `Utf8` | escaped UTF-8 | | live matrix |
| `json` | `json` | `Utf8` + Parquet `LogicalType::Json` (via `arrow.json` extension) | JSON string | | live matrix |
| `binary` / `varbinary` / `blob` | `binary` | `Binary` | hex in CSV | charset 63 / binary payload | type_roundtrip |
| `bit(1)` | `bool` | `Boolean` | | | golden |
| `bit(n>1)` | `int64` | `Int64` | | avoids silent truncation | type_roundtrip |
| `tinyint unsigned` | `int16` | `Int16` | integer text | 0–255 | type_roundtrip |
| `smallint unsigned` | `int32` | `Int32` | integer text | up to 65535 | type_roundtrip |
| `int unsigned` | `int64` | `Int64` | integer text | up to 4294967295 | type_roundtrip |
| `decimal(10,2)` | `decimal(10,2)` | `DECIMAL(10,2)` | exact decimal text | `columns:` override | type_roundtrip |
| `char` | `string` | `Utf8` | escaped | fixed `CHAR(n)` | type_roundtrip |
| `mediumtext` / `longtext` | `text` | `Utf8` | escaped | large payloads | type_roundtrip |
| `enum` / `set` | `enum` | `Utf8` + logical=enum | label text | SET comma-separated | type_roundtrip |
| `year` | `int16` | `Int16` | integer text | calendar year | type_roundtrip |
| `boolean` (native) | `bool` | `Boolean` | true/false | not only `TINYINT(1)` | type_roundtrip |
| nullable / all-null || preserved | empty cells | edge columns | type_roundtrip |

## SQL Server (MSSQL)

| Source type | Rivet logical | Parquet (Arrow) | CSV | Notes | Tested |
|-------------|---------------|-----------------|-----|-------|--------|
| `tinyint` (0–255) | `int16` | `Int16` | integer text | widened (unsigned source) | live matrix |
| `smallint` | `int16` | `Int16` | integer text | | live matrix |
| `int` | `int32` | `Int32` | integer text | | live matrix |
| `bigint` | `int64` | `Int64` | integer text | | live matrix |
| `bit` | `bool` | `Boolean` | 0/1 | | live matrix |
| `decimal(p,s)` / `numeric(p,s)` | `decimal(p,s)` | `DECIMAL(p,s)` | exact decimal text | scale recovered from the data (tiberius drops declared scale) | live matrix |
| `money` | `decimal(19,4)` | `DECIMAL(19,4)` | exact decimal text | fixed scale | live matrix |
| `smallmoney` | `decimal(10,4)` | `DECIMAL(10,4)` | exact decimal text | fixed scale | type_roundtrip |
| `real` | `float32` | `Float32` | float text | | live matrix |
| `float` | `float64` | `Float64` | float text | | live matrix |
| `date` | `date` | `Date32` | ISO date | | live matrix |
| `time` | `time(µs)` | `Time64(µs)` | time text | µs precision | live matrix |
| `datetime2` / `datetime` / `smalldatetime` | `timestamp(µs, none)` | `Timestamp(µs, None)` | datetime text | naive; µs default (full range), `datetime2(7)`'s 100 ns tick truncated — opt into `timestamp_ns` to keep it, see known gap 4 | live matrix |
| `datetimeoffset` | `timestamp_tz(µs, UTC)` | `Timestamp(µs, UTC)` | datetime text | normalised to UTC | type_roundtrip |
| `nvarchar` / `varchar` / `nchar` / `char` / `text` / `ntext` | `string` | `Utf8` | escaped UTF-8 | | live matrix |
| `varbinary` / `binary` / `image` | `binary` | `Binary` | hex in CSV | | live matrix |
| `uniqueidentifier` | `uuid` | `FixedSizeBinary(16)` + Parquet `LogicalType::Uuid` | canonical UUID text | native UUID downstream | live matrix |
| nullable / all-null || preserved | empty cells | | live matrix |

Unmapped SQL Server types resolve to `Unsupported` and fail loudly at schema
build unless a `columns:` override maps them.

## Known gaps (tracked)

1. **Nested arrays, ranges, inet, PostGIS, geometry**: not in the type matrix —
   they resolve to `Unsupported` and fail at schema build unless a `columns:`
   override maps them.
2. **Nullability**: every exported column is `OPTIONAL`; a source `NOT NULL`
   constraint is not propagated into the Parquet schema (ADR-0016, deferred to
   v0.8 Phase A).
3. **CSV complex types**: arrays (`List`), `Decimal256` (precision > 38), and
   non-UUID fixed binary have no CSV cell — the export **fails loudly** naming
   the column (see [CSV serialization]#csv-serialization) rather than silently
   writing empty values.
4. **SQL Server `datetime2` sub-microsecond precision***default is
   microsecond; nanosecond is opt-in.* rivet maps `datetime2` to `Timestamp(µs)`
   **by default**, because Arrow nanosecond timestamps are i64 ns and span only
   **1677-09-21 .. 2262-04-11**, while `datetime2` spans 0001–9999 — a blanket ns
   mapping would silently corrupt any value outside that window (a far worse bug
   than the precision gap). So the 7th fractional digit of a `datetime2(7)`
   (100 ns) is truncated to µs by default; lossless for `datetime2(6)` and below.

   To **preserve the 100 ns tick** on a column whose data is inside the ns range,
   opt in per column with a `timestamp_ns` override:

   ```yaml
   columns:
     event_ts: timestamp_ns       # naive; use timestamp_tz_ns for datetimeoffset
   ```

   The Parquet file then carries `Timestamp(ns)` and the full precision survives
   (verified live 2026-06-07: DuckDB reads it natively as `TIMESTAMP_NS`,
   `…12:00:00.1234567` intact; the default µs path truncates to `.123456`).
   **Caveats (all verified live 2026-06-07):**
   - A value outside 1677–2262 exports as **NULL** (Arrow ns range).
   - **DuckDB** — native `TIMESTAMP_NS`, fully lossless.
   - **Snowflake** — autoloads as `NUMBER(38,0)` (raw nanos); `TO_TIMESTAMP_NTZ(col, 9)`
     recovers a **lossless** `TIMESTAMP_NTZ` (it holds 9 digits — the 7th survives).
   - **BigQuery** — autoloads as `INT64` (raw nanos, lossless as an integer); a
     native `TIMESTAMP_MICROS(DIV(col,1000))` is **lossy** (BigQuery `TIMESTAMP`
     is microsecond — the 7th digit drops). Keep the default `timestamp` for
     BigQuery unless you carry the raw nanos.

   **Incremental mode:** the default µs cursor on a `datetime2(7)` lands one tick
   below the source max, re-exporting the boundary row every run — use
   `timestamp_ns` (the cursor literal then carries all 9 digits), a `datetime2(6)`
   (or coarser) cursor, or an integer/identity cursor.

(MySQL `DECIMAL` now resolves its precision/scale from the wire column
definition — no override needed; see the MySQL section.)

## Fidelity labels

| Label | Meaning |
|-------|---------|
| `exact` | Value and type semantics preserved |
| `compatible` | Value preserved; physical type differs (e.g. UUID as Utf8) |
| `logical_string` | Valid text; native JSON tree semantics not enforced in Arrow |
| `lossy` | Rejected in strict mode |
| `unsupported` | Requires policy override |

See [`src/types/fidelity.rs`](../src/types/fidelity.rs).

## CSV serialization

CSV shares the same Arrow `RecordBatch` as Parquet, so values are identical —
only the text rendering differs ([`src/format/csv.rs`](../src/format/csv.rs)):

| RivetType | CSV rendering |
|-----------|---------------|
| ints / `float` / `bool` / `decimal` | plain text (`decimal` exact, never via float) |
| `string` / `text` / `json` / `enum` / `interval` | text, RFC-4180 quoted/escaped when needed |
| `uuid` | canonical hyphenated lowercase (`a0eebc99-…`) |
| `binary` (`bytea` / `BLOB`) | lowercase hex (`deadbeef`) |
| `date` / `time` / `timestamp` | ISO 8601 (`2026-01-01T12:00:00.000000`) |
| `timestamp_tz` | same ISO text as a naive timestamp — the UTC offset is **not** rendered (the instant is preserved, but a reader can't distinguish it from a naive value) |

CSV has **no cell representation** for `list` (arrays), `Decimal256`
(precision > 38), or non-UUID fixed binary. Rather than silently write an empty
value, the export **fails at writer creation** naming the column — use
`format: parquet` or drop the column from the query.

**Loading CSV into a warehouse:** unlike Parquet (whose loader will not coerce a
declared type — see below), BigQuery's CSV loader *honors* a declared `--schema`.
Bare `--autodetect` infers `decimal` text as `FLOAT` (precision-lossy) and every
semantic type (`uuid` / `json` / `bytea`) as `STRING`; declare `NUMERIC` / `JSON`
in the load schema, or recover post-load with `PARSE_JSON` / `FROM_HEX` as for
Parquet.

## Downstream targets (autoload vs native)

Parquet/CSV preserve values and Arrow physical types. Warehouse engines infer types from
physical schema on autoload (e.g. JSON columns appear as `STRING` / `VARCHAR`). Native
target types (`JSON`, `VARIANT`, `UUID`, …) require a **materialization** step (cast SQL,
load schema, or typed view).

See [ADR-0014: Target type materialization](adr/0014-target-type-materialization.md) for
the full matrix (DuckDB, BigQuery, Snowflake, ClickHouse) and planned
`rivet check --type-report --target <engine>` extensions.

### Verified physical autoload (DuckDB + ClickHouse, v0.7.8)

`make test-types-validators` re-reads every PG / MySQL matrix column through
two independent engines and pins the autoload type. The full matrix:

| RivetType (PG / MySQL source) | DuckDB `DESCRIBE` | ClickHouse `DESCRIBE TABLE file()` |
|-------------------------------|--------------------|------------------------------------|
| `int16` / `smallint`          | `SMALLINT`         | `Nullable(Int16)`                  |
| `int32` / `integer`           | `INTEGER`          | `Nullable(Int32)`                  |
| `int64` / `bigint`            | `BIGINT`           | `Nullable(Int64)`                  |
| `u_int64` (MySQL `BIGINT UNSIGNED`) | `UBIGINT`    | `Nullable(UInt64)`                 |
| `decimal(p,s)`                | `DECIMAL(p,s)`     | `Nullable(Decimal(p, s))`          |
| `float32` / `real`            | `FLOAT`            | `Nullable(Float32)`                |
| `float64` / `double precision`| `DOUBLE`           | `Nullable(Float64)`                |
| `date`                        | `DATE`             | `Nullable(Date32)`                 |
| `time(µs)`                    | `TIME`             | `Nullable(DateTime64(6))`          |
| `timestamp(µs)` (naive)       | `TIMESTAMP`        | `Nullable(DateTime64(6))`          |
| `timestamp_tz(µs, UTC)`       | `TIMESTAMP WITH TIME ZONE` | `Nullable(DateTime64(6, 'UTC'))` |
| `string` / `text` / `enum` / `interval` | `VARCHAR` | `Nullable(String)` |
| `json` (PG `JSON`/`JSONB`, MySQL `JSON`) | `JSON` | `Nullable(String)` (ClickHouse 24.8) — DuckDB autoloads as native `JSON` |
| `uuid` (PG native; MySQL via override) | `UUID` | `Nullable(FixedString(16))` (ClickHouse 24.8) — DuckDB autoloads as native `UUID` |
| `binary` (`bytea` / `BLOB`)   | `BLOB`             | `Nullable(String)` (raw bytes)     |
| `bool`                        | `BOOLEAN`          | `Nullable(Bool)`                   |
| `list<inner>`                 | `inner[]`          | `Array(Nullable(inner))`           |

Round-trip values that the suite pins per row: `sum(decimal * 10^scale)`
matches the in-process Arrow check; `json_valid` / `isValidJSON` returns true
for every row; `TRY_CAST(uid AS UUID)` / `toUUID(uid)` parses every row; BLOB
columns compare byte-for-byte (`hex(...)`); the empty list survives; null
bitmaps propagate. The MySQL `BIGINT UNSIGNED` max value (`2^64 − 1`) is the
load-bearing assertion that exact-width unsigned ints are not silently
overflowed to i64.

### BigQuery autoload & recovery (verified live)

BigQuery's Parquet loader is weaker than DuckDB's: it ignores several Parquet
logical types on autoload and — critically — **will not coerce a column to a
different declared type on load**. A `bq load` into a table that declares
`JSON`/`DATETIME` is *rejected* (`Field x has changed type from JSON to BYTES`),
so native types are recovered with a **post-load transform, not a load schema**.

| RivetType | BigQuery autoload | Native | Recovery (post-load) |
|-----------|-------------------|--------|----------------------|
| `json`    | `BYTES`           | `JSON` | `PARSE_JSON(SAFE_CONVERT_BYTES_TO_STRING(col))` |
| `uuid`    | `BYTES` (16 raw)  | `STRING` | `TO_HEX(col)` |
| `timestamp` (naive) | `TIMESTAMP` (instant) | `DATETIME` | `DATETIME(col)` |
| `list<inner>` | `RECORD{item}` | `REPEATED inner` | load staging with `--parquet_enable_list_inference`, then `ARRAY(SELECT el.item FROM UNNEST(col) AS el)` |
| `u_int64` | `INT64` (overflows > 2^63−1) | `NUMERIC` | none post-load — fix at source: `columns: { c: decimal(20,0) }` |
| `timestamp_tz`, `decimal`, `string`, `binary`, `bool`, ints | native | same ||

Rivet writes the Parquet list element as `item` (arrow-rs default, not the
spec's `element`), so even `--parquet_enable_list_inference` yields
`REPEATED RECORD{item}` rather than a clean `REPEATED <scalar>` — the `UNNEST`
flatten above is required.

`rivet check --type-report --target bigquery` prints the per-column autoload
type, the native type, and a ready-to-run recovery `CREATE TABLE … AS SELECT`
over the autoloaded `<table>__staging`. Set `exports[].target: bigquery` to get
it without the CLI flag. DuckDB needs none of this — it autoloads every logical
type natively.

### Snowflake autoload & recovery (verified live)

Snowflake's `INFER_SCHEMA` + `COPY` infers physical types only, so the same
semantic types degrade — and the `INFER_SCHEMA` column names come back
**lowercase and case-sensitive**, so the recovery `SELECT` must double-quote
every source reference (`"col"`).

| RivetType | Snowflake autoload | Native | Recovery (post-load) |
|-----------|--------------------|--------|----------------------|
| `json`    | `TEXT`             | `VARIANT` | `PARSE_JSON("col")` |
| `uuid`    | `BINARY` (16 raw)  | `TEXT` | `REGEXP_REPLACE(LOWER(HEX_ENCODE("col")), …)` → canonical UUID |
| `timestamp` (naive) | `NUMBER` (µs) | `TIMESTAMP_NTZ` | `TO_TIMESTAMP_NTZ("col", 6)` |
| `time`    | `NUMBER` (µs of day) | `TIME` | `TIME_FROM_PARTS(0,0,FLOOR("col"/1000000),MOD("col",1000000)*1000)` |
| `binary`  | `BINARY` *(needs `BINARY_AS_TEXT=FALSE`)* | `BINARY` | — (set the file-format option) |
| `timestamp_tz` | `TIMESTAMP_TZ` *(pin session `TIMEZONE='UTC'`)* | `TIMESTAMP_TZ` | — (autoload uses session offset otherwise) |
| `u_int64` | `NUMBER` (overflows > 2^63−1) | `NUMBER(20,0)` | none post-load — fix at source: `columns: { c: decimal(20,0) }` |
| `list<inner>` | `VARIANT` (the JSON array) | `ARRAY` | `"col"::ARRAY` |
| `decimal`, `string`, `bool`, `date`, ints | native | same ||

The load preamble the recovery depends on: `CREATE FILE FORMAT … TYPE=PARQUET
BINARY_AS_TEXT=FALSE`, `ALTER SESSION SET TIMEZONE='UTC'`, `CREATE TABLE … USING
TEMPLATE (… INFER_SCHEMA …)`, `COPY … MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE`.

`rivet check --type-report --target snowflake` (`--target sf`) emits the
per-column autoload/native types and the post-load `CREATE OR REPLACE TABLE …`
recovery over `<table>__staging`. Set `exports[].target: snowflake` to skip the
flag.