## TL;DR
- Provide a compile-time schema for Arrow using Rust types and macros.
- Generate monomorphized code per column index and base type (no runtime `DataType` switching).
- Add a unified facade to build `RecordBatch`es from rows for both compile-time and runtime schemas, inferring capacity from iterator size hints.
- Surface structured errors for the dynamic path (arity/type/builder/nullability errors). Nullability is validated before finishing via `try_finish_into_batch` to return a descriptive error instead of panicking. Keep the typed path infallible/zero-cost by default.
- Keep the unified facade focused on batch building (projection helpers are out of scope).
---
## Goals
- Encode each column’s Arrow type and nullability at the type level via `#[derive(Record)]`.
- Enable compile-time dispatch across columns and base types using traits/const-generics (no `match DataType` at runtime).
- Generate typed Arrow builders/arrays for each column from the compile-time schema.
- Provide a unified facade to build `RecordBatch`es from rows for both typed and dynamic schemas with inferred capacity.
- Surface errors in the dynamic path (row arity, type mismatches, builder failures, nullability violations) via a structured error type; `try_finish_into_batch` validates nullability for columns/fields/items and returns an error with path context.
- Keep ergonomics close to idiomatic Rust: field attributes express Arrow specifics; Option/Nullability is explicit.
- Keep projection helpers out of the unified facade to maintain focus and minimal surface area.
---
## High-Level Architecture
### Type-Level Schema (Core)
- `#[derive(Record)]` on a plain Rust struct generates:
- `impl Record for T` with `const LEN: usize`.
- Per-column type metadata via `ColAt<I>` exposing `type Rust`, `type ColumnBuilder`, `type ColumnArray`, `const NULLABLE`, `const NAME`, and `fn data_type() -> DataType`.
- A type-directed “for-each column” expansion hook via `ForEachCol::for_each_col`.
- Implemented traits (current):
- `Record`, `ColAt<I>`, `ForEachCol`.
- Row-based building: `BuildRows` with generated `<Type>Builders` and `<Type>Arrays`.
- Nested struct support: `StructMeta` (child fields + struct builder) and `AppendStruct` (append children into a StructBuilder).
- Nullability:
- Mode A: Use `Option<T>` in the field type.
- Mode B: `#[arrow(nullable)]` attribute even if field type is non-`Option` (explicit override). Derived code maps to `NULLABLE` const.
### Arrow Interop (Runtime Bridge)
- Type mapping trait:
- `trait ArrowBinding { type Builder; type Array: arrow_array::Array; fn data_type() -> DataType; ... }`
- Implementations map Rust types directly to arrow-rs typed builders/arrays without runtime dispatch.
- Builders API (row-based):
- `BuildRows` derive emits `<Type>Builders` and `<Type>Arrays`.
- `<Type>Builders` methods: `append_row(row)`, `append_rows(iter)`, `append_null_row()`, `append_option_row(Option<row>)`, `append_option_rows(iter)`; `finish()` returns `<Type>Arrays`.
- Nested structs are the default for struct-typed fields (no attribute needed), powered by `AppendStruct` and `StructMeta`.
- Future: optional `into_record_batch()` bridge when needed.
- Runtime schema:
- `schema::<R>()` available from `SchemaMeta` (typed). For runtime-only schemas, use `typed-arrow-dyn::DynSchema`.
- Compile-Time Dispatch (No runtime matches)
- Column visitor/iterator (generated by derive):
- `trait ColumnVisitor { fn visit<const I: usize, R>(FieldMeta<R>); }`
- `for_each_col::<V>()` expands to `V::visit::<I, Rust>(...)` for each column at compile time.
- Kernel pattern: implement `ColumnVisitor` for an operation; the monomorphized instances per column/base-type fall out of generics.
- Macro & Attribute Design
- `#[derive(Record)]` with field-level attributes:
- Struct-typed fields are nested `Struct` columns by default; use `Option<Nested>` on the field for column nullability.
- `#[derive(Union)]` (enums) supports Dense/Sparse modes and attributes:
- Container: `#[union(mode = "dense"|"sparse", null_variant = "Var", tags(A=10, B=7))]`
- Variant: `#[union(tag = 42)]`, `#[union(field = "name")]`, `#[union(null)]`
- Map uses wrappers rather than attributes: `Map<K, V, const SORTED: bool>` and `OrderedMap<K, V>`; value-nullability via `Option<V>`.
- Helper macro for column iteration (optional if derive emits trait impl):
- `for_each_col!(RecordType, |I, Meta| { /* compile-time expanded body */ });`
### Ergonomics
- Default mapping for well-known Rust types:
- `i8/i16/i32/i64`, `u8/u16/u32/u64`, `f32/f64`, `bool` map to Arrow primitives.
- `String` → `Utf8`, `Vec<u8>` → `Binary`.
- `Option<T>` toggles nullability at the column level.
- Nested & specialized wrappers (current):
- Lists: `List<T>` (items non-null), `List<Option<T>>` (items nullable); LargeList and FixedSizeList variants.
- Dictionary: `Dictionary<K, V>` for Utf8/Binary/FixedSizeBinary/primitives.
- Timestamp: `Timestamp<U>` (unit markers) and `TimestampTz<U, Z>` (timezone markers, e.g., `Utc`).
- Decimal: `Decimal128<const P: u8, const S: i8>`, `Decimal256<const P: u8, const S: i8>`.
- Map: `Map<K, V, const SORTED: bool>`; nullable values via `Option<V>`.
- Ordered Map: `OrderedMap<K, V>` with sorted keys (`keys_sorted = true`).
- Union: `#[derive(Union)]` with Dense and Sparse modes.
## Dynamic Runtime (typed-arrow-dyn)
- Row/cell types:
- `DynRow(Vec<Option<DynCell>>)` holds per-column optional values.
- `DynCell` variants cover Arrow logical types (bool, ints, floats, utf8/binary, dictionary via value types, and nested Struct/List/LargeList/FixedSizeList).
- Builders and schema:
- `DynSchema(Arc<Schema>)` and `DynBuilders` create and finish batches for runtime schemas.
- `DynRow::append_into` validates arity and value compatibility before appending.
- Nullability enforcement: dynamic builders do not check column/field/item nullability during appends; a validator runs at `try_finish_into_batch` to catch violations and return a structured error with the offending path and index.
- Factory: `new_dyn_builder(dt: &DataType)` selects a concrete builder implementation; nullability is not passed to the factory and is enforced by Arrow.
- Errors:
- `DynError` variants: `ArityMismatch`, `TypeMismatch { col, expected }`, `Builder { message }`, `Append { col, message }`.
- All dynamic append operations return `Result` and propagate context; no silent mismatches.
---
## Example
```rust
use typed_arrow::prelude::*;
#[derive(Record)]
struct Address { city: String, zip: Option<i32> }
#[derive(Record)]
struct Person {
id: i64,
// Nested struct field (no attribute needed)
address: Option<Address>,
email: Option<String>,
}
fn build_arrays_from_rows(rows: Vec<Option<Person>>) {
let mut b = <Person as BuildRows>::new_builders(rows.len());
b.append_option_rows(rows);
let arrays = b.finish();
// arrays.id: PrimitiveArray<Int64Type>
// arrays.address: StructArray
// arrays.email: StringArray
}
// Example compile-time dispatch
struct Count;
impl ColumnVisitor for Count {
fn visit<const I: usize, R>(_m: FieldMeta<R>) {
let _ = I;
}
}
fn debug_schema<T: ForEachCol>() { T::for_each_col::<Count>(); }
```
### Unified `build_batch` examples
Typed schema:
```rust
use typed_arrow::prelude::*;
use typed_arrow_unified::{SchemaLike, Typed};
#[derive(Record)]
struct Person { id: i64, name: Option<String> }
fn build_batch_typed() -> arrow_array::RecordBatch {
let rows = vec![
Person { id: 1, name: Some("a".into()) },
Person { id: 2, name: None },
];
let schema = Typed::<Person>::default();
schema.build_batch(rows).unwrap()
}
```
Dynamic schema:
```rust
use arrow_schema::{DataType, Field, Schema};
use typed_arrow_dyn::{DynCell, DynRow, DynSchema};
use typed_arrow_unified::SchemaLike;
fn build_batch_dynamic() -> arrow_array::RecordBatch {
let schema = Schema::new(vec![
Field::new("id", DataType::Int64, false),
Field::new("name", DataType::Utf8, true),
]);
let dyn_schema = DynSchema::new(schema);
let rows = vec![
DynRow(vec![Some(DynCell::I64(1)), Some(DynCell::Str("a".into()))]),
DynRow(vec![Some(DynCell::I64(2)), None]),
];
dyn_schema.build_batch(rows).expect("valid rows")
}
```
---
## Open Questions / Risks
- Timezones at type level: use const `&'static str` generic or a closed set of TZ types? Start with UTC only.
- Decimal overflow/scale enforcement in builders: enforce at append-time or constructor?
- Dictionary-encoded columns: how to expose keys/types at compile-time without specialization.
- Stable specialization: avoid it; prefer marker traits and explicit impls.
- Arrow-rs compatibility: gate features for versioned differences in array/builder APIs.
---
## Minimum Supported Rust Version (tentative)
- Rust 1.75+ (const generics for integers are sufficient; advance if we adopt additional const parameters).
---
# Repository Guidelines
## Project Structure & Module Organization
- `src/` — core library:
- `schema.rs`: `Record`, `ColAt<I>`, visitors, compile-time metadata.
- `bridge/` — Arrow interop:
- `mod.rs`: `ArrowBinding` trait and public re-exports
- `primitives.rs`: numeric primitives, `bool`, `f16`
- `strings.rs`: `String` (Utf8), `LargeUtf8`
- `binary.rs`: `Vec<u8>` (Binary), `[u8; N]` (FixedSizeBinary), `LargeBinary`
- `decimals.rs`: `Decimal128`, `Decimal256`
- `lists.rs`: `List`, `LargeList`, `FixedSizeList`
- `map.rs`: `Map`, `OrderedMap`
- `dictionary.rs`: `Dictionary`, `DictKey` and value impls
- `temporal.rs`: `Date32/64`, `Time32/64`, `Duration`, `Timestamp`, `TimestampTz`, TZ markers
- `intervals.rs`: `IntervalYearMonth`, `IntervalDayTime`, `IntervalMonthDayNano`
- `record_struct.rs`: blanket impl for `T: Record + StructMeta` → `StructArray`
- `column.rs`: `data_type_of<R, I>()`, `ColumnBuilder<R, I>`
- `lib.rs`: crate entry, prelude exports.
- `typed-arrow-derive/` — proc-macro crate implementing `#[derive(Record)]` and `#[derive(Union)]`.
- `tests/` — integration tests (e.g., `primitive_macro.rs`).
- `docs/` — design notes (e.g., `nested-types.md`).
- `examples/` — runnable demos (e.g., `examples/11_map.rs`).
## Build, Test, and Development Commands
- `cargo build` — builds the workspace (library + proc-macro).
- `cargo test` — runs unit and integration tests.
- `cargo check` — fast type-check of the workspace.
- Optional: `RUSTFLAGS='-D warnings' cargo build` to treat warnings as errors.
## Coding Style & Naming Conventions
- Rust 2021 edition; keep code idiomatic and minimal.
- Run `rustfmt` (standard Rust formatting). Use `clippy` locally for hints.
- Naming: modules `snake_case`, types and traits `CamelCase`, functions `snake_case`, constants `SCREAMING_SNAKE_CASE`.
- Public API docs: crates use `#![deny(missing_docs)]` — document new public items across typed, dynamic, and unified crates.
## Testing Guidelines
- Prefer focused tests; add integration tests in `tests/` for end‑to‑end flows.
- Validate that `ColAt<I>` exposes `fn data_type()`, `ColumnBuilder`, `ColumnArray` and that builders produce typed arrays.
- Exercise row-based building (`append_row`, `append_option_rows`) and nested struct append.
- Validate DataType shapes (child names, `keys_sorted`, union tags/fields) and append semantics.
- Prefer names/tags over child indices when asserting nested/union children.
- Run locally with `cargo test -q` and `cargo clippy --workspace -D warnings`.
## Commit & Pull Request Guidelines
- Commits: clear, imperative mood (e.g., “Add ArrowBinding for u32”).
- Prefer small, focused commits with meaningful scope; reference issues when relevant.
- PRs must include:
- What changed and why (problem statement + approach).
- Tests or rationale for test coverage.
- Notes on API or behavior changes (breaking/experimental).
## Architecture Overview
- Compile-time schema via `#[derive(Record)]` generates `Record`, `ColAt<I>`, and `ForEachCol`.
- `ColAt<I>` exposes `Rust` (inner type), `data_type()`, `ColumnBuilder`, and `ColumnArray` — no runtime `DataType` switching.
- `bridge::ArrowBinding` maps Rust types and wrappers (e.g., primitives, Utf8/Binary, List/LargeList/FixedSizeList, Map/OrderedMap, Decimal128/256, Timestamp/TimestampTz, Dictionary, and Union via derive) to Arrow builders/arrays.
- Unified facade:
- `SchemaLike::build_batch` unifies batch assembly for typed (`Typed<R>`) and dynamic (`DynSchema`/`Arc<Schema>`).
- `BuildersLike` abstracts append/finish with `Result`-returning dynamic path and infallible typed path.