rust-data-processing 0.2.2

Schema-first ingestion (CSV, JSON, Parquet, Excel) into an in-memory DataSet, plus Polars-backed pipelines, SQL, profiling, validation, and map/reduce-style processing.

# Delta / Iceberg → `DataSet` (limits and handoff)

## What this library does **not** ship (Phase 2)

- **No embedded Delta commit log reader** or Iceberg manifest reader in the default `rust-data-processing` build (see [`Planning/ADR_P2_E2_LAKE_TABLE_READ.md`](../Planning/ADR_P2_E2_LAKE_TABLE_READ.md)).
- **No catalog** (REST, Glue, Unity), **no time travel** selection beyond what your export tool writes into files.
- **No distributed** scan or shuffle.

## Recommended patterns

### 1. Export to Parquet (simplest)

Use **Spark**, **Databricks**, **Python `deltalake`**, or **Trino** to export the table as a Parquet directory or a single Parquet file, then:

```rust
use rust_data_processing::ingestion::{ingest_from_path, IngestionOptions};

// Default options suffice for a `.parquet` path (file name is illustrative):
let ds = ingest_from_path("export/table.parquet", IngestionOptions::default())?;
```

### 2. Arrow `RecordBatch` handoff (Rust, `--features arrow`)

Read batches with your tool of choice, then:

```rust
use rust_data_processing::transform::arrow::record_batches_to_dataset;
use rust_data_processing::types::Schema;

// `schema` is a `Schema` matching the logical columns you need
// (Int64, Float64, Bool, Utf8); `batch1`/`batch2` are Arrow `RecordBatch`
// values produced by your reader.
let ds = record_batches_to_dataset(&[batch1, batch2], &schema)?;
```

See rustdoc on `record_batches_to_dataset` for schema alignment rules.
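
The batches themselves can come from any Arrow-producing reader. Below is a minimal sketch, assuming you depend on the `parquet` crate (its Arrow reader sits behind the `arrow` feature, enabled by default) and have already built a matching `schema`; the file path is illustrative and error handling uses `?` for brevity:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rust_data_processing::transform::arrow::record_batches_to_dataset;

// Illustrative path; any Parquet file your export tool produced works here.
let file = File::open("export/table.parquet")?;
let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

// The reader yields Result<RecordBatch, _>; collect everything at once
// (stream batch-by-batch instead if the file is large).
let batches = reader.collect::<Result<Vec<_>, _>>()?;
let ds = record_batches_to_dataset(&batches, &schema)?;
```

For files already on disk, pattern 1 is the simpler route; the Arrow handoff earns its keep when batches arrive from an in-process source rather than from Parquet.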

### 3. Python

Use **`deltalake`** or **`pyiceberg`** to scan the table into **PyArrow**, write the result as **Parquet**, and call `rust_data_processing.ingest_from_path` on that Parquet path. If you need in-process Arrow rather than a file round-trip, serialize the batches and use project-specific glue.

## When to use Spark / Databricks

Use a cluster engine for **large** tables, **ACID** maintenance, **ZORDER**, **liquid clustering**, **Iceberg branching**, or **governance** features. Use this library for **local** QA, transforms, validation, and **smaller** extracts you land as files.