rust-data-processing 0.1.1

Schema-first ingestion (CSV, JSON, Parquet, Excel) into an in-memory DataSet, plus Polars-backed pipelines, SQL, profiling, validation, and map/reduce-style processing.
# Developer README (rust-data-processing)

This doc is for contributors working inside the repo. Keep it in sync with the actual module structure under `src/`
and the Phase 1 tracker in `Planning/PHASE1_PLAN.md`.

## Quick navigation

- **Published documentation (CI + Pages)**: `Planning/DOCUMENTATION.md` (rustdoc + pdoc, GitHub Pages, local `./scripts/build_docs.ps1 -All`)
- **Issue triage (labels, popular-bug query)**: `Planning/ISSUE_TRIAGE.md`
- **User-facing examples**: root `README.md` (cookbook + map/reduce aggregates); `API.md` (full aggregate / `group_by` snippets)
- **Reduce / aggregate semantics**: `Planning/REDUCE_AGG_SEMANTICS.md`
- **Public API surface**: `src/lib.rs` (module exports + default-on `sql`)
- **Core data model**: `src/types.rs` (`Schema`, `Field`, `DataType`, `DataSet`, `Value`)
- **Errors**: `src/error.rs` (`IngestionError`, `IngestionResult<T>`)
- **Ingestion**: `src/ingestion/`
- **DB ingestion (ConnectorX, feature-gated)**: `src/ingestion/db.rs`
- **DataFrame pipelines (Polars-backed)**: `src/pipeline/mod.rs`
- **In-memory transforms**: `src/processing/`
- **Parallel execution engine**: `src/execution/`
- **SQL module (feature: `sql`, enabled by default)**: `src/sql/mod.rs`
- **Transform spec**: `src/transform.rs` (`TransformSpec`, `TransformStep`)
- **Profiling**: `src/profiling/mod.rs`
- **Validation**: `src/validation/mod.rs`
- **Outliers**: `src/outliers/mod.rs`
- **CDC boundary types**: `src/cdc/mod.rs`
- **Benchmarks**: `benches/` (`pipelines`, `ingestion`, `map_reduce`, `profiling`, `validation`, `outliers`)
- **Scripts (Windows/PowerShell convenience)**: `scripts/`
- **Python bindings (PyO3 + maturin + uv)**: `python-wrapper/` (see `python-wrapper/README_DEV.md`)

## Module map (what does what)

### `types` (`src/types.rs`)

- **Purpose**: Engine-agnostic in-memory representation (`DataSet`) plus schema/value types.
- **Rule of thumb**: Keep this free of Polars/DataFusion types. Treat it as the stable “public model”.

### `error` (`src/error.rs`)

- **Purpose**: Single public error model for ingestion + engine delegation.
- **Guideline**: When an underlying engine/library produces a useful structured error, wrap it as
  `IngestionError::Engine { message, source }` so callers can keep the source chain.

### `ingestion` (`src/ingestion/`)

Files:

- `mod.rs`: module exports and public re-exports.
- `unified.rs`: **main entrypoints**:
  - `ingest_from_path(path, schema, options)`
  - `infer_schema_from_path(path, options)` (lossy inference)
  - `ingest_from_path_infer(path, options)` convenience path
  - `IngestionOptions`, `IngestionFormat`, `ExcelSheetSelection`, `IngestionRequest`
- `builder.rs`: `IngestionOptionsBuilder` (engine-agnostic config builder).
- `observability.rs`: observer traits + implementations (`StdErrObserver`, `FileObserver`, etc.).
- `polars_bridge.rs`: conversion helpers between `DataSet` and Polars plus Polars-error mapping.
- `csv.rs`, `json.rs`, `parquet.rs`, `excel.rs`: format-specific ingestion implementations.

Guidelines:

- **Public signatures** should stay in crate types (`Schema`, `DataSet`, `Value`, `IngestionOptions`).
- Prefer **delegation** to Polars reads/scans where it improves correctness/perf; fall back to custom logic where required.
- Keep ingestion behavior consistent across formats (especially **null semantics** and **schema mismatch** behavior).

### `pipeline` (`src/pipeline/mod.rs`)

- **Purpose**: DataFrame-centric pipeline API (Polars-backed) that keeps public signatures in crate types.
- **Key type**: `pipeline::DataFrame` which compiles to a Polars `LazyFrame` internally and collects to `DataSet`.
- **Guideline**: Avoid leaking Polars types in public APIs (internal fields can be Polars).

### `processing` (`src/processing/`)

- `filter.rs`: in-memory row filtering
- `map.rs`: in-memory row mapping
- `reduce.rs`: reductions (`ReduceOp` including mean/variance/std/sum-squares/L2/count-distinct)
- `multi.rs`: `feature_wise_mean_std`, `arg_max_row` / `arg_min_row`, `top_k_by_frequency`

Guidelines:

- Keep this layer **simple, deterministic**, and easy to benchmark.
- `reduce()` returns `None` for missing columns; preserve that behavior across other APIs where it makes sense.

### `execution` (`src/execution/`)

- **Purpose**: parallel execution wrapper around `processing`, with:
  - chunking
  - throttling (`max_in_flight_chunks`)
  - real-time metrics + observer hooks

Files:

- `mod.rs`: `ExecutionEngine`, `ExecutionOptions`, public observer/metrics exports.
- `observer.rs`: observer traits + default stderr observer.
- `semaphore.rs`: simple throttle primitive.

Guidelines:

- Keep scheduling behavior stable enough for benchmarks.
- When adding new parallel ops, ensure metrics/observer events remain coherent and useful.

### `sql` (`src/sql/mod.rs`) (feature: `sql`)

- **Purpose**: Polars-backed SQL wrapper (compiles SQL → Polars lazy plan).
- **Guideline**: Keep the rest of the crate Polars-first and DataFrame-centric; don’t let SQL drive public type leakage.

### `transform` (`src/transform.rs`)

- **Purpose**: Serde-friendly, end-user “mapping spec” (`TransformSpec`) for common ETL operations (rename/select/drop/cast/fill/derive),
  compiled to `pipeline::DataFrame` wrappers.
- **Guideline**: Keep public inputs/outputs in crate types (`Schema`, `DataSet`, `DataType`, `Value`) so Python and other bindings can wrap it.

### `profiling` (`src/profiling/mod.rs`)

- **Purpose**: Polars-backed column profiling (row count, nulls, distinct (non-null), numeric min/max/mean + quantiles) with deterministic sampling.
- **Guideline**: Reports should be stable, engine-agnostic structs with stable JSON/Markdown renderers.

### `validation` (`src/validation/mod.rs`)

- **Purpose**: Validation DSL (`ValidationSpec`, `Check`) compiled to Polars expressions, producing stable reports with JSON/Markdown renderers.
- **Guideline**: Prefer “collect once” plans; keep example collection bounded (`max_examples` style).

### `outliers` (`src/outliers/mod.rs`)

- **Purpose**: Outlier detection primitives (IQR / z-score / MAD) with explainable outputs and stable renderers.
- **Guideline**: Keep results explainable (stats + thresholds + examples) and deterministic under `SamplingMode`.

### `cdc` (`src/cdc/mod.rs`)

- **Purpose**: Dependency-free boundary types for CDC events and a batch-first source trait (Phase 1 spike).
- **Guideline**: Do not add a heavy default CDC dependency; keep this as an interface boundary.

### `ingestion::db` (`src/ingestion/db.rs`) (feature: `db_connectorx`)

- **Purpose**: Direct DB ingestion via ConnectorX (DB → Arrow → `DataSet`).
- **Guideline**: Keep API read-only and minimal; document platform prerequisites (e.g., OpenSSL toolchain constraints on some platforms).

## Common workflows

### GitHub Actions (summary)

- **`.github/workflows/rust_ci.yml`** — on every PR and push to **`main`**: `fmt`, `clippy`, tests (ubuntu + Windows), plus **ubuntu** `cargo test --features ci_expanded` (no **`db_connectorx`** — avoids OpenSSL/Perl in CI).
- **`.github/workflows/rust_release.yml`** — on tag **`v*`** (commit must be on **`main`**): `cargo publish --dry-run` then `cargo publish`.
- Policy write-up: **`Planning/CI_DEPLOY_POLICY.md`**.

### Build + test

```powershell
cargo test
```

### Run feature-gated test suites

```powershell
# Same as GitHub Actions “expanded” job (no ConnectorX → no OpenSSL/Perl on this path)
cargo test --locked --features ci_expanded

# Deep tests (large fixtures)
./scripts/run_deep_tests.ps1

# Excel ingestion (reader)
cargo test --features excel

# Excel tests that generate an .xlsx at runtime
cargo test --features excel_test_writer

# DB ingestion (ConnectorX) — see “Windows: OpenSSL / perl” below
cargo test --features db_connectorx
```

### Windows: OpenSSL / “perl not found” / build hang

**`cargo test --all-features`** or **`--features db_connectorx`** enables **ConnectorX**, which pulls **OpenSSL**. On Windows, **openssl-sys** may try to **compile OpenSSL from source** and needs **Perl** (e.g. **Strawberry Perl**). A failed configure step can look like a **hang**; **incremental** builds may keep retrying until you **clean**.

**Recover:** `cargo clean` (or `cargo clean -p openssl-sys`), then prefer **`cargo test --locked --features ci_expanded`** for parity with CI. Use **`db_connectorx`** only when Perl/OpenSSL (or **WSL** / **Linux**) is set up.

### Run benchmarks

```powershell
cargo bench --bench pipelines
cargo bench --bench ingestion
cargo bench --bench map_reduce
# map_reduce bench includes: filter/map/sum, scalar mean/variance (mem vs Polars), feature_wise_mean_std,
# arg_max / top_k_by_frequency, and Polars group_by ML-style aggs
cargo bench --bench profiling
cargo bench --bench validation
cargo bench --bench outliers
```

Convenience wrapper (Windows):

```powershell
./scripts/run_benchmarks.ps1 -Quick
```

### Generate Rustdoc

```powershell
./scripts/build_docs.ps1
```

### Feature flags

- **Default**: SQL is enabled (Polars-backed).
- **Enable Excel ingestion**:

```powershell
cargo test --features excel
```

- **Enable DB ingestion (ConnectorX)**:

```powershell
cargo test --features db_connectorx
```

- **Disable default features (including SQL)**:

```powershell
cargo test --no-default-features
```

- **`ci_expanded`**: `deep_tests` + `excel_test_writer` + `arrow` + `serde_arrow` (matches **`rust_ci.yml`**; no **`db_connectorx`**).
- **Deep tests**:

```powershell
./scripts/run_deep_tests.ps1
```

  These run `tests/deep_tests.rs` and include **in-memory vs Polars** parity for `reduce`, **`feature_wise_mean_std`**, plus **`group_by`** / **arg max/min** / **top‑k** on Seattle CSV, `job_runs_sample.json`, and the Apache Parquet fixture (where applicable).

## Making changes safely (project conventions)

- **Start at the public API**: if the change affects users, begin in `src/lib.rs` and `API.md`/`README.md`.
- **Keep public types engine-agnostic**: don’t expose Polars types in public signatures unless we intentionally “graduate” them.
- **Error handling**:
  - Use `IngestionError::SchemaMismatch` for missing/invalid shape.
  - Use `IngestionError::ParseError` for cell-level parsing/type issues.
  - Use `IngestionError::Engine` to preserve underlying engine/library error sources.
- **Benchmarks & parity**:
  - If you change behavior in ingestion/pipelines, update or add benchmarks/tests to keep parity expectations explicit.
- **Docs upkeep**:
  - If you add/rename modules or public APIs, update **both** `README.md` and this file (`README_DEV.md`).
  - Keep the roadmap summary in `README.md` aligned with `Planning/PHASE1_PLAN.md`.