Module dataset_plan


Phase 1 — Deterministic ML training-data plan.

DatasetPlan is a small, immutable composition over a TidyView that adds the four ML-specific concerns missing from the data engine:

Feature/label column selection with deterministic ordering.
Encoding: float / int / bool / categorical → f64 features.
Train/val/test splits, sequential or hashed-by-row.
Batching with optional seeded SplitMix64 shuffle, materializing each batch into a row-major Tensor.

Phase 1 is Rust-only and is not yet exposed to .cjcl; that exposure is planned for Phase 3. Phase 6 will wire plan_hash into a training manifest; for now the field is reserved (Option<[u8; 32]>, always None).
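To make the encoding and batching concerns concrete, here is a minimal sketch of how four column kinds might be turned into f64 features and copied into a row-major buffer. It does not use this module's API: `Column`, `encode_column`, and `materialize_batch` are hypothetical names, a plain `Vec<f64>` stands in for the Tensor, and assigning categorical codes in order of first appearance is an assumption (the actual code assignment of `ByteDictionary` is not documented here).

```rust
use std::collections::BTreeMap;

/// Hypothetical column storage, for illustration only.
enum Column {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Bool(Vec<bool>),
    Categorical(Vec<String>),
}

/// Encode one column to f64 features.
/// Assumption: categorical codes are assigned in order of first appearance
/// over all rows of the column, so every split sees the same codes.
fn encode_column(col: &Column) -> Vec<f64> {
    match col {
        Column::Float(v) => v.clone(),
        // Note: i64 values beyond 2^53 in magnitude lose precision as f64.
        Column::Int(v) => v.iter().map(|&x| x as f64).collect(),
        Column::Bool(v) => v.iter().map(|&b| if b { 1.0 } else { 0.0 }).collect(),
        Column::Categorical(v) => {
            let mut dict: BTreeMap<&str, u32> = BTreeMap::new();
            v.iter()
                .map(|s| {
                    let next = dict.len() as u32;
                    *dict.entry(s.as_str()).or_insert(next) as f64
                })
                .collect()
        }
    }
}

/// Copy the selected rows into a row-major buffer:
/// all features of the first row, then all features of the second, and so on.
/// Values are copied as-is; no arithmetic is performed on them.
fn materialize_batch(encoded: &[Vec<f64>], rows: &[u32]) -> Vec<f64> {
    let mut out = Vec::with_capacity(rows.len() * encoded.len());
    for &r in rows {
        for col in encoded {
            out.push(col[r as usize]);
        }
    }
    out
}
```

Under these assumptions, batching amounts to slicing a split's row vector into fixed-size chunks (for example with `rows.chunks(batch_size)`) and calling `materialize_batch` on each chunk. Building the dictionary over the whole column before any batch is materialized is what keeps val/test codes consistent with train.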

§Determinism contract

Row IDs are ascending u32 by default; TidyView already guarantees this for the underlying selection.
Shuffles use cjc_repro::Rng::seeded(seed) (SplitMix64) with Fisher-Yates over the split’s row vector.
Hashed splits use the fixed splitmix64 mixer keyed by row ^ seed.
Categorical dictionaries are built over all source rows (not just train) so val/test see codes consistent with train, then frozen before any batch is materialized.
Tensor materialization is row-major; no reductions, no FMA — bit copies only.
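The shuffle and hashed-split rules above can be sketched with a standalone SplitMix64. This is an illustration under stated assumptions, not the `cjc_repro::Rng` implementation: `SplitMix64`, `mix64`, `shuffle_rows`, and `hashed_split` are hypothetical names, the constants are the standard published SplitMix64 ones, the modulo-based index draw and the mapping of the hash to a fraction in [0, 1) are choices made for this sketch, and whether the crate's mixer adds the SplitMix64 increment before mixing is not stated in the source.

```rust
/// Stand-in SplitMix64 generator (hypothetical type for this sketch).
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn seeded(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        // Advance the state by the golden-ratio increment, then mix.
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        mix64(self.state)
    }
}

/// The SplitMix64 output mixer (finalizer), usable on its own as a hash.
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Fisher-Yates shuffle over a split's row vector.
/// Assumption: the index is drawn with a simple modulo, which carries a
/// negligible bias for realistic row counts.
fn shuffle_rows(rows: &mut [u32], seed: u64) {
    let mut rng = SplitMix64::seeded(seed);
    for i in (1..rows.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        rows.swap(i, j);
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Split {
    Train,
    Val,
    Test,
}

/// Assign a row to a split by hashing `row ^ seed`.
/// The assignment depends only on (row, seed), never on row order.
fn hashed_split(row: u32, seed: u64, train_frac: f64, val_frac: f64) -> Split {
    let h = mix64((row as u64) ^ seed);
    // Use the top 53 bits to form a fraction in [0, 1).
    let u = (h >> 11) as f64 / (1u64 << 53) as f64;
    if u < train_frac {
        Split::Train
    } else if u < train_frac + val_frac {
        Split::Val
    } else {
        Split::Test
    }
}
```

Because both functions are pure functions of their inputs and the seed, rerunning them with the same seed reproduces the same permutation and the same split membership, which is the property the contract relies on.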

§Reuse map

| Need | Existing primitive |
|---|---|
| Filter / project upstream | `TidyView::filter`, `TidyView::select` |
| Row mask | `AdaptiveSelection` inside the TidyView |
| Categorical encoding | `ByteDictionary::intern` + `freeze` |
| Column-name → encoding map | `detcoll::SortedVecMap` |
| Seeded RNG | `cjc_repro::Rng::seeded` (SplitMix64) |

Structs§

BatchIterator
BatchSpec
DatasetPlan: Immutable training-data plan. Cheap to clone (TidyView holds Rc<DataFrame>).
MaterializedBatch

Enums§

DatasetError
EncodingSpec: Per-column encoding directive. Each feature/label column must have exactly one of these. Phase 1 supports four encodings; richer schemes (one-hot, embedding lookup) are deferred to Phase 3.
Split
SplitSpec
