Module dataset_plan


Phase 1 — Deterministic ML training-data plan.

DatasetPlan is a small, immutable composition over a TidyView that adds the four ML-specific concerns missing from the data engine:

  1. Feature/label column selection with deterministic ordering.
  2. Encoding: float / int / bool / categorical → f64 features.
  3. Train/val/test splits, sequential or hashed-by-row.
  4. Batching with optional seeded SplitMix64 shuffle, materializing each batch into a row-major Tensor.
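Concerns 2 and 4 can be illustrated with a minimal, self-contained sketch. Everything named here is hypothetical and is not the crate's API: `Column` stands in for columns read through a TidyView, a `BTreeMap` stands in for the frozen categorical dictionary, and a flat `Vec<f64>` stands in for the row-major Tensor. Assigning categorical codes in first-seen order is also an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical column storage (assumption; the real plan reads a TidyView).
enum Column {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Bool(Vec<bool>),
    Categorical(Vec<String>),
}

/// Stand-in for a frozen categorical dictionary: value -> code.
type Dict = BTreeMap<String, u32>;

/// Build one dictionary per categorical column over ALL rows, so that
/// train, val, and test batches later see the same codes.
/// Codes are assigned in first-seen order (an assumption of this sketch).
fn build_dictionaries(cols: &[Column]) -> Vec<Option<Dict>> {
    cols.iter()
        .map(|col| match col {
            Column::Categorical(values) => {
                let mut dict = Dict::new();
                for v in values {
                    let next = dict.len() as u32;
                    dict.entry(v.clone()).or_insert(next);
                }
                Some(dict)
            }
            _ => None,
        })
        .collect()
}

/// Encode a single cell as f64. Note that `i64 as f64` is lossy above 2^53.
fn encode(col: &Column, dict: Option<&Dict>, row: usize) -> f64 {
    match col {
        Column::Float(v) => v[row],
        Column::Int(v) => v[row] as f64,
        Column::Bool(v) => if v[row] { 1.0 } else { 0.0 },
        Column::Categorical(v) => {
            let dict = dict.expect("categorical column needs a dictionary");
            dict[&v[row]] as f64
        }
    }
}

/// Materialize a batch in row-major order: all features of the first
/// row, then all features of the second row, and so on.
fn materialize(cols: &[Column], dicts: &[Option<Dict>], rows: &[u32]) -> Vec<f64> {
    let mut out = Vec::with_capacity(rows.len() * cols.len());
    for &r in rows {
        for (col, dict) in cols.iter().zip(dicts) {
            out.push(encode(col, dict.as_ref(), r as usize));
        }
    }
    out
}
```

The dictionaries are built once up front and only read afterwards, which mirrors the "build, then freeze" order described for categorical encoding. The resulting buffer has `rows.len() * cols.len()` elements, with the feature index varying fastest.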

Phase 1 is Rust-only: DatasetPlan is not yet exposed to .cjcl, which is planned for Phase 3. Phase 6 will wire plan_hash into a training manifest; until then the field is reserved (Option<[u8; 32]>, always None).

§Determinism contract

  • By default, row IDs are u32 values in ascending order; TidyView already guarantees this ordering for the underlying selection.
  • Shuffles use cjc_repro::Rng::seeded(seed) (SplitMix64) with Fisher-Yates over the split’s row vector.
  • Hashed splits use the fixed splitmix64 mixer keyed by row ^ seed.
  • Categorical dictionaries are built over all source rows (not just train) so val/test see codes consistent with train, then frozen before any batch is materialized.
  • Tensor materialization is row-major; no reductions, no FMA — bit copies only.
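The shuffle and hashed-split rules above can be sketched as follows. This is an illustrative stand-in, not the cjc_repro implementation: `SplitMix64`, `shuffle`, `Part`, and `assign` are hypothetical names, the constants are those of the standard published SplitMix64 algorithm, and mapping the hash to a percentage bucket is an assumption about how a hashed split might be turned into train/val/test membership.

```rust
/// The SplitMix64 finalizer ("mixer"): a fixed, stateless bit scrambler.
fn splitmix64_mix(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Minimal SplitMix64 generator (stand-in for a seeded RNG).
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn seeded(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        splitmix64_mix(self.state)
    }
}

/// Fisher-Yates shuffle over a split's row vector.
/// The same seed and input always yield the same permutation.
/// The modulo introduces a tiny bias, which this sketch accepts.
fn shuffle(rows: &mut [u32], seed: u64) {
    let mut rng = SplitMix64::seeded(seed);
    for i in (1..rows.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        rows.swap(i, j);
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Part {
    Train,
    Val,
    Test,
}

/// Hashed split: membership depends only on (row, seed), never on the
/// order in which rows are visited. Percentages are an assumption.
fn assign(row: u32, seed: u64, train_pct: u64, val_pct: u64) -> Part {
    let bucket = splitmix64_mix(row as u64 ^ seed) % 100;
    if bucket < train_pct {
        Part::Train
    } else if bucket < train_pct + val_pct {
        Part::Val
    } else {
        Part::Test
    }
}
```

Because `assign` is a pure function of `row ^ seed`, filtering or reordering rows upstream does not move any remaining row between splits, whereas a sequential split depends on row position.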

§Reuse map

| Need | Existing primitive |
|---|---|
| Filter / project upstream | TidyView::filter, TidyView::select |
| Row mask | AdaptiveSelection inside the TidyView |
| Categorical encoding | ByteDictionary::intern + freeze |
| Column-name → encoding map | detcoll::SortedVecMap |
| Seeded RNG | cjc_repro::Rng::seeded (SplitMix64) |

Structs§

BatchIterator
BatchSpec
DatasetPlan
Immutable training-data plan. Cheap to clone (TidyView holds Rc<DataFrame>).
MaterializedBatch

Enums§

DatasetError
EncodingSpec
Per-column encoding directive. Each feature/label column must have one and only one of these. Phase 1 supports four encodings; richer schemes (one-hot, embedding lookup) are deferred to Phase 3.
Split
SplitSpec