Phase 1 — Deterministic ML training-data plan.
DatasetPlan is a small, immutable composition over a TidyView
that adds the four ML-specific concerns missing from the data engine:
- Feature/label column selection with deterministic ordering.
- Encoding: float / int / bool / categorical → f64 features.
- Train/val/test splits, sequential or hashed-by-row.
- Batching with optional seeded SplitMix64 shuffle, materializing each batch into a row-major Tensor.
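The encoding and materialization steps listed above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the `Cell` enum and the `encode` and `materialize_row_major` functions are hypothetical names, and a flat `Vec<f64>` stands in for the real Tensor type.

```rust
/// Hypothetical raw cell value (an assumption; not the crate's actual type).
#[derive(Clone, Copy, Debug)]
pub enum Cell {
    Float(f64),
    Int(i64),
    Bool(bool),
    /// A categorical value that has already been interned to a code.
    Category(u32),
}

/// Encode one cell as an f64 feature.
pub fn encode(cell: Cell) -> f64 {
    match cell {
        Cell::Float(x) => x,
        // Exact only for magnitudes up to 2^53; larger integers are rounded.
        Cell::Int(i) => i as f64,
        Cell::Bool(b) => if b { 1.0 } else { 0.0 },
        Cell::Category(code) => f64::from(code),
    }
}

/// Copy the selected rows of column-oriented data into a row-major buffer.
/// `columns[c][r]` is the cell in column `c`, row `r`; the output holds
/// `rows.len() * columns.len()` values, one full row after another.
pub fn materialize_row_major(columns: &[Vec<Cell>], rows: &[u32]) -> Vec<f64> {
    let mut out = Vec::with_capacity(rows.len() * columns.len());
    for &r in rows {
        for col in columns {
            out.push(encode(col[r as usize]));
        }
    }
    out
}
```

Because the column order is fixed by the caller and the row IDs are visited in the order given, the same inputs always produce the same buffer.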
Phase 1 is Rust-only and is not yet exposed to .cjcl; that exposure is planned for Phase 3. Phase 6 will wire plan_hash into a training manifest; until then the field is reserved (Option<[u8; 32]>, always None).
§Determinism contract
- Row IDs are always ascending u32 by default; TidyView already guarantees this for the underlying selection.
- Shuffles use cjc_repro::Rng::seeded(seed) (SplitMix64) with Fisher-Yates over the split’s row vector.
- Hashed splits use the fixed splitmix64 mixer keyed by row ^ seed.
- Categorical dictionaries are built over all source rows (not just train) so val/test see codes consistent with train, then frozen before any batch is materialized.
- Tensor materialization is row-major; no reductions, no FMA — bit copies only.
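The shuffle and hashed-split rules in this contract can be illustrated with a self-contained sketch. It uses the widely published SplitMix64 constants, but everything else is an assumption: the `SplitMix64` struct here is a stand-in for cjc_repro::Rng, the `Bucket` enum and the percentage-based thresholds in `hashed_bucket` are hypothetical, and the crate's actual mixer and index-reduction step may differ.

```rust
const GAMMA: u64 = 0x9E37_79B9_7F4A_7C15;

/// SplitMix64 output mixing function applied to a 64-bit value.
pub fn mix(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Stand-in generator (assumption: not the cjc_repro type).
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn seeded(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(GAMMA);
        mix(self.state)
    }
}

/// Fisher-Yates shuffle of a row-ID vector, driven by a seeded generator.
/// The modulo reduction has a tiny bias, which is acceptable for a sketch.
pub fn shuffle(rows: &mut [u32], seed: u64) {
    let mut rng = SplitMix64::seeded(seed);
    for i in (1..rows.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        rows.swap(i, j);
    }
}

/// Hypothetical split label (not the crate's `Split` enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Bucket {
    Train,
    Val,
    Test,
}

/// Assign a row to a bucket by hashing `row ^ seed`. The assignment depends
/// only on the row ID and the seed, never on iteration order.
pub fn hashed_bucket(row: u32, seed: u64, train_pct: u64, val_pct: u64) -> Bucket {
    let h = mix(u64::from(row) ^ seed) % 100;
    if h < train_pct {
        Bucket::Train
    } else if h < train_pct + val_pct {
        Bucket::Val
    } else {
        Bucket::Test
    }
}
```

With a design like this, a row keeps its bucket when other rows are filtered out upstream, which is the main reason to prefer a hashed split over a sequential one for evolving datasets.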
§Reuse map
| Need | Existing primitive |
|---|---|
| Filter / project upstream | TidyView::filter, TidyView::select |
| Row mask | AdaptiveSelection inside the TidyView |
| Categorical encoding | ByteDictionary::intern + freeze |
| Column-name → encoding map | detcoll::SortedVecMap |
| Seeded RNG | cjc_repro::Rng::seeded (SplitMix64) |
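The intern-then-freeze pattern behind the categorical encoding row can be sketched with the standard library alone. The `Dict` type below is a hypothetical stand-in, not ByteDictionary: its method signatures, the first-seen code assignment, and the choice to return `None` for unseen keys after freezing are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical freezable dictionary from byte strings to dense codes.
pub struct Dict {
    codes: BTreeMap<Vec<u8>, u32>,
    frozen: bool,
}

impl Dict {
    pub fn new() -> Self {
        Self { codes: BTreeMap::new(), frozen: false }
    }

    /// Return the code for `key`, assigning the next free code on first
    /// sight. After `freeze`, unseen keys yield `None` instead.
    pub fn intern(&mut self, key: &[u8]) -> Option<u32> {
        if let Some(&code) = self.codes.get(key) {
            return Some(code);
        }
        if self.frozen {
            return None;
        }
        let code = self.codes.len() as u32;
        self.codes.insert(key.to_vec(), code);
        Some(code)
    }

    /// Stop accepting new keys; existing codes stay valid.
    pub fn freeze(&mut self) {
        self.frozen = true;
    }
}

/// Build a dictionary over every source row of a column, then freeze it,
/// so train, val, and test batches all see the same codes.
pub fn build_dictionary(column: &[&str]) -> Dict {
    let mut dict = Dict::new();
    for value in column {
        dict.intern(value.as_bytes());
    }
    dict.freeze();
    dict
}
```

Codes are assigned in first-seen order over rows visited in ascending order, so the mapping is reproducible for a given input.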
Structs§
- BatchIterator
- BatchSpec
- DatasetPlan: Immutable training-data plan. Cheap to clone (TidyView holds Rc<DataFrame>).
- MaterializedBatch
Enums§
- DatasetError
- EncodingSpec: Per-column encoding directive. Each feature/label column must have one and only one of these. Phase 1 supports four encodings; richer schemes (one-hot, embedding lookup) are deferred to Phase 3.
- Split
- SplitSpec