Deterministic Adaptive Dictionary Engine — Phase 1 core (v3 brief).
A byte-first categorical engine designed to coexist with the existing
Column::Categorical and FctColumn types (which stay for backward
compatibility). New code that needs a deterministic, memory-efficient
categorical column should prefer CategoricalColumn from this module.
§Design summary
- ByteStringPool: append-only Vec<u8> arena with stable (offset, len) handles. No String in the hot path.
- ByteStrView: opaque (offset, len) handle into a pool.
- AdaptiveCodes: 4-arm enum (U8/U16/U32/U64). Promotes deterministically when cardinality crosses 256 / 65 536 / 2³² boundaries.
- ByteDictionary: BTreeMap<Vec<u8>, u64> lookup, frozen flag, CategoryOrdering policy. Deterministic by construction.
- CategoricalColumn: codes + dictionary + optional null bitmap.
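The arena-plus-handle idea can be sketched as follows. This is a minimal illustration and not the module's implementation: `Pool` and `View` are hypothetical stand-ins for ByteStringPool and ByteStrView, and the method names `push` and `get` are assumptions.

```rust
/// Hypothetical stand-in for `ByteStrView`: a plain (offset, len) pair.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct View {
    offset: usize,
    len: usize,
}

/// Hypothetical stand-in for `ByteStringPool`: one contiguous, append-only buffer.
#[derive(Default)]
pub struct Pool {
    bytes: Vec<u8>,
}

impl Pool {
    /// Appends `data` to the end of the arena and returns a handle to it.
    /// Nothing is ever removed or moved within the buffer, so earlier
    /// handles keep pointing at the same bytes.
    pub fn push(&mut self, data: &[u8]) -> View {
        let offset = self.bytes.len();
        self.bytes.extend_from_slice(data);
        View { offset, len: data.len() }
    }

    /// Resolves a handle back to the bytes it denotes.
    pub fn get(&self, view: View) -> &[u8] {
        &self.bytes[view.offset..view.offset + view.len]
    }
}
```

Because a handle stores indices rather than a pointer, it stays valid even when the underlying Vec<u8> reallocates during later pushes.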
§Determinism contract
- All thresholds are integer-only. No float math anywhere.
- Lookup is BTreeMap, not HashMap, so there is no randomized hashing.
- Lexical ordering uses raw byte comparison (Vec<u8>::cmp), not Unicode-aware sort. Cross-machine reproducibility is guaranteed.
- Code-width promotion is lazy: it is triggered only when the current arm physically cannot hold the next code (e.g., inserting code 256 into a U8 arm). It is not predictive.
- intern() on a frozen dictionary returns Err, never silently extends.
- The same byte sequence interned in two fresh dictionaries with the same CategoryOrdering produces bit-identical code sequences.
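A few points of this contract can be illustrated with a small sketch. It is not the module's code: `Dict` and `InternError` are hypothetical stand-ins for ByteDictionary and ByteDictError, codes are assigned in first-seen order (the actual CategoryOrdering variants are not shown in this page), and the sketch assumes that already-known keys still resolve on a frozen dictionary.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type standing in for `ByteDictError`.
#[derive(Debug, PartialEq, Eq)]
pub enum InternError {
    Frozen,
}

/// Hypothetical stand-in for `ByteDictionary`.
#[derive(Default)]
pub struct Dict {
    lookup: BTreeMap<Vec<u8>, u64>,
    frozen: bool,
}

impl Dict {
    /// Returns the existing code for `key`, or assigns the next code in
    /// first-seen order (an assumed ordering policy).
    pub fn intern(&mut self, key: &[u8]) -> Result<u64, InternError> {
        if let Some(&code) = self.lookup.get(key) {
            // Assumption: known keys still resolve after freezing.
            return Ok(code);
        }
        if self.frozen {
            // A frozen dictionary is never extended silently.
            return Err(InternError::Frozen);
        }
        let code = self.lookup.len() as u64;
        self.lookup.insert(key.to_vec(), code);
        Ok(code)
    }

    /// Marks the dictionary as frozen; later unknown keys are rejected.
    pub fn freeze(&mut self) {
        self.frozen = true;
    }

    /// Keys in raw byte order, which is the BTreeMap iteration order.
    pub fn keys_sorted(&self) -> Vec<&[u8]> {
        self.lookup.keys().map(|k| k.as_slice()).collect()
    }
}
```

Interning the same input into two fresh `Dict` values yields the same codes, because no step depends on a random hash seed or on locale. Raw byte order also means that b"Zebra" sorts before b"apple", since 0x5A is less than 0x61.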
§What this module does NOT do (deferred)
- Wiring into TidyView verbs (Phase 2)
- Categorical-aware group_by/join (Phase 3)
- Replacement of existing Column::Categorical / FctColumn
- Language-level .cjcl builtins
§Layout
Public API at the top, helpers below, #[cfg(test)] mod tests at the
bottom. Inline unit tests pin every invariant in the determinism
contract; bolero fuzz lives in tests/bolero_fuzz/categorical_dictionary_fuzz.rs.
Structs§
- ByteDictionary: Deterministic byte-keyed dictionary.
- ByteStrView: Opaque handle into a ByteStringPool. Cheap to copy.
- ByteStringPool: Append-only byte arena. Each interned byte sequence gets a stable ByteStrView handle whose (offset, len) survives all subsequent insertions (the underlying Vec<u8> may reallocate, but the indices into it are stable).
- CategoricalColumn: A vector of codes pointing into a shared ByteDictionary, plus an optional null bitmap.
- CategoricalProfile: Internal stats computed by CategoricalColumn::profile(). Integer fields only (no float math), deterministic.
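To make the "codes plus optional null bitmap" layout concrete, here is a minimal sketch. The `Column` type, its fields, and its methods are all hypothetical; the real CategoricalColumn and CategoricalProfile may be organized quite differently. The bitmap convention (bit i set means row i is null, 64 rows per word) is also an assumption.

```rust
/// Hypothetical column layout: each code indexes into `categories`.
pub struct Column {
    pub categories: Vec<Vec<u8>>,
    pub codes: Vec<u32>,
    /// Assumed convention: bit `i` set means row `i` is null.
    /// `None` means the column has no nulls at all.
    pub nulls: Option<Vec<u64>>,
}

impl Column {
    /// Checks the null bit for `row`; a missing bitmap means "never null".
    pub fn is_null(&self, row: usize) -> bool {
        match &self.nulls {
            Some(words) => (words[row / 64] >> (row % 64)) & 1 == 1,
            None => false,
        }
    }

    /// Returns the category bytes for `row`, or `None` for a null row.
    pub fn get(&self, row: usize) -> Option<&[u8]> {
        if self.is_null(row) {
            None
        } else {
            Some(self.categories[self.codes[row] as usize].as_slice())
        }
    }

    /// Integer-only statistic: counts set bits in the bitmap.
    /// Assumes padding bits past the last row are zero.
    pub fn null_count(&self) -> u64 {
        match &self.nulls {
            Some(words) => words.iter().map(|w| u64::from(w.count_ones())).sum(),
            None => 0,
        }
    }
}
```

Keeping the bitmap optional lets a column without nulls skip the allocation entirely, and a popcount-based null count stays in integer arithmetic throughout.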
Enums§
- AdaptiveCodes: Adaptive-width code storage. Promotes to a wider arm only when the current arm physically cannot hold the next code.
- ByteDictError
- CategoryOrdering: Policy for assigning codes to byte sequences during dictionary construction.
- InternedCode: Outcome of intern_with_policy.
- UnknownCategoryPolicy: Policy for intern_with_policy when a byte sequence is not in the dictionary AND the dictionary is frozen.
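The lazy promotion rule described for AdaptiveCodes can be sketched roughly as below. `Codes` is a hypothetical stand-in, and the method names (`push`, `widen`, `width_bits`, `len`) are assumptions made for illustration; only the four arms and the "promote when the next code does not fit" rule come from the text above.

```rust
/// Hypothetical stand-in for `AdaptiveCodes`.
#[derive(Debug, PartialEq, Eq)]
pub enum Codes {
    U8(Vec<u8>),
    U16(Vec<u16>),
    U32(Vec<u32>),
    U64(Vec<u64>),
}

impl Codes {
    /// Starts in the narrowest arm.
    pub fn new() -> Self {
        Codes::U8(Vec::new())
    }

    /// Appends `code`, widening first only if the current arm cannot hold it.
    pub fn push(&mut self, code: u64) {
        loop {
            let fits = match *self {
                Codes::U8(_) => code <= u64::from(u8::MAX),
                Codes::U16(_) => code <= u64::from(u16::MAX),
                Codes::U32(_) => code <= u64::from(u32::MAX),
                Codes::U64(_) => true,
            };
            if fits {
                break;
            }
            self.widen();
        }
        // The casts below are lossless because `fits` was checked above.
        match self {
            Codes::U8(v) => v.push(code as u8),
            Codes::U16(v) => v.push(code as u16),
            Codes::U32(v) => v.push(code as u32),
            Codes::U64(v) => v.push(code),
        }
    }

    /// Moves to the next wider arm, converting every stored code losslessly.
    fn widen(&mut self) {
        let old = std::mem::replace(self, Codes::U64(Vec::new()));
        *self = match old {
            Codes::U8(v) => Codes::U16(v.into_iter().map(u16::from).collect()),
            Codes::U16(v) => Codes::U32(v.into_iter().map(u32::from).collect()),
            Codes::U32(v) => Codes::U64(v.into_iter().map(u64::from).collect()),
            Codes::U64(v) => Codes::U64(v),
        };
    }

    /// Width of the current arm in bits.
    pub fn width_bits(&self) -> u32 {
        match self {
            Codes::U8(_) => 8,
            Codes::U16(_) => 16,
            Codes::U32(_) => 32,
            Codes::U64(_) => 64,
        }
    }

    /// Number of stored codes.
    pub fn len(&self) -> usize {
        match self {
            Codes::U8(v) => v.len(),
            Codes::U16(v) => v.len(),
            Codes::U32(v) => v.len(),
            Codes::U64(v) => v.len(),
        }
    }
}
```

In this sketch, pushing codes 0 through 255 keeps the U8 arm, and pushing 256 is the first operation that forces a move to U16. The check uses only integer comparisons, so the promotion point is identical on every machine.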