Skip to main content

Module byte_dict

Module byte_dict 

Source
Expand description

Deterministic Adaptive Dictionary Engine — Phase 1 core (v3 brief).

A byte-first categorical engine designed to coexist with the existing Column::Categorical and FctColumn types (which stay for backwards compat). New code that needs a deterministic, memory-efficient categorical column should prefer CategoricalColumn from this module.

§Design summary

  • ByteStringPool — append-only Vec<u8> arena with stable (offset, len) handles. No String in the hot path.
  • ByteStrView — opaque (offset, len) handle into a pool.
  • AdaptiveCodes — 4-arm enum (U8/U16/U32/U64). Promotes deterministically when cardinality crosses 256 / 65 536 / 2³² boundaries.
  • ByteDictionaryBTreeMap<Vec<u8>, u64> lookup, frozen flag, CategoryOrdering policy. Deterministic by construction.
  • CategoricalColumn — codes + dictionary + optional null bitmap.

§Determinism contract

  1. All thresholds are integer-only. No float math anywhere.
  2. Lookup is BTreeMap, not HashMap — no randomized hashing.
  3. Lexical ordering uses raw byte comparison (Vec<u8>::cmp), not Unicode-aware sort. Cross-machine reproducibility is guaranteed.
  4. Code-width promotion is lazy — triggered only when the current arm physically cannot hold the next code (i.e., inserting code 256 into a U8 arm). It is not predictive.
  5. intern() on a frozen dictionary returns Err, never silently extends.
  6. The same byte sequence interned in two fresh dictionaries with the same CategoryOrdering produces bit-identical code sequences.

§What this module does NOT do (deferred)

  • Wiring into TidyView verbs (Phase 2)
  • Categorical-aware group_by/join (Phase 3)
  • Replacement of existing Column::Categorical / FctColumn
  • Language-level .cjcl builtins

§Layout

Public API at the top, helpers below, #[cfg(test)] mod tests at the bottom. Inline unit tests pin every invariant in the determinism contract; bolero fuzz lives in tests/bolero_fuzz/categorical_dictionary_fuzz.rs.

Structs§

ByteDictionary
Deterministic byte-keyed dictionary.
ByteStrView
Opaque handle into a ByteStringPool. Cheap to copy.
ByteStringPool
Append-only byte arena. Each interned byte sequence gets a stable ByteStrView handle whose (offset, len) survives all subsequent insertions (the underlying Vec<u8> may reallocate, but the indices into it are stable).
CategoricalColumn
A categorical column: a vector of codes pointing into a shared ByteDictionary, plus an optional null bitmap.
CategoricalProfile
Internal stats computed by CategoricalColumn::profile(). Integer fields only — no float math, deterministic.

Enums§

AdaptiveCodes
Adaptive-width code storage. Promotes to a wider arm only when the current arm physically cannot hold the next code:
ByteDictError
CategoryOrdering
Policy for assigning codes to byte sequences during dictionary construction.
InternedCode
Outcome of intern_with_policy.
UnknownCategoryPolicy
Policy for intern_with_policy when a byte sequence is not in the dictionary AND the dictionary is frozen.