Struct ByteDictionary 

pub struct ByteDictionary { /* private fields */ }

Deterministic byte-keyed dictionary.

Owns:

  • a ByteStringPool storing every distinct byte sequence
  • a BTreeMap<Vec<u8>, u64> mapping bytes → code (deterministic iteration; no randomized hashing)
  • an ordering policy controlling code assignment
  • a frozen flag preventing inadvertent extension

§Code stability across runs

For a given (input byte sequence, ordering) pair, the dictionary produces bit-identical codes on every machine, architecture, and locale. The only inputs to the code map are the raw bytes and the ordering policy.
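The stability claim can be illustrated with a small stand-alone model. This is a sketch under assumptions, not the crate's implementation: `first_seen_codes` is a hypothetical helper that mimics FirstSeen ordering with a plain `BTreeMap`, using only the raw bytes as input.

```rust
use std::collections::BTreeMap;

/// Toy model of first-seen code assignment (hypothetical helper, not the
/// crate's code). Each distinct byte sequence gets the next free code in
/// encounter order; repeats reuse the code assigned earlier.
fn first_seen_codes(inputs: &[&[u8]]) -> Vec<u64> {
    // BTreeMap orders keys by byte comparison and uses no randomized hashing.
    let mut map: BTreeMap<Vec<u8>, u64> = BTreeMap::new();
    let mut codes = Vec::with_capacity(inputs.len());
    for bytes in inputs {
        let next = map.len() as u64;
        let code = *map.entry(bytes.to_vec()).or_insert(next);
        codes.push(code);
    }
    codes
}
```

Because nothing in the model depends on pointer values, hash seeds, or locale-aware comparison, feeding it the same input sequence always yields the same code vector.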

Implementations§

impl ByteDictionary

pub fn new() -> Self

Empty dictionary with FirstSeen ordering. Not frozen.

pub fn with_ordering(ordering: CategoryOrdering) -> Self

Empty dictionary with the given ordering. For Explicit, the codes 0..values.len() are assigned immediately.

pub fn from_explicit(values: Vec<Vec<u8>>) -> Result<Self, ByteDictError>

Build an Explicit-ordered dictionary from the given values. The dictionary is returned unfrozen; the caller can choose to freeze it. Returns Err if values contains duplicates.
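A minimal sketch of the Explicit construction rule, assuming codes follow list position and that the first repeated value is reported as the error. `explicit_codes` and `DuplicateValue` are hypothetical names used only for illustration; the real error type is ByteDictError, whose variants are not shown on this page.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the duplicate-value error case.
#[derive(Debug, PartialEq)]
struct DuplicateValue(Vec<u8>);

/// Toy model: assign codes 0..values.len() in list order and reject
/// any value that appears more than once.
fn explicit_codes(values: Vec<Vec<u8>>) -> Result<BTreeMap<Vec<u8>, u64>, DuplicateValue> {
    let mut map = BTreeMap::new();
    for (code, value) in values.into_iter().enumerate() {
        if map.contains_key(&value) {
            return Err(DuplicateValue(value));
        }
        map.insert(value, code as u64);
    }
    Ok(map)
}
```

Rejecting duplicates up front keeps the mapping between list position and code one-to-one.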

pub fn len(&self) -> usize

Number of categories.

pub fn is_empty(&self) -> bool

True if no categories.

pub fn is_frozen(&self) -> bool

True if the dictionary is frozen.

pub fn ordering(&self) -> &CategoryOrdering

The ordering policy.

pub fn freeze(&mut self)

Freeze the dictionary. After freezing, intern returns Err(Frozen) for unknown values; intern_with_policy honours UnknownCategoryPolicy. Lookups continue to work.
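The frozen-flag semantics can be sketched with a toy type. `ToyDict` and `ToyError` are hypothetical stand-ins, and the model assumes that interning an already-known value still succeeds after freezing, as the wording "for unknown values" suggests.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type standing in for the Frozen case.
#[derive(Debug, PartialEq)]
enum ToyError {
    Frozen,
}

/// Toy dictionary with a frozen flag (a model, not the crate's type).
#[derive(Default)]
struct ToyDict {
    codes: BTreeMap<Vec<u8>, u64>,
    frozen: bool,
}

impl ToyDict {
    fn freeze(&mut self) {
        self.frozen = true;
    }

    /// Read-only lookup; works whether or not the dictionary is frozen.
    fn lookup(&self, bytes: &[u8]) -> Option<u64> {
        self.codes.get(bytes).copied()
    }

    /// Known values resolve in either state; unknown values are
    /// rejected once the dictionary is frozen.
    fn intern(&mut self, bytes: &[u8]) -> Result<u64, ToyError> {
        if let Some(code) = self.lookup(bytes) {
            return Ok(code);
        }
        if self.frozen {
            return Err(ToyError::Frozen);
        }
        let code = self.codes.len() as u64;
        self.codes.insert(bytes.to_vec(), code);
        Ok(code)
    }
}
```

Freezing after the training pass means a typo or unseen category at inference time surfaces as an error instead of silently growing the code space.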

pub fn pool(&self) -> &ByteStringPool

Direct read access to the underlying pool.

pub fn get(&self, code: u64) -> Option<&[u8]>

Resolve a code back to its byte payload. Returns None for out-of-range codes.

pub fn lookup(&self, bytes: &[u8]) -> Option<u64>

Look up a byte sequence. Does not extend the dictionary. Returns None if not present.

v3 Phase 7: once seal_for_lookup() has been called, the primary lookup goes through the DHarht Memory profile. Unsealed dictionaries fall back to the BTreeMap, which is also the canonical iteration source.

pub fn seal_for_lookup(&mut self)

v3 Phase 7: build the DHarht lookup accelerator and seal it. After this call, lookup() routes through the DHarht, and the dictionary should be treated as read-only for performance reasons. Mutations are not blocked, but they invalidate the accelerator and trigger an assertion in debug builds. The BTreeMap lookup table is preserved for canonical iteration and for range-style queries, which the DHarht does not support.

Spec compliance:

  • splitmix64 deterministic scattering ✓
  • 256 shards (power of two) ✓
  • sealed sparse 16-bit front directory ✓
  • MicroBucket16 with deterministic BTreeMap overflow on bucket > 16 (no silent entry loss) ✓
  • full key equality on every successful lookup ✓
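The first two checklist items can be sketched as follows. `splitmix64` below is the standard public-domain mixing function; how the DHarht actually derives a shard index from the mixed value is not stated on this page, so `shard_of` (low 8 bits) is an assumption made only for illustration.

```rust
/// Standard splitmix64 mixing step applied to `x`.
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// 256 shards, a power of two, so masking replaces a modulo.
const SHARDS: u64 = 256;

/// Assumption: the shard is the low 8 bits of the mixed key.
fn shard_of(key: u64) -> usize {
    (splitmix64(key) & (SHARDS - 1)) as usize
}
```

Since the mixer has no seed and no per-process state, the same key lands in the same shard on every run, which is what keeps the scattering deterministic.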

pub fn is_lookup_sealed(&self) -> bool

True if the dictionary has been sealed with a DHarht accelerator. Distinct from the frozen flag (which controls extension, not lookup backend).

pub fn dharht_overflow_count(&self) -> u64

Diagnostic: number of entries that overflowed to the per-shard BTreeMap fallback in the DHarht. Always 0 if not sealed.

pub fn seal_with_u64_hash_index(&mut self)

v3 Phase 11: build a u64-hash content-addressed lookup index using SealedU64Map (DHarhtMemory profile). After this call, lookup_by_hash(h) returns the dictionary code for whichever byte sequence hashes to h. The hash function is the workspace’s deterministic crate::detcoll::hash_bytes, so the caller’s pre-computed hash and the index agree byte-for-byte.

Use case: snapshot diffing, content-addressed storage, reproducibility-critical pipelines where the hash is the canonical identifier.

This is independent of seal_for_lookup(): you can call either, both, or neither. When both indices are built, they are mutually consistent because they reference the same code space.

pub fn is_hash_indexed(&self) -> bool

True if the u64-hash index has been built.

pub fn lookup_by_hash(&self, hash: u64) -> Option<u64>

Look up a code by the deterministic u64 hash of its bytes. Returns None if the hash is unknown OR if the hash index has not been built (call seal_with_u64_hash_index() first).

Hash collision safety: this is a hash-only lookup with no full byte equality check. If two distinct byte sequences hash to the same u64, the lookup returns one of the two codes. For splitmix64-mixed hashes of well-distributed inputs, the chance that a given pair collides is about 2^-64. For safety-critical paths, use lookup_by_hash_verify, which takes the original bytes and verifies them.

pub fn lookup_by_hash_verify(&self, hash: u64, bytes: &[u8]) -> Option<u64>

Hash-keyed lookup with explicit byte verification. Returns the code only if both (a) the hash maps to a known code AND (b) the stored bytes for that code match bytes exactly. This closes the roughly 2^-64 collision window of lookup_by_hash.
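The difference between the two hash lookups is easiest to see with a deliberately weak hash. The sketch below is a toy model: `toy_hash` (a byte sum) and `ToyIndex` are hypothetical and stand in for crate::detcoll::hash_bytes and the real index, and the rule that the first value keeps a contested slot is an assumption.

```rust
use std::collections::BTreeMap;

/// Deliberately weak stand-in hash (sum of bytes) so that collisions are
/// easy to produce. Not the crate's hash function.
fn toy_hash(bytes: &[u8]) -> u64 {
    bytes.iter().map(|&b| b as u64).sum()
}

struct ToyIndex {
    by_hash: BTreeMap<u64, u64>, // hash -> code
    payloads: Vec<Vec<u8>>,      // code -> bytes
}

impl ToyIndex {
    fn build(values: &[&[u8]]) -> Self {
        let mut by_hash = BTreeMap::new();
        let mut payloads = Vec::new();
        for (code, v) in values.iter().enumerate() {
            // Assumption: on a collision the first value keeps the slot.
            by_hash.entry(toy_hash(v)).or_insert(code as u64);
            payloads.push(v.to_vec());
        }
        ToyIndex { by_hash, payloads }
    }

    /// Hash-only lookup: trusts the hash, never compares bytes.
    fn lookup_by_hash(&self, hash: u64) -> Option<u64> {
        self.by_hash.get(&hash).copied()
    }

    /// Verified lookup: the stored bytes must equal the caller's bytes.
    fn lookup_by_hash_verify(&self, hash: u64, bytes: &[u8]) -> Option<u64> {
        let code = self.lookup_by_hash(hash)?;
        let stored = self.payloads.get(code as usize)?;
        (stored.as_slice() == bytes).then_some(code)
    }
}
```

With values `b"ab"` and `b"ba"`, which collide under the byte-sum hash, the hash-only lookup hands back the code for `b"ab"` in both cases, while the verified lookup refuses to return it for `b"ba"`.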

pub fn intern(&mut self, bytes: &[u8]) -> Result<u64, ByteDictError>

Intern a byte sequence. If it is not present and the dictionary is not frozen, a new code is assigned (valid only under FirstSeen or ExtendDictionary-style flows). Under Lexical ordering this method works in streaming mode: codes are assigned in encounter order, and seal_lexical() re-sorts them at the end. For Explicit, intern returns an error on unknown values (the explicit list is the authority).

pub fn intern_with_policy( &mut self, bytes: &[u8], policy: &UnknownCategoryPolicy, ) -> Result<InternedCode, ByteDictError>

Intern with an explicit unknown-category policy. This is the inference-time API: it does not require the dictionary to be unfrozen.

pub fn seal_lexical(&mut self) -> Vec<u64>

Re-assign codes lexicographically (only useful when ordering is Lexical). Returns the permutation old_code → new_code so callers can rewrite their code arrays.

For FirstSeen and Explicit, this is a no-op and returns the identity permutation.
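How a caller might consume the returned permutation can be sketched with a stand-alone model. `seal_lexical_model` and `remap` are hypothetical helpers, and the sort is assumed to be plain lexicographic byte order; only the contract `perm[old_code] == new_code` comes from the description above.

```rust
/// Toy model: `payloads[old_code]` holds the bytes for each code assigned
/// in encounter order. Returns (payloads in new code order, permutation)
/// where `perm[old_code] == new_code`.
fn seal_lexical_model(payloads: Vec<Vec<u8>>) -> (Vec<Vec<u8>>, Vec<u64>) {
    let mut order: Vec<usize> = (0..payloads.len()).collect();
    // Sort old codes by their byte payloads (lexicographic byte order).
    order.sort_by(|&a, &b| payloads[a].cmp(&payloads[b]));
    let mut perm = vec![0u64; payloads.len()];
    for (new_code, &old_code) in order.iter().enumerate() {
        perm[old_code] = new_code as u64;
    }
    let sorted = order.iter().map(|&old| payloads[old].clone()).collect();
    (sorted, perm)
}

/// Rewrite a column of previously emitted codes in place.
fn remap(codes: &mut [u64], perm: &[u64]) {
    for c in codes.iter_mut() {
        *c = perm[*c as usize];
    }
}
```

Any code array produced before sealing has to go through a step like `remap`, otherwise it keeps referring to the old encounter-order codes.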

pub fn iter(&self) -> impl Iterator<Item = (u64, &[u8])> + '_

Iterate (code, bytes) pairs in code order. Deterministic.

Trait Implementations§

impl Clone for ByteDictionary

fn clone(&self) -> ByteDictionary

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for ByteDictionary

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for ByteDictionary

fn default() -> Self

Returns the “default value” for a type.
