pub struct ByteDictionary { /* private fields */ }
Deterministic byte-keyed dictionary.
Owns:
- a ByteStringPool storing every distinct byte sequence
- a BTreeMap<Vec<u8>, u64> mapping bytes → code (deterministic iteration; no randomized hashing)
- an ordering policy controlling code assignment
- a frozen flag preventing inadvertent extension
§Code stability across runs
For a given (input byte sequence, ordering) pair, the dictionary
produces bit-identical codes on every machine, architecture, and
locale. The only inputs to the code map are the raw bytes and
the ordering policy.
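The stability claim can be illustrated with a small stand-in. The sketch below is not the crate's implementation: FirstSeenSketch and encode are hypothetical names, and it only models first-seen ordering on top of a std BTreeMap. It shows why the codes depend on nothing but the bytes and the order in which they arrive.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a first-seen-ordered dictionary (not the crate's type).
#[derive(Default)]
pub struct FirstSeenSketch {
    codes: BTreeMap<Vec<u8>, u64>, // bytes -> code; iteration order is byte order
    pool: Vec<Vec<u8>>,            // code -> bytes
}

impl FirstSeenSketch {
    /// Returns the existing code, or assigns the next code in encounter order.
    pub fn intern(&mut self, bytes: &[u8]) -> u64 {
        if let Some(&code) = self.codes.get(bytes) {
            return code;
        }
        let code = self.pool.len() as u64;
        self.pool.push(bytes.to_vec());
        self.codes.insert(bytes.to_vec(), code);
        code
    }
}

/// Encodes a sequence of values. No hasher seed, pointer value, or locale
/// is consulted, so the same input always yields the same codes.
pub fn encode<B: AsRef<[u8]>>(inputs: &[B]) -> Vec<u64> {
    let mut dict = FirstSeenSketch::default();
    inputs.iter().map(|b| dict.intern(b.as_ref())).collect()
}
```

Calling encode twice on the same input yields equal vectors. A map with per-process random hashing would give the same result here too, but its iteration order would differ between runs, which is why a BTreeMap is the safer backing store.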
Implementations§
impl ByteDictionary
pub fn with_ordering(ordering: CategoryOrdering) -> Self
Empty dictionary with the given ordering. For Explicit, codes are
pre-populated 0..values.len() immediately.
pub fn from_explicit(values: Vec<Vec<u8>>) -> Result<Self, ByteDictError>
Build an Explicit-ordered dictionary from the given values. The
dictionary is returned NOT frozen — the caller can choose to
freeze it. Returns Err if values contains duplicates.
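As a rough illustration of the duplicate check, the sketch below assigns codes 0..values.len() in list order and rejects a repeated value. ExplicitSketchError and explicit_codes are hypothetical names; the crate's ByteDictError variants are not shown on this page and may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum ExplicitSketchError {
    /// The same byte sequence appeared more than once in the explicit list.
    Duplicate(Vec<u8>),
}

/// Assigns codes 0..values.len() in list order, failing on the first repeat.
pub fn explicit_codes(
    values: Vec<Vec<u8>>,
) -> Result<BTreeMap<Vec<u8>, u64>, ExplicitSketchError> {
    let mut codes = BTreeMap::new();
    for (i, value) in values.into_iter().enumerate() {
        if codes.contains_key(&value) {
            return Err(ExplicitSketchError::Duplicate(value));
        }
        codes.insert(value, i as u64);
    }
    Ok(codes)
}
```

Rejecting duplicates up front keeps the code space dense: every code in 0..values.len() maps to exactly one byte sequence.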
pub fn ordering(&self) -> &CategoryOrdering
The ordering policy.
pub fn freeze(&mut self)
Freeze the dictionary. After freezing, intern returns
Err(Frozen) for unknown values; intern_with_policy honours
UnknownCategoryPolicy. Lookups continue to work.
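The frozen behaviour described above can be modelled in a few lines. This is a sketch under assumptions: FreezableSketch and FrozenSketchError are hypothetical, and only the first-seen case is modelled.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum FrozenSketchError {
    Frozen,
}

/// Hypothetical stand-in showing how a frozen flag gates extension.
#[derive(Default)]
pub struct FreezableSketch {
    codes: BTreeMap<Vec<u8>, u64>,
    frozen: bool,
}

impl FreezableSketch {
    pub fn freeze(&mut self) {
        self.frozen = true;
    }

    /// Read-only lookup; unaffected by the frozen flag.
    pub fn lookup(&self, bytes: &[u8]) -> Option<u64> {
        self.codes.get(bytes).copied()
    }

    /// Known values always resolve; unknown values are only added while unfrozen.
    pub fn intern(&mut self, bytes: &[u8]) -> Result<u64, FrozenSketchError> {
        if let Some(code) = self.lookup(bytes) {
            return Ok(code);
        }
        if self.frozen {
            return Err(FrozenSketchError::Frozen);
        }
        let code = self.codes.len() as u64;
        self.codes.insert(bytes.to_vec(), code);
        Ok(code)
    }
}
```

In this model, freezing after training means an unseen value at inference time surfaces as an error instead of silently growing the code space.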
pub fn pool(&self) -> &ByteStringPool
Direct read access to the underlying pool.
pub fn get(&self, code: u64) -> Option<&[u8]>
Resolve a code back to its byte payload. Returns None for
out-of-range codes.
pub fn lookup(&self, bytes: &[u8]) -> Option<u64>
Look up a byte sequence. Does not extend the dictionary. Returns
None if not present.
v3 Phase 7: once seal_for_lookup() has been called on the
dictionary, the primary lookup goes through the DHarht Memory
profile. Unsealed dictionaries fall back to the BTreeMap, which is
also the canonical iteration source.
pub fn seal_for_lookup(&mut self)
v3 Phase 7: build the DHarht lookup accelerator and seal it.
After this call, lookup() routes through the DHarht and the
dictionary should be treated as read-only for performance
reasons. Mutations are not blocked, but they invalidate the
accelerator and trigger a debug-build assertion. The
BTreeMap lookup table is preserved for canonical iteration
and for range-style queries that the DHarht does not
support.
Spec compliance:
- splitmix64 deterministic scattering ✓
- 256 shards (power of two) ✓
- sealed sparse 16-bit front directory ✓
- MicroBucket16 with deterministic BTreeMap overflow on bucket > 16 (no silent entry loss) ✓
- full key equality on every successful lookup ✓
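To make the first two checklist items concrete, here is a sketch of splitmix64 scattering followed by shard selection. The splitmix64 function uses the standard public constants. The way the mixed value is split into a shard index and a 16-bit directory slot is an assumption made for illustration (shard_and_slot is a hypothetical helper); the DHarht's real bit layout is not documented on this page.

```rust
/// One step of the standard splitmix64 mixer (public-domain constants).
pub fn splitmix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Hypothetical layout: the low 8 bits choose one of 256 shards and the
/// next 16 bits index a front directory. The actual layout may differ.
pub fn shard_and_slot(key_hash: u64) -> (usize, usize) {
    let mixed = splitmix64(key_hash);
    let shard = (mixed & 0xFF) as usize; // 256 shards, a power of two
    let slot = ((mixed >> 8) & 0xFFFF) as usize; // 16-bit directory index
    (shard, slot)
}
```

Because the shard count is a power of two, the shard index is a mask rather than a modulo, and the mixer has no seed, so the placement of a key is identical on every run.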
pub fn is_lookup_sealed(&self) -> bool
True if the dictionary has been sealed with a DHarht
accelerator. Distinct from the frozen flag (which controls
extension, not lookup backend).
pub fn dharht_overflow_count(&self) -> u64
Diagnostic: number of entries that overflowed to the per-shard
BTreeMap fallback in the DHarht. Always 0 if not sealed.
pub fn seal_with_u64_hash_index(&mut self)
v3 Phase 11: build a u64-hash content-addressed lookup index
using SealedU64Map (DHarhtMemory profile). After this call,
lookup_by_hash(h) returns the dictionary code for whichever
byte sequence hashes to h. The hash function is the
workspace’s deterministic crate::detcoll::hash_bytes so the
caller’s pre-computed hash and the index agree byte-for-byte.
Use case: snapshot diffing, content-addressed storage, reproducibility-critical pipelines where the hash is the canonical identifier.
This is independent of seal_for_lookup() — you can call
either, both, or neither. Both indices, when built, are
mutually consistent: they reference the same code space.
pub fn is_hash_indexed(&self) -> bool
True if the u64-hash index has been built.
pub fn lookup_by_hash(&self, hash: u64) -> Option<u64>
Look up a code by the deterministic u64 hash of its bytes.
Returns None if the hash is unknown OR if the hash index has
not been built (call seal_with_u64_hash_index() first).
Hash collision safety: this is a hash-only lookup with no
full byte equality check. Two distinct byte sequences hashing
to the same u64 would return one of the two codes. For
splitmix64-mixed hashes of well-distributed inputs, the chance
that a given pair collides is about 2^-64. For safety-critical
paths use lookup_by_hash_verify, which takes the original bytes
and verifies them.
pub fn lookup_by_hash_verify(&self, hash: u64, bytes: &[u8]) -> Option<u64>
Hash-keyed lookup with explicit byte verification. Returns the
code only if both (a) the hash maps to a known code AND (b) the
stored bytes for that code match bytes exactly. This closes
the roughly 2^-64 collision window of lookup_by_hash.
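The difference between the two hash lookups can be sketched as follows. Everything here is a stand-in: HashIndexSketch is hypothetical, and stand_in_hash is plain FNV-1a, used only because crate::detcoll::hash_bytes is not available outside the workspace. The point is the extra byte comparison in the verifying variant.

```rust
use std::collections::BTreeMap;

/// FNV-1a (64-bit), a stand-in for the crate's own deterministic hash.
pub fn stand_in_hash(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical hash index over a pool of distinct byte sequences.
pub struct HashIndexSketch {
    pool: Vec<Vec<u8>>,          // code -> bytes
    by_hash: BTreeMap<u64, u64>, // hash -> code
}

impl HashIndexSketch {
    pub fn build(values: Vec<Vec<u8>>) -> Self {
        let mut by_hash = BTreeMap::new();
        for (code, v) in values.iter().enumerate() {
            // On a hash collision the first code wins in this sketch.
            by_hash.entry(stand_in_hash(v)).or_insert(code as u64);
        }
        Self { pool: values, by_hash }
    }

    /// Hash-only lookup: trusts the hash, never compares bytes.
    pub fn lookup_by_hash(&self, hash: u64) -> Option<u64> {
        self.by_hash.get(&hash).copied()
    }

    /// Verified lookup: the stored bytes must equal the caller's bytes.
    pub fn lookup_by_hash_verify(&self, hash: u64, bytes: &[u8]) -> Option<u64> {
        let code = self.lookup_by_hash(hash)?;
        let stored = self.pool.get(code as usize)?;
        (stored.as_slice() == bytes).then_some(code)
    }
}
```

If a caller passes a hash that belongs to a different byte sequence, the hash-only lookup still returns a code, while the verifying lookup returns None.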
pub fn intern(&mut self, bytes: &[u8]) -> Result<u64, ByteDictError>
Intern a byte sequence. If not present and the dictionary is not
frozen, assigns a new code (only valid under FirstSeen /
ExtendDictionary-ish flows). Under Lexical ordering this method
works in streaming mode — codes are assigned in encounter order,
then seal_lexical() re-sorts them at the end. For Explicit,
intern errors on unknown values (the explicit list is the
authority).
pub fn intern_with_policy(
    &mut self,
    bytes: &[u8],
    policy: &UnknownCategoryPolicy,
) -> Result<InternedCode, ByteDictError>
Intern with an explicit unknown-value policy. This is the inference-time API: it does not require the dictionary to be unfrozen.
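One way to picture policy-driven interning on a frozen dictionary is sketched below. The variants of UnknownCategoryPolicy and the shape of InternedCode are not listed on this page, so PolicySketch, OutcomeSketch, UnknownValue, and resolve_frozen are all hypothetical illustrations, not the crate's API.

```rust
use std::collections::BTreeMap;

/// Hypothetical policy variants, for illustration only.
pub enum PolicySketch {
    /// Fail on values the dictionary has never seen.
    Reject,
    /// Map every unseen value to one reserved code.
    MapTo(u64),
}

/// Hypothetical result that records whether a substitution happened.
#[derive(Debug, PartialEq)]
pub enum OutcomeSketch {
    Known(u64),
    Substituted(u64),
}

/// Hypothetical error for an unseen value under the Reject policy.
#[derive(Debug, PartialEq)]
pub struct UnknownValue;

/// Resolves bytes against a frozen code map without ever extending it.
pub fn resolve_frozen(
    codes: &BTreeMap<Vec<u8>, u64>,
    bytes: &[u8],
    policy: &PolicySketch,
) -> Result<OutcomeSketch, UnknownValue> {
    if let Some(&code) = codes.get(bytes) {
        return Ok(OutcomeSketch::Known(code));
    }
    match policy {
        PolicySketch::Reject => Err(UnknownValue),
        PolicySketch::MapTo(fallback) => Ok(OutcomeSketch::Substituted(*fallback)),
    }
}
```

Returning a distinct outcome for substituted values lets a caller count how often inference hit an unseen category.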
pub fn seal_lexical(&mut self) -> Vec<u64>
Re-assign codes lexicographically (only useful when ordering is
Lexical). Returns the permutation old_code → new_code so
callers can rewrite their code arrays.
For FirstSeen and Explicit, this is a no-op and returns the
identity permutation.
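The old_code → new_code permutation can be sketched like this. It assumes the pool holds distinct byte sequences indexed by their encounter-order code; lexical_permutation and remap are hypothetical helpers, not part of the crate.

```rust
/// Given payloads indexed by their current (encounter-order) code, returns
/// perm such that perm[old_code] == new_code under lexicographic byte order.
pub fn lexical_permutation(pool: &[Vec<u8>]) -> Vec<u64> {
    // order[new_code] = old_code after sorting old codes by their bytes.
    let mut order: Vec<usize> = (0..pool.len()).collect();
    order.sort_by(|&a, &b| pool[a].cmp(&pool[b]));

    // Invert it to get old_code -> new_code.
    let mut perm = vec![0u64; pool.len()];
    for (new_code, &old_code) in order.iter().enumerate() {
        perm[old_code] = new_code as u64;
    }
    perm
}

/// Rewrites an array of old codes in place using the permutation.
pub fn remap(codes: &mut [u64], perm: &[u64]) {
    for c in codes.iter_mut() {
        *c = perm[*c as usize];
    }
}
```

For a pool of pear, apple, fig (old codes 0, 1, 2), the permutation is [2, 0, 1], so a code array [0, 1, 2, 0] becomes [2, 0, 1, 2].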
Trait Implementations§
impl Clone for ByteDictionary
fn clone(&self) -> ByteDictionary
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for ByteDictionary
Auto Trait Implementations§
impl Freeze for ByteDictionary
impl RefUnwindSafe for ByteDictionary
impl Send for ByteDictionary
impl Sync for ByteDictionary
impl Unpin for ByteDictionary
impl UnsafeUnpin for ByteDictionary
impl UnwindSafe for ByteDictionary
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.