Expand description
Delta deduplication using KMP and Rolling Hash algorithms.
This module provides efficient deduplication for streaming deltas using:
- Rolling Hash (Rabin-Karp): Fast O(n) filtering to eliminate impossible matches
- KMP (Knuth-Morris-Pratt): O(n+m) verification for exact substring matching
- Strong Overlap Detection: Thresholds and boundary checks to prevent false positives
§Enhanced Deduplication
The enhanced algorithm uses multiple layers of validation to prevent false positives:
- Rolling Hash Filter: Fast O(n) check to eliminate impossible matches
- KMP Verification: O(n+m) confirmation of actual substring match
- Overlap Threshold: Only dedupe when overlap >= 30 chars AND >= 50% of delta
- Boundary Sanity: Ensure overlap ends at whitespace/punctuation/newline
- Short Chunk Protection: Chunks < 20 chars never deduped unless exact match
§Architecture
Incoming Delta
│
▼
┌─────────────────────┐
│ Rolling Hash Check │ ◄── Compute hash of delta, compare against
│ (Rabin-Karp) │ sliding window hashes of accumulated content
└──────────┬──────────┘
│
┌──────┴──────┐
│ Hash Match? │
└──────┬──────┘
No │ Yes
│ │
▼ ▼
Accept ┌─────────────────┐
Delta │ KMP Verification│ ◄── Confirm actual substring match
└────────┬────────┘
│
┌──────┴──────┐
│True Match? │
└──────┬──────┘
No │ Yes
│ │
▼ ▼
Accept ┌─────────────────────┐
Delta │ Strong Overlap Check│ ◄── >= 30 chars, >= 50%, safe boundary
└──────────┬──────────┘
│
┌──────┴──────┐
│Measures? │
└──────┬──────┘
No │ Yes
│ │
▼ ▼
Accept Extract New
Delta Portion OnlyStructs§
- Delta
Deduplicator - Delta deduplicator using rolling hash and KMP.
- Overlap
Score - Score representing the “strength” of an overlap.
- Overlap
Thresholds - Configuration for strong overlap detection.
- Real
Threshold Environment - Production implementation that reads from actual environment.
- Rolling
Hash Window - Rolling hash window for fast substring detection.
Traits§
- Threshold
Environment - Trait for accessing environment variables.
Functions§
- get_
overlap_ thresholds - get_
overlap_ thresholds_ with_ env - Testable variant that accepts an environment trait for dependency injection.