Skip to main content

Module deduplication

Module deduplication 

Source
Expand description

Delta deduplication using KMP and Rolling Hash algorithms.

This module provides efficient deduplication for streaming deltas using:

  • Rolling Hash (Rabin-Karp): Fast O(n) filtering to eliminate impossible matches
  • KMP (Knuth-Morris-Pratt): O(n+m) verification for exact substring matching
  • Strong Overlap Detection: Thresholds and boundary checks to prevent false positives

§Enhanced Deduplication

The enhanced algorithm uses multiple layers of validation to prevent false positives:

  1. Rolling Hash Filter: Fast O(n) check to eliminate impossible matches
  2. KMP Verification: O(n+m) confirmation of actual substring match
  3. Overlap Threshold: Only dedupe when overlap >= 30 chars AND >= 50% of delta
  4. Boundary Sanity: Ensure overlap ends at whitespace/punctuation/newline
  5. Short Chunk Protection: Chunks < 20 chars never deduped unless exact match

§Architecture

Incoming Delta
      │
      ▼
┌─────────────────────┐
│  Rolling Hash Check │  ◄── Compute hash of delta, compare against
│  (Rabin-Karp)       │      sliding window hashes of accumulated content
└──────────┬──────────┘
           │
    ┌──────┴──────┐
    │ Hash Match? │
    └──────┬──────┘
      No   │   Yes
      │    │
      ▼    ▼
   Accept  ┌─────────────────┐
   Delta   │  KMP Verification│  ◄── Confirm actual substring match
           └────────┬────────┘
                    │
             ┌──────┴──────┐
             │True Match?  │
             └──────┬──────┘
               No   │   Yes
               │    │
               ▼    ▼
            Accept  ┌─────────────────────┐
            Delta   │ Strong Overlap Check│ ◄── >= 30 chars, >= 50%, safe boundary
                    └──────────┬──────────┘
                               │
                        ┌──────┴──────┐
                        │Measures?    │
                        └──────┬──────┘
                          No   │   Yes
                          │    │
                          ▼    ▼
                       Accept  Extract New
                       Delta   Portion Only

Structs§

DeltaDeduplicator
Delta deduplicator using rolling hash and KMP.
OverlapScore
Score representing the “strength” of an overlap.
OverlapThresholds
Configuration for strong overlap detection.
RealThresholdEnvironment
Production implementation that reads from actual environment.
RollingHashWindow
Rolling hash window for fast substring detection.

Traits§

ThresholdEnvironment
Trait for accessing environment variables.

Functions§

get_overlap_thresholds
get_overlap_thresholds_with_env
Testable variant that accepts an environment trait for dependency injection.