Skip to main content

ralph_workflow/json_parser/
deduplication.rs

1//! Delta deduplication using KMP and Rolling Hash algorithms.
2//!
3//! This module provides efficient deduplication for streaming deltas using:
4//! - **Rolling Hash (Rabin-Karp)**: Fast O(n) filtering to eliminate impossible matches
5//! - **KMP (Knuth-Morris-Pratt)**: O(n+m) verification for exact substring matching
6//! - **Strong Overlap Detection**: Thresholds and boundary checks to prevent false positives
7//!
8//! # Enhanced Deduplication
9//!
10//! The enhanced algorithm uses multiple layers of validation to prevent false positives:
11//!
12//! 1. **Rolling Hash Filter**: Fast O(n) check to eliminate impossible matches
13//! 2. **KMP Verification**: O(n+m) confirmation of actual substring match
14//! 3. **Overlap Threshold**: Only dedupe when overlap >= 30 chars AND >= 50% of delta
15//! 4. **Boundary Sanity**: Ensure overlap ends at whitespace/punctuation/newline
16//! 5. **Short Chunk Protection**: Chunks < 20 chars never deduped unless exact match
17//!
18//! # Architecture
19//!
20//! ```text
21//! Incoming Delta
22//!       │
23//!       ▼
24//! ┌─────────────────────┐
25//! │  Rolling Hash Check │  ◄── Compute hash of delta, compare against
26//! │  (Rabin-Karp)       │      sliding window hashes of accumulated content
27//! └──────────┬──────────┘
28//!            │
29//!     ┌──────┴──────┐
30//!     │ Hash Match? │
31//!     └──────┬──────┘
32//!       No   │   Yes
33//!       │    │
34//!       ▼    ▼
35//!    Accept  ┌─────────────────┐
36//!    Delta   │  KMP Verification│  ◄── Confirm actual substring match
37//!            └────────┬────────┘
38//!                     │
39//!              ┌──────┴──────┐
40//!              │True Match?  │
41//!              └──────┬──────┘
42//!                No   │   Yes
43//!                │    │
44//!                ▼    ▼
45//!             Accept  ┌─────────────────────┐
46//!             Delta   │ Strong Overlap Check│ ◄── >= 30 chars, >= 50%, safe boundary
47//!                     └──────────┬──────────┘
48//!                                │
49//!                         ┌──────┴──────┐
50//!                         │Measures?    │
51//!                         └──────┬──────┘
52//!                           No   │   Yes
53//!                           │    │
54//!                           ▼    ▼
55//!                        Accept  Extract New
56//!                        Delta   Portion Only
57//! ```
58
59// Threshold configuration and overlap detection
60include!("deduplication/thresholds.rs");
61
62// Rolling hash window for fast substring detection
63include!("deduplication/rolling_hash.rs");
64
65// KMP matcher for exact substring verification (test-only)
66include!("deduplication/kmp_matcher.rs");
67
68// Delta deduplicator orchestration
69include!("deduplication/deduplicator.rs");
70
71// Tests
72include!("deduplication/tests.rs");