ralph_workflow/json_parser/deduplication.rs
1//! Delta deduplication using KMP and Rolling Hash algorithms.
2//!
3//! This module provides efficient deduplication for streaming deltas using:
4//! - **Rolling Hash (Rabin-Karp)**: Fast O(n) filtering to eliminate impossible matches
5//! - **KMP (Knuth-Morris-Pratt)**: O(n+m) verification for exact substring matching
6//! - **Strong Overlap Detection**: Thresholds and boundary checks to prevent false positives
7//!
8//! # Enhanced Deduplication
9//!
10//! The enhanced algorithm uses multiple layers of validation to prevent false positives:
11//!
12//! 1. **Rolling Hash Filter**: Fast O(n) check to eliminate impossible matches
13//! 2. **KMP Verification**: O(n+m) confirmation of actual substring match
14//! 3. **Overlap Threshold**: Only dedupe when overlap >= 30 chars AND >= 50% of delta
15//! 4. **Boundary Sanity**: Ensure overlap ends at whitespace/punctuation/newline
16//! 5. **Short Chunk Protection**: Chunks < 20 chars never deduped unless exact match
17//!
18//! # Architecture
19//!
20//! ```text
21//! Incoming Delta
22//! │
23//! ▼
24//! ┌─────────────────────┐
25//! │ Rolling Hash Check │ ◄── Compute hash of delta, compare against
26//! │ (Rabin-Karp) │ sliding window hashes of accumulated content
27//! └──────────┬──────────┘
28//! │
29//! ┌──────┴──────┐
30//! │ Hash Match? │
31//! └──────┬──────┘
32//! No │ Yes
33//! │ │
34//! ▼ ▼
35//! Accept ┌─────────────────┐
36//! Delta │ KMP Verification│ ◄── Confirm actual substring match
37//! └────────┬────────┘
38//! │
39//! ┌──────┴──────┐
40//! │True Match? │
41//! └──────┬──────┘
42//! No │ Yes
43//! │ │
44//! ▼ ▼
45//! Accept ┌─────────────────────┐
46//! Delta │ Strong Overlap Check│ ◄── >= 30 chars, >= 50%, safe boundary
47//! └──────────┬──────────┘
48//! │
49//! ┌──────┴──────┐
50//! │Measures? │
51//! └──────┬──────┘
52//! No │ Yes
53//! │ │
54//! ▼ ▼
55//! Accept Extract New
56//! Delta Portion Only
57//! ```
58
59// Threshold configuration and overlap detection
60include!("deduplication/thresholds.rs");
61
62// Rolling hash window for fast substring detection
63include!("deduplication/rolling_hash.rs");
64
65// KMP matcher for exact substring verification (test-only)
66include!("deduplication/kmp_matcher.rs");
67
68// Delta deduplicator orchestration
69include!("deduplication/deduplicator.rs");
70
71// Tests
72include!("deduplication/tests.rs");