similar/algorithms/mod.rs
1//! Various diff (longest common subsequence) algorithms.
2//!
3//! The implementations of the algorithms in this module are relatively low
4//! level and expose the most generic bounds possible for the algorithm. To
5//! use them you would typically use the higher level API if possible but
6//! direct access to these algorithms can be useful in some cases.
7//!
8//! All these algorithms provide a `diff` function which takes two indexable
9//! objects (for instance slices) and a [`DiffHook`]. As the
10//! diff is generated the diff hook is invoked. Note that the diff hook does
11//! not get access to the actual values but only the indexes. This is why the
12//! diff hook is not used outside of the raw algorithm implementations as for
13//! most situations access to the values is useful of required.
14//!
15//! The algorithms module really is the most low-level module in similar and
16//! generally not the place to start.
17//!
18//! # Which Algorithm Should You Use?
19//!
20//! For most users, start with **[`Algorithm::Myers`]**. It's the default for a
21//! reason: good general quality, good performance, and robust behavior with
22//! deadlines.
23//!
24//! If you need to tune behavior, use this rule of thumb:
25//!
26//! - **[`Algorithm::Myers`]** (default):
27//! best all-around choice for mixed workloads.
28//! - **[`Algorithm::Patience`]**:
29//! often more human-readable for refactors and reordered blocks, especially
30//! when there are unique lines/tokens to anchor on.
31//! - **[`Algorithm::Histogram`]**:
32//! good for noisy/repetitive inputs (logs, generated text, repeated lines),
33//! as it prefers low-frequency anchors over common noise.
34//! - **[`Algorithm::Hunt`]**:
35//! useful when matching pairs are relatively sparse and you want
36//! Hunt/LCS-style anchoring; can use more memory on highly repetitive input.
37//! - **[`Algorithm::Lcs`]**:
38//! mainly for small inputs, debugging, or reference behavior. It has
39//! `O(N*M)` time/space and is usually not a good default at scale.
40//!
41//! Trade-offs to keep in mind:
42//!
43//! - `Patience` can lose its edge when there are few unique anchors.
44//! - `Histogram` may prefer readability-oriented anchors over minimal edit
45//! scripts.
46//! - `Hunt` can degrade on inputs with many repeated matches (`R` grows large).
47//! - `Lcs` scales poorly and produces weak approximations when deadlines hit.
48//!
49//! # Heuristics
50//!
51//! Algorithm entrypoints in this module (`diff` / `diff_deadline`) are
52//! heuristic-enabled. In practice that means they use practical shortcuts to
53//! keep difficult inputs fast while still producing useful diff scripts.
54//!
55//! At a high level, the current heuristics are:
56//!
57//! - **Shared disjoint-range fast path** (all algorithms):
58//! if two large ranges appear to have no common items, we skip expensive
59//! search and emit a straight delete+insert replacement.
60//! - **Prefix/suffix trimming** (used widely):
61//! matching runs at the beginning/end are emitted immediately so each
62//! algorithm only works on the changed middle.
63//! - **Deadline-aware fallback behavior**:
64//! when a deadline is provided, algorithms periodically check it and may fall
65//! back to a simpler script instead of running too long.
66//! - **Algorithm-local anchor strategies**:
67//! - **Patience** anchors on items unique in both sides.
68//! - **Histogram** prefers low-frequency anchors and avoids very noisy lines.
69//! - **Hunt** uses match lists and longest-increasing anchor chains.
70//! - **Myers-specific safeguards**:
71//! Myers follows Eugene W. Myers' shortest-edit-script approach: it finds a
72//! "middle snake" (a central diagonal run of equal items on an optimal edit
73//! path) and recursively diffs the left and right sides around that split.
74//! Beyond that classic middle-snake recursion, this implementation adds a
75//! "front-anchor peel" for heavily unbalanced shifts (it probes a few small
76//! one-sided skips near the start to find a long shared run, emits that
77//! prefix anchor early, then recurses on the remaining tail) and an exact
78//! small-side fallback when one side is tiny and the other is large.
79//!
80//! Some heuristic-enabled entrypoints may require stricter trait bounds than
81//! their raw counterpart (for example, shared heuristics that build hash-based
82//! lookups require [`Hash`] + [`Eq`]). If your values need to be computed lazily
83//! or compared via a derived key, wrap them in [`CachedLookup`]
84//! first. In the remaining cases, `diff_deadline_raw` is the compatibility path
85//! with minimal bounds.
86//!
87//! If you want to skip shared heuristics, each algorithm module provides
88//! `diff_deadline_raw`, which keeps that algorithm's minimal intrinsic bounds.
89//!
90//! The top-level dispatcher [`diff_deadline`] always calls the
91//! heuristic-enabled entrypoints and never calls raw variants.
92//!
93//! # Sequence Adapters
94//!
95//! Two helpers are available when your input is not already a plain slice:
96//!
97//! - [`CachedLookup`]: lazily computes and caches sequence items on first access.
98//! - [`IdentifyDistinct`]: eagerly remaps values to dense integer IDs.
99//!
100//! # Example
101//!
102//! This is a simple example that shows how you can calculate the difference
103//! between two sequences and capture the ops into a vector.
104//!
105//! ```rust
106//! use similar::algorithms::{Algorithm, Replace, Capture, diff_slices};
107//!
108//! let a = vec![1, 2, 3, 4, 5];
109//! let b = vec![1, 2, 3, 4, 7];
110//! let mut d = Replace::new(Capture::new());
111//! diff_slices(Algorithm::Myers, &mut d, &a, &b).unwrap();
112//! let ops = d.into_inner().into_ops();
113//! ```
114//!
115//! The above example is equivalent to using
116//! [`capture_diff_slices`](crate::capture_diff_slices).
117
118pub use crate::lookup::CachedLookup;
119
120mod capture;
121mod compact;
122mod hook;
123mod preflight;
124mod replace;
125pub(crate) mod utils;
126
127#[cfg(test)]
128use alloc::vec::Vec;
129use core::hash::Hash;
130use core::ops::{Index, Range};
131
132use crate::deadline_support::Instant;
133pub use capture::Capture;
134pub use compact::Compact;
135pub use hook::{DiffHook, NoFinishHook};
136pub use replace::Replace;
137pub use utils::IdentifyDistinct;
138
139#[doc(no_inline)]
140pub use crate::Algorithm;
141
142pub mod histogram;
143pub mod hunt;
144pub mod lcs;
145pub mod myers;
146pub mod patience;
147
148/// Creates a diff between old and new with the given algorithm.
149///
150/// Diffs `old`, between indices `old_range` and `new` between indices `new_range`.
151pub fn diff<Old, New, D>(
152 alg: Algorithm,
153 d: &mut D,
154 old: &Old,
155 old_range: Range<usize>,
156 new: &New,
157 new_range: Range<usize>,
158) -> Result<(), D::Error>
159where
160 Old: Index<usize> + ?Sized,
161 New: Index<usize> + ?Sized,
162 D: DiffHook,
163 Old::Output: Hash + Eq,
164 New::Output: PartialEq<Old::Output> + Hash + Eq,
165{
166 diff_deadline(alg, d, old, old_range, new, new_range, None)
167}
168
169/// Creates a diff between old and new with the given algorithm with deadline.
170///
171/// Diffs `old`, between indices `old_range` and `new` between indices `new_range`.
172///
173/// This diff is done with an optional deadline that defines the maximal
174/// execution time permitted before it bails and falls back to an approximation.
175/// Note that not all algorithms behave well if they reach the deadline (LCS
176/// for instance produces a very simplistic diff when the deadline is reached
177/// in all cases).
178pub fn diff_deadline<Old, New, D>(
179 alg: Algorithm,
180 d: &mut D,
181 old: &Old,
182 old_range: Range<usize>,
183 new: &New,
184 new_range: Range<usize>,
185 deadline: Option<Instant>,
186) -> Result<(), D::Error>
187where
188 Old: Index<usize> + ?Sized,
189 New: Index<usize> + ?Sized,
190 D: DiffHook,
191 Old::Output: Hash + Eq,
192 New::Output: PartialEq<Old::Output> + Hash + Eq,
193{
194 match alg {
195 Algorithm::Myers => myers::diff_deadline(d, old, old_range, new, new_range, deadline),
196 Algorithm::Patience => patience::diff_deadline(d, old, old_range, new, new_range, deadline),
197 Algorithm::Lcs => lcs::diff_deadline(d, old, old_range, new, new_range, deadline),
198 Algorithm::Hunt => hunt::diff_deadline(d, old, old_range, new, new_range, deadline),
199 Algorithm::Histogram => {
200 histogram::diff_deadline(d, old, old_range, new, new_range, deadline)
201 }
202 }
203}
204
205/// Shortcut for diffing slices with a specific algorithm.
206pub fn diff_slices<D, T>(alg: Algorithm, d: &mut D, old: &[T], new: &[T]) -> Result<(), D::Error>
207where
208 D: DiffHook,
209 T: Eq + Hash,
210{
211 diff(alg, d, old, 0..old.len(), new, 0..new.len())
212}
213
214/// Shortcut for diffing slices with a specific algorithm.
215pub fn diff_slices_deadline<D, T>(
216 alg: Algorithm,
217 d: &mut D,
218 old: &[T],
219 new: &[T],
220 deadline: Option<Instant>,
221) -> Result<(), D::Error>
222where
223 D: DiffHook,
224 T: Eq + Hash,
225{
226 diff_deadline(alg, d, old, 0..old.len(), new, 0..new.len(), deadline)
227}
228
229#[test]
230fn test_disjoint_fast_path_cross_type_guard() {
231 use crate::DiffOp;
232 use std::hash::{Hash, Hasher};
233
234 #[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
235 struct A(u32);
236
237 impl Hash for A {
238 fn hash<H: Hasher>(&self, state: &mut H) {
239 self.0.hash(state);
240 }
241 }
242
243 #[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
244 struct B(u32);
245
246 impl Hash for B {
247 fn hash<H: Hasher>(&self, state: &mut H) {
248 (self.0.wrapping_add(1_000_000)).hash(state);
249 }
250 }
251
252 impl PartialEq<A> for B {
253 fn eq(&self, other: &A) -> bool {
254 self.0 == other.0
255 }
256 }
257
258 let n = 512u32;
259 let old: Vec<A> = (0..n).map(A).collect();
260 let mut new: Vec<B> = (0..n).map(B).collect();
261 new.rotate_left(1);
262
263 let ops = crate::capture_diff(Algorithm::Myers, &old, 0..old.len(), &new, 0..new.len());
264 let equal_len = ops
265 .iter()
266 .map(|op| match op {
267 DiffOp::Equal { len, .. } => *len,
268 _ => 0,
269 })
270 .sum::<usize>();
271
272 assert_eq!(equal_len, n as usize - 1);
273}