Skip to main content

similar/algorithms/
mod.rs

1//! Various diff (longest common subsequence) algorithms.
2//!
3//! The implementations of the algorithms in this module are relatively low
4//! level and expose the most generic bounds possible for the algorithm.  To
5//! use them you would typically use the higher level API if possible but
6//! direct access to these algorithms can be useful in some cases.
7//!
8//! All these algorithms provide a `diff` function which takes two indexable
9//! objects (for instance slices) and a [`DiffHook`].  As the
10//! diff is generated the diff hook is invoked.  Note that the diff hook does
11//! not get access to the actual values but only the indexes.  This is why the
12//! diff hook is not used outside of the raw algorithm implementations as for
13//! most situations access to the values is useful of required.
14//!
15//! The algorithms module really is the most low-level module in similar and
16//! generally not the place to start.
17//!
18//! # Which Algorithm Should You Use?
19//!
20//! For most users, start with **[`Algorithm::Myers`]**.  It's the default for a
21//! reason: good general quality, good performance, and robust behavior with
22//! deadlines.
23//!
24//! If you need to tune behavior, use this rule of thumb:
25//!
26//! - **[`Algorithm::Myers`]** (default):
27//!   best all-around choice for mixed workloads.
28//! - **[`Algorithm::Patience`]**:
29//!   often more human-readable for refactors and reordered blocks, especially
30//!   when there are unique lines/tokens to anchor on.
31//! - **[`Algorithm::Histogram`]**:
32//!   good for noisy/repetitive inputs (logs, generated text, repeated lines),
33//!   as it prefers low-frequency anchors over common noise.
34//! - **[`Algorithm::Hunt`]**:
35//!   useful when matching pairs are relatively sparse and you want
36//!   Hunt/LCS-style anchoring; can use more memory on highly repetitive input.
37//! - **[`Algorithm::Lcs`]**:
38//!   mainly for small inputs, debugging, or reference behavior. It has
39//!   `O(N*M)` time/space and is usually not a good default at scale.
40//!
41//! Trade-offs to keep in mind:
42//!
43//! - `Patience` can lose its edge when there are few unique anchors.
44//! - `Histogram` may prefer readability-oriented anchors over minimal edit
45//!   scripts.
46//! - `Hunt` can degrade on inputs with many repeated matches (`R` grows large).
47//! - `Lcs` scales poorly and produces weak approximations when deadlines hit.
48//!
49//! # Heuristics
50//!
51//! Algorithm entrypoints in this module (`diff` / `diff_deadline`) are
52//! heuristic-enabled.  In practice that means they use practical shortcuts to
53//! keep difficult inputs fast while still producing useful diff scripts.
54//!
55//! At a high level, the current heuristics are:
56//!
57//! - **Shared disjoint-range fast path** (all algorithms):
58//!   if two large ranges appear to have no common items, we skip expensive
59//!   search and emit a straight delete+insert replacement.
60//! - **Prefix/suffix trimming** (used widely):
61//!   matching runs at the beginning/end are emitted immediately so each
62//!   algorithm only works on the changed middle.
63//! - **Deadline-aware fallback behavior**:
64//!   when a deadline is provided, algorithms periodically check it and may fall
65//!   back to a simpler script instead of running too long.
66//! - **Algorithm-local anchor strategies**:
67//!   - **Patience** anchors on items unique in both sides.
68//!   - **Histogram** prefers low-frequency anchors and avoids very noisy lines.
69//!   - **Hunt** uses match lists and longest-increasing anchor chains.
70//! - **Myers-specific safeguards**:
71//!   Myers follows Eugene W. Myers' shortest-edit-script approach: it finds a
72//!   "middle snake" (a central diagonal run of equal items on an optimal edit
73//!   path) and recursively diffs the left and right sides around that split.
74//!   Beyond that classic middle-snake recursion, this implementation adds a
75//!   "front-anchor peel" for heavily unbalanced shifts (it probes a few small
76//!   one-sided skips near the start to find a long shared run, emits that
77//!   prefix anchor early, then recurses on the remaining tail) and an exact
78//!   small-side fallback when one side is tiny and the other is large.
79//!
80//! Some heuristic-enabled entrypoints may require stricter trait bounds than
81//! their raw counterpart (for example, shared heuristics that build hash-based
82//! lookups require [`Hash`] + [`Eq`]). If your values need to be computed lazily
83//! or compared via a derived key, wrap them in [`CachedLookup`]
84//! first. In the remaining cases, `diff_deadline_raw` is the compatibility path
85//! with minimal bounds.
86//!
87//! If you want to skip shared heuristics, each algorithm module provides
88//! `diff_deadline_raw`, which keeps that algorithm's minimal intrinsic bounds.
89//!
90//! The top-level dispatcher [`diff_deadline`] always calls the
91//! heuristic-enabled entrypoints and never calls raw variants.
92//!
93//! # Sequence Adapters
94//!
95//! Two helpers are available when your input is not already a plain slice:
96//!
97//! - [`CachedLookup`]: lazily computes and caches sequence items on first access.
98//! - [`IdentifyDistinct`]: eagerly remaps values to dense integer IDs.
99//!
100//! # Example
101//!
102//! This is a simple example that shows how you can calculate the difference
103//! between two sequences and capture the ops into a vector.
104//!
105//! ```rust
106//! use similar::algorithms::{Algorithm, Replace, Capture, diff_slices};
107//!
108//! let a = vec![1, 2, 3, 4, 5];
109//! let b = vec![1, 2, 3, 4, 7];
110//! let mut d = Replace::new(Capture::new());
111//! diff_slices(Algorithm::Myers, &mut d, &a, &b).unwrap();
112//! let ops = d.into_inner().into_ops();
113//! ```
114//!
115//! The above example is equivalent to using
116//! [`capture_diff_slices`](crate::capture_diff_slices).
117
118pub use crate::lookup::CachedLookup;
119
120mod capture;
121mod compact;
122mod hook;
123mod preflight;
124mod replace;
125pub(crate) mod utils;
126
127#[cfg(test)]
128use alloc::vec::Vec;
129use core::hash::Hash;
130use core::ops::{Index, Range};
131
132use crate::deadline_support::Instant;
133pub use capture::Capture;
134pub use compact::Compact;
135pub use hook::{DiffHook, NoFinishHook};
136pub use replace::Replace;
137pub use utils::IdentifyDistinct;
138
139#[doc(no_inline)]
140pub use crate::Algorithm;
141
142pub mod histogram;
143pub mod hunt;
144pub mod lcs;
145pub mod myers;
146pub mod patience;
147
148/// Creates a diff between old and new with the given algorithm.
149///
150/// Diffs `old`, between indices `old_range` and `new` between indices `new_range`.
151pub fn diff<Old, New, D>(
152    alg: Algorithm,
153    d: &mut D,
154    old: &Old,
155    old_range: Range<usize>,
156    new: &New,
157    new_range: Range<usize>,
158) -> Result<(), D::Error>
159where
160    Old: Index<usize> + ?Sized,
161    New: Index<usize> + ?Sized,
162    D: DiffHook,
163    Old::Output: Hash + Eq,
164    New::Output: PartialEq<Old::Output> + Hash + Eq,
165{
166    diff_deadline(alg, d, old, old_range, new, new_range, None)
167}
168
169/// Creates a diff between old and new with the given algorithm with deadline.
170///
171/// Diffs `old`, between indices `old_range` and `new` between indices `new_range`.
172///
173/// This diff is done with an optional deadline that defines the maximal
174/// execution time permitted before it bails and falls back to an approximation.
175/// Note that not all algorithms behave well if they reach the deadline (LCS
176/// for instance produces a very simplistic diff when the deadline is reached
177/// in all cases).
178pub fn diff_deadline<Old, New, D>(
179    alg: Algorithm,
180    d: &mut D,
181    old: &Old,
182    old_range: Range<usize>,
183    new: &New,
184    new_range: Range<usize>,
185    deadline: Option<Instant>,
186) -> Result<(), D::Error>
187where
188    Old: Index<usize> + ?Sized,
189    New: Index<usize> + ?Sized,
190    D: DiffHook,
191    Old::Output: Hash + Eq,
192    New::Output: PartialEq<Old::Output> + Hash + Eq,
193{
194    match alg {
195        Algorithm::Myers => myers::diff_deadline(d, old, old_range, new, new_range, deadline),
196        Algorithm::Patience => patience::diff_deadline(d, old, old_range, new, new_range, deadline),
197        Algorithm::Lcs => lcs::diff_deadline(d, old, old_range, new, new_range, deadline),
198        Algorithm::Hunt => hunt::diff_deadline(d, old, old_range, new, new_range, deadline),
199        Algorithm::Histogram => {
200            histogram::diff_deadline(d, old, old_range, new, new_range, deadline)
201        }
202    }
203}
204
205/// Shortcut for diffing slices with a specific algorithm.
206pub fn diff_slices<D, T>(alg: Algorithm, d: &mut D, old: &[T], new: &[T]) -> Result<(), D::Error>
207where
208    D: DiffHook,
209    T: Eq + Hash,
210{
211    diff(alg, d, old, 0..old.len(), new, 0..new.len())
212}
213
214/// Shortcut for diffing slices with a specific algorithm.
215pub fn diff_slices_deadline<D, T>(
216    alg: Algorithm,
217    d: &mut D,
218    old: &[T],
219    new: &[T],
220    deadline: Option<Instant>,
221) -> Result<(), D::Error>
222where
223    D: DiffHook,
224    T: Eq + Hash,
225{
226    diff_deadline(alg, d, old, 0..old.len(), new, 0..new.len(), deadline)
227}
228
229#[test]
230fn test_disjoint_fast_path_cross_type_guard() {
231    use crate::DiffOp;
232    use std::hash::{Hash, Hasher};
233
234    #[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
235    struct A(u32);
236
237    impl Hash for A {
238        fn hash<H: Hasher>(&self, state: &mut H) {
239            self.0.hash(state);
240        }
241    }
242
243    #[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
244    struct B(u32);
245
246    impl Hash for B {
247        fn hash<H: Hasher>(&self, state: &mut H) {
248            (self.0.wrapping_add(1_000_000)).hash(state);
249        }
250    }
251
252    impl PartialEq<A> for B {
253        fn eq(&self, other: &A) -> bool {
254            self.0 == other.0
255        }
256    }
257
258    let n = 512u32;
259    let old: Vec<A> = (0..n).map(A).collect();
260    let mut new: Vec<B> = (0..n).map(B).collect();
261    new.rotate_left(1);
262
263    let ops = crate::capture_diff(Algorithm::Myers, &old, 0..old.len(), &new, 0..new.len());
264    let equal_len = ops
265        .iter()
266        .map(|op| match op {
267            DiffOp::Equal { len, .. } => *len,
268            _ => 0,
269        })
270        .sum::<usize>();
271
272    assert_eq!(equal_len, n as usize - 1);
273}