Skip to main content

gix_imara_diff/
lib.rs

1// Modified for gitoxide from the upstream imara-diff crate.
2// Upstream source: git cat-file -p 32d1e45d3df061e6ccba6db7fdce92db29e345d8:src/lib.rs
3
4#![deny(missing_docs)]
5//! Imara-diff is a solid (imara in Swahili) diff library for Rust.
6//! Solid refers to the fact that imara-diff provides very good runtime performance even
7//! in pathological cases so that your application never appears to freeze while waiting on a diff.
8//! The performance improvements are achieved using battle tested heuristics used in gnu-diff and git
9//! that are known to yield fast runtime and performance.
10//!
11//! Imara-diff is also designed to be flexible so that it can be used with arbitrary collections and
12//! not just lists and strings and even allows reusing large parts of the computation when
13//! comparing the same file to multiple different files.
14//!
15//! Imara-diff provides two diff algorithms:
16//!
17//! * The linear-space variant of the well known [**Myers** algorithm](http://www.xmailserver.org/diff2.pdf)
18//! * The **Histogram** algorithm which is a variant of the patience diff algorithm.
19//!
20//! Myers algorithm has been enhanced with preprocessing and multiple heuristics to ensure fast runtime in pathological
21//! cases to avoid quadratic time complexity and closely matches the behavior of gnu-diff and git.
22//! The Histogram algorithm was originally ported from git but has been heavily optimized.
23//! The **Histogram algorithm outperforms Myers diff** by 10% - 100% across a **wide variety of workloads**.
24//!
25//! Imara-diffs algorithms have been benchmarked over a wide variety of real-world code.
26//! For example, while comparing multiple different Linux kernel versions, it performs up to 30 times better than the `similar` crate.
27//!
28//! # API Overview
29//!
30//! ## Preparing the input
31//! To compute a diff, an input sequence is required. `imara-diff` computes diffs on abstract
32//! sequences represented as a slice of IDs/tokens: [`Token`]. To create
33//! such a sequence from your input type (for example, text), the input needs to be interned.
34//! For that `imara-diff` provides utilities in the form of the [`InternedInput`] struct and
35//! the `TokenSource` trait to construct it. [`InternedInput`] contains the two sides of
36//! the diff (used while computing the diff). As well as the interner that allows mapping
37//! back tokens to their original data.
38//!
39//! The most common use case for diff is comparing text. `&str` implements `TokenSource`
40//! by default to segment the text into lines. So creating an input for a text-based diff usually
41//! looks something like the following:
42//!
43//! ```
44//! # use gix_imara_diff::InternedInput;
45//! #
46//! let before = "abc\ndef";
47//! let after = "abc\ndefg";
48//! let input = InternedInput::new(before, after);
49//! assert_eq!(input.interner[input.before[0]], "abc\n");
50//! ```
51//!
52//! Note that interning inputs is optional, and you could choose a different strategy
53//! for creating a sequence of tokens. Instead of using the [`Diff::compute`] function,
54//! [`Diff::compute_with`] can be used to provide a list of tokens directly, entirely
55//! bypassing the interning step.
56//!
57//! ## Computing the Diff
58//!
59//! A diff of two sequences is represented by the [`Diff`] struct and computed by
60//! [`Diff::compute`] / [`Diff::compute_with`]. An algorithm can also be chosen here.
61//! In most situations, [`Algorithm::Histogram`] is a good choice; refer to the docs
62//! of [`Algorithm`] for more details.
63//!
64//! After the initial computation, the diff can be *postprocessed*. If the diff is shown
65//! to a human in some way (even indirectly), you always want to use this.
66//!
67//! However, when only counting the number of changed tokens quickly, this can be skipped.
68//! The postprocessing allows you to provide your own
69//! heuristic for selecting a slider position. An indentation-based heuristic is provided,
70//! which is a good fit for all text-based line diffs. The internals of the heuristic are
71//! public, so a tweaked heuristic can be built on top.
72//!
73//! ```
74//! # use gix_imara_diff::{InternedInput, Diff, Algorithm};
75//! #
76//! let before = "abc\ndef";
77//! let after = "abc\ndefg";
78//! let input = InternedInput::new(before, after);
79//! let mut diff = Diff::compute(Algorithm::Histogram, &input);
80//! diff.postprocess_lines(&input);
81//! assert!(!diff.is_removed(0) && !diff.is_added(0));
82//! assert!(diff.is_removed(1) && diff.is_added(1));
83//! ```
84//!
85//! ## Accessing results
86//!
87//! [`Diff`] allows querying whether a particular position was removed/added on either
88//! side of the diff with [`Diff::is_removed`] / [`Diff::is_added`]. The number
89//! of additions/removals can be quickly counted with [`Diff::count_removals`] /
90//! [`Diff::count_additions`]. The most powerful/useful interface is the hunk iterator
91//! [`Diff::hunks`], which returns a list of additions/removals/modifications in the
92//! order that they appear in the input.
93//!
94//! Finally, when built with the `unified_diff` feature, this crate also provides a
95//! built-in unified diff/patch formatter similar to `git diff` or `diff -u`.
96//! Note that while the formatter has a decent amount of flexibility, it is fairly
97//! simplistic and not every formatting may be possible. It's meant to cover common
98//! situations but not cover every advanced use case. Instead, if you need more advanced
99//! printing, build your own printer on top of the [`Diff::hunks`] iterator; for that, you can
100//! take inspiration from the built-in printer implementation.
101//!
102//! ```
103//! # use gix_imara_diff::{InternedInput, Diff, Algorithm, BasicLineDiffPrinter, UnifiedDiffConfig};
104//! #
105//!
106//! let before = r#"fn foo() -> Bar {
107//!     let mut foo = 2;
108//!     foo *= 50;
109//!     println!("hello world")
110//! }
111//! "#;
112//!
113//! let after = r#"// lorem ipsum
114//! fn foo() -> Bar {
115//!     let mut foo = 2;
116//!     foo *= 50;
117//!     println!("hello world");
118//!     println!("{foo}");
119//! }
120//! // foo
121//! "#;
122//! let input = InternedInput::new(before, after);
123//! let mut diff = Diff::compute(Algorithm::Histogram, &input);
124//! diff.postprocess_lines(&input);
125//!
126//! assert_eq!(
127//!     diff.unified_diff(
128//!         &BasicLineDiffPrinter(&input.interner),
129//!         UnifiedDiffConfig::default(),
130//!         &input,
131//!     )
132//!     .to_string(),
133//!     r#"@@ -1,5 +1,8 @@
134//! +// lorem ipsum
135//!  fn foo() -> Bar {
136//!      let mut foo = 2;
137//!      foo *= 50;
138//! -    println!("hello world")
139//! +    println!("hello world");
140//! +    println!("{foo}");
141//!  }
142//! +// foo
143//! "#
144//! );
145//! ```
146
147use std::ops::Range;
148use std::slice;
149
150use crate::{
151    sources::words,
152    util::{strip_common_postfix, strip_common_prefix},
153};
154
155pub use crate::slider_heuristic::{IndentHeuristic, IndentLevel, NoSliderHeuristic, SliderHeuristic};
156pub use intern::{InternedInput, Interner, Token, TokenSource};
157#[cfg(feature = "unified_diff")]
158pub use unified_diff::{BasicLineDiffPrinter, UnifiedDiff, UnifiedDiffConfig, UnifiedDiffPrinter};
159
160mod histogram;
161mod intern;
162mod myers;
163mod postprocess;
164mod slider_heuristic;
165pub mod sources;
166#[cfg(test)]
167mod tests;
168#[cfg(feature = "unified_diff")]
169mod unified_diff;
170mod util;
171
172/// `imara-diff` supports multiple different algorithms
173/// for computing an edit sequence.
174/// These algorithms have different performance and all produce different output.
175#[derive(Debug, PartialEq, Eq, Clone, Copy, Default)]
176pub enum Algorithm {
177    /// A variation of the [`patience` diff algorithm described by Bram Cohen's blog post](https://bramcohen.livejournal.com/73318.html)
178    /// that uses a histogram to find the least common LCS.
179    /// Just like the `patience` diff algorithm, this algorithm usually produces
180    /// more human-readable output than Myers algorithm.
181    /// However, compared to the `patience` diff algorithm (which is slower than Myers algorithm),
182    /// the Histogram algorithm performs much better.
183    ///
184    /// The implementation here was originally ported from `git` but has been significantly
185    /// modified to improve performance.
186    /// As a result, it consistently **performs better than Myers algorithm** (5%-100%) over
187    /// a wide variety of test data.
188    ///
189    /// For pathological subsequences that only contain highly repeating tokens (64+ occurrences)
190    /// the algorithm falls back on Myers algorithm (with heuristics) to avoid quadratic behavior.
191    ///
192    /// Compared to Myers algorithm, the Histogram diff algorithm is more focused on providing
193    /// human-readable diffs instead of minimal diffs. In practice, this means that the edit sequences
194    /// produced by the histogram diff are often longer than those produced by Myers algorithm.
195    ///
196    /// The heuristic used by the histogram diff does not work well for inputs with small (often repeated)
197    /// tokens. For example, **character diffs do not work well** as most (English) text is made up of
198    /// a fairly small set of characters. The `Histogram` algorithm will automatically detect these cases and
199    /// fall back to Myers algorithm. However, this detection has a nontrivial overhead, so
200    /// if it's known upfront that the sort of tokens is very small, `Myers` algorithm should
201    /// be used instead.
202    #[default]
203    Histogram,
204    /// An implementation of the linear space variant of
205    /// [Myers  `O((N+M)D)` algorithm](http://www.xmailserver.org/diff2.pdf).
206    /// The algorithm is enhanced with preprocessing that removes
207    /// tokens that don't occur in the other file at all.
208    /// Furthermore, two heuristics for the middle snake search are implemented
209    /// that ensure reasonable runtime (mostly linear time complexity) even for large files.
210    ///
211    /// Due to the divide-and-conquer nature of the algorithm,
212    /// the edit sequences produced are still fairly small even when the middle snake
213    /// search is aborted by a heuristic.
214    /// However, the produced edit sequences are not guaranteed to be fully minimal.
215    /// If that property is vital to you, use the `MyersMinimal` algorithm instead.
216    ///
217    /// The implementation (including the preprocessing) is mostly
218    /// ported from `git` and `gnu-diff`, where Myers algorithm is used
219    /// as the default diff algorithm.
220    /// Therefore, the used heuristics have been heavily battle-tested and
221    /// are known to behave well over a large variety of inputs.
222    Myers,
223    /// Same as `Myers` but the early abort heuristics are disabled to guarantee
224    /// a minimal edit sequence.
225    /// This can mean significant slowdown in pathological cases.
226    MyersMinimal,
227}
228
229/// Represents the difference between two sequences of tokens.
230///
231/// A `Diff` stores which tokens were removed from the first sequence and which tokens were added to the second sequence.
232#[derive(Default)]
233pub struct Diff {
234    /// Tracks which tokens were removed from the first sequence (`before`), with
235    /// one entry for each one in the `before` sequence.
236    removed: Vec<bool>,
237    /// Tracks which tokens were added to the second sequence (`after`), with
238    /// one entry for each one in the `after` sequence.
239    added: Vec<bool>,
240}
241
242impl std::fmt::Debug for Diff {
243    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
244        f.debug_list().entries(self.hunks()).finish()
245    }
246}
247
248impl Diff {
249    /// Computes an edit-script that transforms `input.before` into `input.after` using
250    /// the specified `algorithm`
251    pub fn compute<T>(algorithm: Algorithm, input: &InternedInput<T>) -> Diff {
252        let mut diff = Diff::default();
253        diff.compute_with(algorithm, &input.before, &input.after, input.interner.num_tokens());
254        diff
255    }
256
257    /// Computes an edit-script that transforms `before` into `after` using
258    /// the specified `algorithm`.
259    pub fn compute_with(&mut self, algorithm: Algorithm, mut before: &[Token], mut after: &[Token], num_tokens: u32) {
260        assert!(
261            before.len() < i32::MAX as usize,
262            "imara-diff only supports up to {} tokens",
263            i32::MAX
264        );
265        assert!(
266            after.len() < i32::MAX as usize,
267            "imara-diff only supports up to {} tokens",
268            i32::MAX
269        );
270        self.removed.clear();
271        self.added.clear();
272        self.removed.resize(before.len(), false);
273        self.added.resize(after.len(), false);
274        let common_prefix = strip_common_prefix(&mut before, &mut after) as usize;
275        let common_postfix = strip_common_postfix(&mut before, &mut after);
276        let range = common_prefix..self.removed.len() - common_postfix as usize;
277        let removed = &mut self.removed[range];
278        let range = common_prefix..self.added.len() - common_postfix as usize;
279        let added = &mut self.added[range];
280        match algorithm {
281            Algorithm::Histogram => histogram::diff(before, after, removed, added, num_tokens),
282            Algorithm::Myers => myers::diff(before, after, removed, added, false),
283            Algorithm::MyersMinimal => myers::diff(before, after, removed, added, true),
284        }
285    }
286
287    /// Returns the total number of tokens that were added in the second sequence.
288    pub fn count_additions(&self) -> u32 {
289        self.added.iter().map(|&added| u32::from(added)).sum()
290    }
291
292    /// Returns the total number of tokens that were removed from the first sequence (`before`).
293    pub fn count_removals(&self) -> u32 {
294        self.removed.iter().map(|&removed| u32::from(removed)).sum()
295    }
296
297    /// Returns `true` if the token at the given index was removed from the first sequence (`before`).
298    ///
299    /// # Panics
300    ///
301    /// Panics if `token_idx` is out of bounds for the first sequence.
302    pub fn is_removed(&self, token_idx: u32) -> bool {
303        self.removed[token_idx as usize]
304    }
305
306    /// Returns `true` if the token at the given index was added to the second sequence (`after`).
307    ///
308    /// # Panics
309    ///
310    /// Panics if `token_idx` is out of bounds for the second sequence (`after`).
311    pub fn is_added(&self, token_idx: u32) -> bool {
312        self.added[token_idx as usize]
313    }
314
315    /// Postprocesses the diff to make it more human-readable. Certain hunks
316    /// have an ambiguous placement (even in a minimal diff) where they can move
317    /// downward or upward by removing a token (line) at the start and adding
318    /// one at the end (or the other way around). The postprocessing adjusts
319    /// these hunks according to a couple of rules:
320    ///
321    /// * Always merge multiple hunks if possible.
322    /// * Always try to create a single MODIFY hunk instead of multiple disjoint
323    ///   ADDED/REMOVED hunks.
324    /// * Move sliders as far down as possible.
325    pub fn postprocess_no_heuristic<T>(&mut self, input: &InternedInput<T>) {
326        self.postprocess_with_heuristic(input, NoSliderHeuristic)
327    }
328
329    /// Postprocesses the diff to make it more human-readable. Certain hunks
330    /// have an ambiguous placement (even in a minimal diff) where they can move
331    /// downward or upward by removing a token (line) at the start and adding
332    /// one at the end (or the other way around). The postprocessing adjusts
333    /// these hunks according to a couple of rules:
334    ///
335    /// * Always merge multiple hunks if possible.
336    /// * Always try to create a single MODIFY hunk instead of multiple disjoint
337    ///   ADDED/REMOVED hunks.
338    /// * Based on a line's indentation level, heuristically compute the most
339    ///   intuitive location to split lines.
340    /// * Move sliders as far down as possible.
341    pub fn postprocess_lines<T: AsRef<[u8]>>(&mut self, input: &InternedInput<T>) {
342        self.postprocess_with_heuristic(
343            input,
344            IndentHeuristic::new(|token| {
345                IndentLevel::for_ascii_line(input.interner[token].as_ref().iter().copied(), 8)
346            }),
347        )
348    }
349
350    /// Return an iterator that yields the changed hunks in this diff.
351    pub fn hunks(&self) -> HunkIter<'_> {
352        HunkIter {
353            removed: self.removed.iter(),
354            added: self.added.iter(),
355            pos_before: 0,
356            pos_after: 0,
357        }
358    }
359}
360
361/// A single change in a `Diff` that represents a range of tokens (`before`)
362/// in the first sequence that were replaced by a different range of tokens
363/// in the second sequence (`after`).
364///
365/// Each hunk identifies a contiguous region of change, where tokens from the `before` range
366/// should be replaced with tokens from the `after` range.
367#[derive(Debug, Clone, PartialEq, Eq, Hash, Default)]
368pub struct Hunk {
369    /// The range of token indices in the first sequence (`before`) that were removed.
370    pub before: Range<u32>,
371    /// The range of token indices in the second sequence (`after`) that were added.
372    pub after: Range<u32>,
373}
374
375impl Hunk {
376    /// Can be used instead of `Option::None` for better performance.
377    /// Because `imara-diff` does not support more than `i32::MAX` there is an unused bit pattern that can be used.
378    ///
379    /// It has some nice properties where it usually is not necessary to check for `None` separately:
380    /// Empty ranges fail contains checks and also fail smaller than checks.
381    pub const NONE: Hunk = Hunk {
382        before: u32::MAX..u32::MAX,
383        after: u32::MAX..u32::MAX,
384    };
385
386    /// Inverts a hunk so that it represents a change
387    /// that would undo this hunk.
388    pub fn invert(&self) -> Hunk {
389        Hunk {
390            before: self.after.clone(),
391            after: self.before.clone(),
392        }
393    }
394
395    /// Returns whether tokens are only inserted and not removed in this hunk.
396    pub fn is_pure_insertion(&self) -> bool {
397        self.before.is_empty()
398    }
399
400    /// Returns whether tokens are only removed and not inserted in this hunk.
401    pub fn is_pure_removal(&self) -> bool {
402        self.after.is_empty()
403    }
404
405    /// Performs a word-diff on this hunk.
406    ///
407    /// This requires passing the original [`input`](InternedInput) in order to look up
408    /// the tokens of the current hunk, which typically are lines.
409    /// Each token is split into words using the built-in [`words`] tokenizer.
410    /// The resulting word tokens are stored in a second [`diff_input`](InternedInput),
411    /// and a [`diff`](Diff) is computed on them, with basic post-processing applied.
412    ///
413    /// For performance reasons, this second [`diff_input`](InternedInput) as well as
414    /// the computed [`diff`](Diff) need to be passed as parameters so that they can be
415    /// re-used when iterating over hunks. Note that word tokens are always
416    /// added but never removed from the interner. Consider clearing it if you expect
417    /// your input to have a large vocabulary.
418    ///
419    /// # Examples
420    ///
421    /// ```
422    /// # use gix_imara_diff::{InternedInput, Diff, Algorithm};
423    /// // Compute diff normally
424    /// let before = "before text";
425    /// let after = "after text";
426    /// let mut lines = InternedInput::new(before, after);
427    /// let mut diff = Diff::compute(Algorithm::Histogram, &lines);
428    /// diff.postprocess_lines(&lines);
429    ///
430    /// // Compute word-diff per hunk, reusing allocations across iterations
431    /// let mut hunk_diff_input = InternedInput::default();
432    /// let mut hunk_diff = Diff::default();
433    /// for hunk in diff.hunks() {
434    ///   hunk.latin_word_diff(&lines, &mut hunk_diff_input, &mut hunk_diff);
435    ///   let added = hunk_diff.count_additions();
436    ///   let removed = hunk_diff.count_removals();
437    ///   println!("word-diff of this hunk has {added} additions and {removed} removals");
438    ///   // optionally, clear the interner:
439    ///   hunk_diff_input.clear();
440    /// }
441    /// ```
442    pub fn latin_word_diff<'a>(
443        &self,
444        input: &InternedInput<&'a str>,
445        word_tokens: &mut InternedInput<&'a str>,
446        diff: &mut Diff,
447    ) {
448        let Hunk { before, after } = self.clone();
449        word_tokens.update_before(
450            before
451                .map(|index| input.before[index as usize])
452                .map(|token| input.interner[token])
453                .flat_map(|line| words(line)),
454        );
455        word_tokens.update_after(
456            after
457                .map(|index| input.after[index as usize])
458                .map(|token| input.interner[token])
459                .flat_map(|line| words(line)),
460        );
461        diff.removed.clear();
462        diff.removed.resize(word_tokens.before.len(), false);
463        diff.added.clear();
464        diff.added.resize(word_tokens.after.len(), false);
465        if self.is_pure_removal() {
466            diff.removed.fill(true);
467        } else if self.is_pure_insertion() {
468            diff.added.fill(true);
469        } else {
470            diff.compute_with(
471                Algorithm::Myers,
472                &word_tokens.before,
473                &word_tokens.after,
474                word_tokens.interner.num_tokens(),
475            );
476            diff.postprocess_no_heuristic(word_tokens);
477        }
478    }
479}
480
481/// Yields all [`Hunk`]s in a file in monotonically increasing order.
482/// Monotonically increasing means here that the following holds for any two
483/// consecutive [`Hunk`]s `x` and `y`:
484///
485/// ``` no_compile
486/// assert!(x.before.end < y.before.start);
487/// assert!(x.after.end < y.after.start);
488/// ```
489///
490pub struct HunkIter<'diff> {
491    removed: slice::Iter<'diff, bool>,
492    added: slice::Iter<'diff, bool>,
493    pos_before: u32,
494    pos_after: u32,
495}
496
497impl Iterator for HunkIter<'_> {
498    type Item = Hunk;
499
500    fn next(&mut self) -> Option<Self::Item> {
501        loop {
502            let removed = (&mut self.removed).take_while(|&&removed| removed).count() as u32;
503            let added = (&mut self.added).take_while(|&&added| added).count() as u32;
504            if removed != 0 || added != 0 {
505                let start_before = self.pos_before;
506                let start_after = self.pos_after;
507                self.pos_before += removed;
508                self.pos_after += added;
509                let hunk = Hunk {
510                    before: start_before..self.pos_before,
511                    after: start_after..self.pos_after,
512                };
513                self.pos_before += 1;
514                self.pos_after += 1;
515                return Some(hunk);
516            } else if self.removed.len() == 0 && self.added.len() == 0 {
517                return None;
518            } else {
519                self.pos_before += 1;
520                self.pos_after += 1;
521            }
522        }
523    }
524}