imara_diff/
lib.rs

1//! Imara-diff is a solid (imara in swahili) diff library for rust.
2//! Solid refers to the fact that imara-diff provides very good runtime performance even
3//! in pathologic cases so that your application never appears to freeze while waiting on a diff.
4//! The performance improvements are achieved using battle tested heuristics used in gnu-diff and git
5//! that are known to yield fast runtime and performance.
6//!
7//! Imara-diff is also designed to be flexible so that it can be used with arbitrary collections and
8//! not just lists and strings and even allows reusing large parts of the computation when
9//! comparing the same file to multiple different files.
10//!
11//! Imara-diff provides two diff algorithms:
12//!
13//! * The linear-space variant of the well known [**Myers** algorithm](http://www.xmailserver.org/diff2.pdf)
14//! * The **Histogram** algorithm which is a variant of the patience diff algorithm.
15//!
16//! Myers algorithm has been enhanced with preprocessing and multiple heuristics to ensure fast runtime in pathological
17//! cases to avoid quadratic time complexity and closely matches the behaviour of gnu-diff and git.
18//! The Histogram algorithm was originally ported from git but has been heavily optimized.
19//! The **Histogram algorithm outperforms Myers diff** by 10% - 100% across a **wide variety of workloads**.
20//!
21//! Imara-diffs algorithms have been benchmarked over a wide variety of real-world code.
22//! For example while comparing multiple different linux kernel it performs up to 30 times better than the `similar` crate:
23#![cfg_attr(doc, doc=concat!("<img width=\"600\" class=\"figure\" src=\"data:image/svg+xml;base64,", include_str!("../plots/linux_comparison.svg.base64"), "\"></img>"))]
24//!
25//! # API Overview
26//!
27//! ## Preparing the input
28//! To compute an input sequence is required. `imara-diff` computes diffs on abstract
29//! sequences represented as a slice of ids/tokens: [`Token`](crate::Token). To create
30//! such a sequence from your input type (for example text) the input needs to be interned.
31//! For that `imara-diff` provides utilites in the form of `InternedInput` struct and
32//! `TokenSource` trait to construct it. The [`InternedInput`] contains the two sides of
33//! the diff (used while computing the diff). As well as the interner that allows mapping
34//! back tokens to ther original data.
35//!
36//! The most common usecase for diff is comparing text. `&str` implements `TokenSource`
37//! by default segmentent the text into lines. So creating an input for a text-based diff
38//! usually looks something like the following:
39//!
40//! ```
41//! # use imara_diff::InternedInput;
42//! #
43//! let before = "abc\ndef";
44//! let after = "abc\ndefg";
45//! let input = InternedInput::new(before, after);
46//! assert_eq!(input.interner[input.before[0]], "abc\n");
47//! ```
48//!
49//! Note that interning inputs is optional and you could choose a different strategy
50//! for creating a sequence of tokens. Instead of using the [`Diff::compute`] function,
51//! [`Diff::compute_with`] can be used to provide a list of tokens directly, entirely
52//! bypassing the interning step.
53//!
54//! ## Computing the Diff
55//!
56//! A diff of two sequences is represented by the [`Diff`] struct and computed by
57//! [`Diff::compute`] / [`Diff::compute_with`]. Here also an algorithm can be choosen.
58//! In most situations [`Algorithm::Histogram`] is a good choice, refer to the docs
59//! of [`Algorithm`] for more details. After the initial computation the diff can be
60//! postprocessed. If the diff is shown to a human you in some way (even indirectly) you
61//! always want to use this. However, when only counting the number of changed tokens
62//! quickly this can be skipped. The postprocessing allows you to provide your own
63//! heuristic for selecting a slider position. An indentation-based heuristic is provided
64//! which is a good fit for all text-based line-diffs. The internals of the heuristic are
65//! public so a tweaked heuristic can be built on top.
66//!
67//! ```
68//! # use imara_diff::{InternedInput, Diff, Algorithm};
69//! #
70//! let before = "abc\ndef";
71//! let after = "abc\ndefg";
72//! let input = InternedInput::new(before, after);
73//! let mut diff = Diff::compute(Algorithm::Histogram, &input);
74//! diff.postprocess_lines(&input);
75//! assert!(!diff.is_removed(0) && !diff.is_added(0));
76//! assert!(diff.is_removed(1) && diff.is_added(1));
77//! ```
78//!
79//! ## Accessing results
80//!
81//! [`Diff`] allows to query wether a particular position was removed/added on either
82//! side of the diff with [`Diff::is_removed`] / [`Diff::is_added`]. The number
83//! of addtions/removeals can be quickly counted with [`Diff::count_removals`] /
84//! [`Diff::count_additions`]. The most powerful/useful interface is the hunk iterator
85//! [`Diff::hunks`] which will return a list of additions/removals/modifications in the
86//! order that they appear in the input.
87//!
88//! Finnally if the `unified_diff` feature is enabled a diff can be printed with
89//! [`Diff::unified_diff`] to print a unified diff/patch as shown by `git diff` or `diff
90//! -u`. Note that while the unified diff has a decent amount of flexability it is fairly
91//! simplistic and not every formatting may be possible. It's meant to cover common
92//! situations but not cover every advanced usecase. Instead if you need more advanced
93//! printing built your own printer on top of the `Diff::hunks` iterator, for that you can
94//! take inspiration from the builtin printer.
95//!
96//! ```
97//! # use imara_diff::{InternedInput, Diff, Algorithm, BasicLineDiffPrinter, UnifiedDiffConfig};
98//! #
99//!
100//! let before = r#"fn foo() -> Bar {
101//!     let mut foo = 2;
102//!     foo *= 50;
103//!     println!("hello world")
104//! }
105//! "#;
106//!
107//! let after = r#"// lorem ipsum
108//! fn foo() -> Bar {
109//!     let mut foo = 2;
110//!     foo *= 50;
111//!     println!("hello world");
112//!     println!("{foo}");
113//! }
114//! // foo
115//! "#;
116//! let input = InternedInput::new(before, after);
117//! let mut diff = Diff::compute(Algorithm::Histogram, &input);
118//! diff.postprocess_lines(&input);
119//!
120//! assert_eq!(
121//!     diff.unified_diff(
122//!         &BasicLineDiffPrinter(&input.interner),
123//!         UnifiedDiffConfig::default(),
124//!         &input,
125//!     )
126//!     .to_string(),
127//!     r#"@@ -1,5 +1,8 @@
128//! +// lorem ipsum
129//!  fn foo() -> Bar {
130//!      let mut foo = 2;
131//!      foo *= 50;
132//! -    println!("hello world")
133//! +    println!("hello world");
134//! +    println!("{foo}");
135//!  }
136//! +// foo
137//! "#
138//! );
139//! ```
140
141use std::ops::Range;
142use std::slice;
143
144use crate::util::{strip_common_postfix, strip_common_prefix};
145
146pub use crate::slider_heuristic::{
147    IndentHeuristic, IndentLevel, NoSliderHeuristic, SliderHeuristic,
148};
149pub use intern::{InternedInput, Interner, Token, TokenSource};
150#[cfg(feature = "unified_diff")]
151pub use unified_diff::{BasicLineDiffPrinter, UnifiedDiff, UnifiedDiffConfig, UnifiedDiffPrinter};
152
153mod histogram;
154mod intern;
155mod myers;
156mod postprocess;
157mod slider_heuristic;
158pub mod sources;
159#[cfg(feature = "unified_diff")]
160mod unified_diff;
161mod util;
162
163#[cfg(test)]
164mod tests;
165
166/// `imara-diff` supports multiple different algorithms
167/// for computing an edit sequence.
168/// These algorithms have different performance and all produce different output.
169#[derive(Debug, PartialEq, Eq, Clone, Copy, Default)]
170pub enum Algorithm {
171    /// A variation of the [`patience` diff algorithm described by Bram Cohen's blog post](https://bramcohen.livejournal.com/73318.html)
172    /// that uses a histogram to find the least common LCS.
173    /// Just like the `patience` diff algorithm, this algorithm usually produces
174    /// more human readable output then myers algorithm.
175    /// However compared to the `patience` diff algorithm (which is slower then myers algorithm),
176    /// the Histogram algorithm performs much better.
177    ///
178    /// The implementation here was originally ported from `git` but has been significantly
179    /// modified to improve performance.
180    /// As a result it consistently **performs better then myers algorithm** (5%-100%) over
181    /// a wide variety of test data. For example a benchmark of diffing linux kernel commits is shown below:
182    #[cfg_attr(doc, doc=concat!("<img width=\"600\" class=\"figure\" src=\"data:image/svg+xml;base64,", include_str!("../plots/linux_speedup.svg.base64"), "\"></img>"))]
183    ///
184    /// For pathological subsequences that only contain highly repeating tokens (64+ occurrences)
185    /// the algorithm falls back on Myers algorithm (with heuristics) to avoid quadratic behavior.
186    ///
187    /// Compared to Myers algorithm, the Histogram diff algorithm is more focused on providing
188    /// human readable diffs instead of minimal diffs. In practice this means that the edit-sequences
189    /// produced by the histogram diff are often longer then those produced by Myers algorithm.
190    ///
191    /// The heuristic used by the histogram diff does not work well for inputs with small (often repeated)
192    /// tokens. For example **character diffs do not work well** as most (english) text is madeup of
193    /// a fairly small set of characters. The `Histogram` algorithm will automatically these cases and
194    /// fallback to Myers algorithm. However this detection has a nontrivial overhead, so
195    /// if its known upfront that the sort of tokens is very small `Myers` algorithm should
196    /// be used instead.
197    #[default]
198    Histogram,
199    /// An implementation of the linear space variant of
200    /// [Myers  `O((N+M)D)` algorithm](http://www.xmailserver.org/diff2.pdf).
201    /// The algorithm is enhanced with preprocessing that removes
202    /// tokens that don't occur in the other file at all.
203    /// Furthermore two heuristics to the middle snake search are implemented
204    /// that ensure reasonable runtime (mostly linear time complexity) even for large files.
205    ///
206    /// Due to the divide and conquer nature of the algorithm
207    /// the edit sequenced produced are still fairly small even when the middle snake
208    /// search is aborted by a heuristic.
209    /// However, the produced edit sequences are not guaranteed to be fully minimal.
210    /// If that property is vital to you, use the `MyersMinimal` algorithm instead.
211    ///
212    /// The implementation (including the preprocessing) are mostly
213    /// ported from `git` and `gnu-diff` where Myers algorithm is used
214    /// as the default diff algorithm.
215    /// Therefore the used heuristics have been heavily battle tested and
216    /// are known to behave well over a large variety of inputs
217    Myers,
218    /// Same as `Myers` but the early abort heuristics are disabled to guarantee
219    /// a minimal edit sequence.
220    /// This can mean significant slowdown in pathological cases.
221    MyersMinimal,
222}
223
224impl Algorithm {
225    #[cfg(test)]
226    const ALL: [Self; 2] = [Algorithm::Histogram, Algorithm::Myers];
227}
228
229#[derive(Default)]
230pub struct Diff {
231    removed: Vec<bool>,
232    added: Vec<bool>,
233}
234
235impl std::fmt::Debug for Diff {
236    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
237        f.debug_list().entries(self.hunks()).finish()
238    }
239}
240
241impl Diff {
242    /// Computes an edit-script that transforms `input.before` into `input.after` using
243    /// the specified `algorithm`
244    pub fn compute<T>(algorithm: Algorithm, input: &InternedInput<T>) -> Diff {
245        let mut diff = Diff::default();
246        diff.compute_with(
247            algorithm,
248            &input.before,
249            &input.after,
250            input.interner.num_tokens(),
251        );
252        diff
253    }
254
255    /// Computes an edit-script that transforms `before` into `after` using
256    /// the specified `algorithm`.
257    pub fn compute_with(
258        &mut self,
259        algorithm: Algorithm,
260        mut before: &[Token],
261        mut after: &[Token],
262        num_tokens: u32,
263    ) {
264        assert!(
265            before.len() < i32::MAX as usize,
266            "imara-diff only supports up to {} tokens",
267            i32::MAX
268        );
269        assert!(
270            after.len() < i32::MAX as usize,
271            "imara-diff only supports up to {} tokens",
272            i32::MAX
273        );
274        self.removed.clear();
275        self.added.clear();
276        self.removed.resize(before.len(), false);
277        self.added.resize(after.len(), false);
278        let common_prefix = strip_common_prefix(&mut before, &mut after) as usize;
279        let common_postfix = strip_common_postfix(&mut before, &mut after);
280        let range = common_prefix..self.removed.len() - common_postfix as usize;
281        let removed = &mut self.removed[range];
282        let range = common_prefix..self.added.len() - common_postfix as usize;
283        let added = &mut self.added[range];
284        match algorithm {
285            Algorithm::Histogram => histogram::diff(before, after, removed, added, num_tokens),
286            Algorithm::Myers => myers::diff(before, after, removed, added, false),
287            Algorithm::MyersMinimal => myers::diff(before, after, removed, added, true),
288        }
289    }
290
291    pub fn count_additions(&self) -> u32 {
292        self.added.iter().map(|&added| added as u32).sum()
293    }
294
295    pub fn count_removals(&self) -> u32 {
296        self.removed.iter().map(|&removed| removed as u32).sum()
297    }
298
299    pub fn is_removed(&self, token_idx: u32) -> bool {
300        self.removed[token_idx as usize]
301    }
302
303    pub fn is_added(&self, token_idx: u32) -> bool {
304        self.added[token_idx as usize]
305    }
306
307    /// Postprocesses the diff to make it more human readable. Certain bvhunks
308    /// have an amigous placement (event in a minimal diff) where they can move
309    /// downward or upward by removing a token (line) at the start and adding
310    /// one at the end (or the other way around). The postprocessing adjust
311    /// these hunks according to a coulple rules:
312    ///
313    /// * Always merge multiple hunks if possible.
314    /// * Always try to create a single MODIFY hunk instead of multiple disjoint
315    ///   ADDED/REMOVED hunks
316    /// * Move sliders as far down as possible.
317    pub fn postprocess_no_heuristic<T>(&mut self, input: &InternedInput<T>) {
318        self.postprocess_with_heuristic(input, NoSliderHeuristic)
319    }
320
321    /// Postprocesses the diff to make it more human readable. Certain bvhunks
322    /// have an amigous placement (event in a minimal diff) where they can move
323    /// downward or upward by removing a token (line) at the start and adding
324    /// one at the end (or the other way around). The postprocessing adjust
325    /// these hunks according to a coulple rules:
326    ///
327    /// * Always merge multiple hunks if possible.
328    /// * Always try to create a single MODIFY hunk instead of multiple disjoint
329    ///   ADDED/REMOVED hunks
330    /// * based on a lines indentation level heuristically compute the most
331    ///   intutive location to split lines.
332    /// * Move sliders as far down as possible.
333    pub fn postprocess_lines<T: AsRef<[u8]>>(&mut self, input: &InternedInput<T>) {
334        self.postprocess_with_heuristic(
335            input,
336            IndentHeuristic::new(|token| {
337                IndentLevel::for_ascii_line(input.interner[token].as_ref().iter().copied(), 8)
338            }),
339        )
340    }
341
342    /// An iterator that yields the changed hunks in this diff
343    pub fn hunks(&self) -> HunkIter<'_> {
344        HunkIter {
345            removed: self.removed.iter(),
346            added: self.added.iter(),
347            pos_before: 0,
348            pos_after: 0,
349        }
350    }
351}
352
353/// A single change in a `Diff` that represents a range of tokens (`before`)
354/// in the first sequence that were replaced by a diffrerent range of tokens
355/// in the second sequence (`after`).
356///
357/// Tokens that are a
358#[derive(Debug, Clone, PartialEq, Eq, Hash, Default)]
359pub struct Hunk {
360    pub before: Range<u32>,
361    pub after: Range<u32>,
362}
363
364impl Hunk {
365    /// Can be used instead of `Option::None` for better performance.
366    /// Because `imara-diff` does not support more then `i32::MAX` there is an unused bit pattern that can be used.
367    /// Has some nice properties where it usually is not necessary to check for `None` seperatly:
368    /// Empty ranges fail contains checks and also faills smaller then checks.
369    pub const NONE: Hunk = Hunk {
370        before: u32::MAX..u32::MAX,
371        after: u32::MAX..u32::MAX,
372    };
373
374    /// Inverts a hunk so that it represents a change
375    /// that would undo this hunk.
376    pub fn invert(&self) -> Hunk {
377        Hunk {
378            before: self.after.clone(),
379            after: self.before.clone(),
380        }
381    }
382
383    /// Returns whether tokens are only inserted and not removed in this hunk.
384    pub fn is_pure_insertion(&self) -> bool {
385        self.before.is_empty()
386    }
387
388    /// Returns whether tokens are only removed and not remved in this hunk.
389    pub fn is_pure_removal(&self) -> bool {
390        self.after.is_empty()
391    }
392}
393
394/// Yields all [`Hunk`]s in a file in montonically increasing order.
395/// Montonically increasing means here that the following holds for any two
396/// consequtive [`Hunk`]s `x` and `y`:
397///
398/// ``` no_compile
399/// assert!(x.before.end < y.before.start);
400/// assert!(x.after.end < y.after.start);
401/// ```
402///
403pub struct HunkIter<'diff> {
404    removed: slice::Iter<'diff, bool>,
405    added: slice::Iter<'diff, bool>,
406    pos_before: u32,
407    pos_after: u32,
408}
409
410impl Iterator for HunkIter<'_> {
411    type Item = Hunk;
412
413    fn next(&mut self) -> Option<Self::Item> {
414        loop {
415            let removed = (&mut self.removed).take_while(|&&removed| removed).count() as u32;
416            let added = (&mut self.added).take_while(|&&added| added).count() as u32;
417            if removed != 0 || added != 0 {
418                let start_before = self.pos_before;
419                let start_after = self.pos_after;
420                self.pos_before += removed;
421                self.pos_after += added;
422                let hunk = Hunk {
423                    before: start_before..self.pos_before,
424                    after: start_after..self.pos_after,
425                };
426                self.pos_before += 1;
427                self.pos_after += 1;
428                return Some(hunk);
429            } else if self.removed.len() == 0 && self.added.len() == 0 {
430                return None;
431            } else {
432                self.pos_before += 1;
433                self.pos_after += 1;
434            }
435        }
436    }
437}