Crate imara_diff

Expand description

Imara-diff is a solid (imara in swahili) diff library for rust. Solid refers to the fact that imara-diff provides very good runtime performance even in pathologic cases so that your application never appears to freeze while waiting on a diff. The performance improvements are achieved using battle tested heuristics used in gnu-diff and git that are known to yield fast runtime and performance.

Imara-diff is also designed to be flexible so that it can be used with arbitrary collections and not just lists and strings and even allows reusing large parts of the computation when comparing the same file to multiple different files.

Imara-diff provides two diff algorithms:

The linear-space variant of the well known Myers algorithm
The Histogram algorithm which is a variant of the patience diff algorithm.

Myers algorithm has been enhanced with preprocessing and multiple heuristics to ensure fast runtime in pathological cases to avoid quadratic time complexity and closely matches the behaviour of gnu-diff and git. The Histogram algorithm was originally ported from git but has been heavily optimized. The Histogram algorithm outperforms Myers diff by 10% - 100% across a wide variety of workloads.

Imara-diffs algorithms have been benchmarked over a wide variety of real-world code. For example while comparing multiple different linux kernel it performs up to 30 times better than the similar crate:

§API Overview

§Preparing the input

To compute an input sequence is required. imara-diff computes diffs on abstract sequences represented as a slice of ids/tokens: Token. To create such a sequence from your input type (for example text) the input needs to be interned. For that imara-diff provides utilites in the form of InternedInput struct and TokenSource trait to construct it. The InternedInput contains the two sides of the diff (used while computing the diff). As well as the interner that allows mapping back tokens to ther original data.

The most common usecase for diff is comparing text. &str implements TokenSource by default segmentent the text into lines. So creating an input for a text-based diff usually looks something like the following:

let before = "abc\ndef";
let after = "abc\ndefg";
let input = InternedInput::new(before, after);
assert_eq!(input.interner[input.before[0]], "abc\n");

Note that interning inputs is optional and you could choose a different strategy for creating a sequence of tokens. Instead of using the Diff::compute function, Diff::compute_with can be used to provide a list of tokens directly, entirely bypassing the interning step.

§Computing the Diff

A diff of two sequences is represented by the Diff struct and computed by Diff::compute / Diff::compute_with. Here also an algorithm can be choosen. In most situations Algorithm::Histogram is a good choice, refer to the docs of Algorithm for more details. After the initial computation the diff can be postprocessed. If the diff is shown to a human you in some way (even indirectly) you always want to use this. However, when only counting the number of changed tokens quickly this can be skipped. The postprocessing allows you to provide your own heuristic for selecting a slider position. An indentation-based heuristic is provided which is a good fit for all text-based line-diffs. The internals of the heuristic are public so a tweaked heuristic can be built on top.

let before = "abc\ndef";
let after = "abc\ndefg";
let input = InternedInput::new(before, after);
let mut diff = Diff::compute(Algorithm::Histogram, &input);
diff.postprocess_lines(&input);
assert!(!diff.is_removed(0) && !diff.is_added(0));
assert!(diff.is_removed(1) && diff.is_added(1));

§Accessing results

Diff allows to query wether a particular position was removed/added on either side of the diff with Diff::is_removed / Diff::is_added. The number of addtions/removeals can be quickly counted with Diff::count_removals / Diff::count_additions. The most powerful/useful interface is the hunk iterator Diff::hunks which will return a list of additions/removals/modifications in the order that they appear in the input.

Finnally if the unified_diff feature is enabled a diff can be printed with Diff::unified_diff to print a unified diff/patch as shown by git diff or diff -u. Note that while the unified diff has a decent amount of flexability it is fairly simplistic and not every formatting may be possible. It’s meant to cover common situations but not cover every advanced usecase. Instead if you need more advanced printing built your own printer on top of the Diff::hunks iterator, for that you can take inspiration from the builtin printer.


let before = r#"fn foo() -> Bar {
    let mut foo = 2;
    foo *= 50;
    println!("hello world")
}
"#;

let after = r#"// lorem ipsum
fn foo() -> Bar {
    let mut foo = 2;
    foo *= 50;
    println!("hello world");
    println!("{foo}");
}
// foo
"#;
let input = InternedInput::new(before, after);
let mut diff = Diff::compute(Algorithm::Histogram, &input);
diff.postprocess_lines(&input);

assert_eq!(
    diff.unified_diff(
        &BasicLineDiffPrinter(&input.interner),
        UnifiedDiffConfig::default(),
        &input,
    )
    .to_string(),
    r#"@@ -1,5 +1,8 @@
+// lorem ipsum
 fn foo() -> Bar {
     let mut foo = 2;
     foo *= 50;
-    println!("hello world")
+    println!("hello world");
+    println!("{foo}");
 }
+// foo
"#
);

Modules§

sources

Structs§

BasicLineDiffPrinter
Diff
Hunk: A single change in a Diff that represents a range of tokens (before) in the first sequence that were replaced by a diffrerent range of tokens in the second sequence (after).
HunkIter: Yields all Hunks in a file in montonically increasing order. Montonically increasing means here that the following holds for any two consequtive Hunks x and y:
IndentHeuristic
IndentLevel
InternedInput: Two lists of interned tokens that a Diff can be computed from.
Interner: An interner that allows for fast access of tokens produced by a TokenSource.
NoSliderHeuristic
Token: A token represented as an interned integer.
UnifiedDiff
UnifiedDiffConfig

Enums§

Algorithm: imara-diff supports multiple different algorithms for computing an edit sequence. These algorithms have different performance and all produce different output.

Traits§

SliderHeuristic
TokenSource
UnifiedDiffPrinter

Crate imara_diffCopy item path