Expand description
Imara-diff is a solid (imara in swahili) diff library for rust. Solid refers to the fact that imara-diff provides very good runtime performance even in pathologic cases so that your application never appears to freeze while waiting on a diff. The performance improvements are achieved using battle tested heuristics used in gnu-diff and git that are known to yield fast runtime and performance.
Imara-diff is also designed to be flexible so that it can be used with arbitrary collections and not just lists and strings and even allows reusing large parts of the computation when comparing the same file to multiple different files.
Imara-diff provides two diff algorithms:
- The linear-space variant of the well known Myers algorithm
- The Histogram algorithm which is a variant of the patience diff algorithm.
Myers algorithm has been enhanced with preprocessing and multiple heuristics to ensure fast runtime in pathological cases to avoid quadratic time complexity and closely matches the behaviour of gnu-diff and git. The Histogram algorithm was originally ported from git but has been heavily optimized. The Histogram algorithm outperforms Myers diff by 10% - 100% across a wide variety of workloads.
Imara-diffs algorithms have been benchmarked over a wide variety of real-world code.
For example while comparing multiple different linux kernel it performs up to 30 times better than the similar
crate:
§API Overview
§Preparing the input
To compute an input sequence is required. imara-diff
computes diffs on abstract
sequences represented as a slice of ids/tokens: Token
. To create
such a sequence from your input type (for example text) the input needs to be interned.
For that imara-diff
provides utilites in the form of InternedInput
struct and
TokenSource
trait to construct it. The InternedInput
contains the two sides of
the diff (used while computing the diff). As well as the interner that allows mapping
back tokens to ther original data.
The most common usecase for diff is comparing text. &str
implements TokenSource
by default segmentent the text into lines. So creating an input for a text-based diff
usually looks something like the following:
let before = "abc\ndef";
let after = "abc\ndefg";
let input = InternedInput::new(before, after);
assert_eq!(input.interner[input.before[0]], "abc\n");
Note that interning inputs is optional and you could choose a different strategy
for creating a sequence of tokens. Instead of using the Diff::compute
function,
Diff::compute_with
can be used to provide a list of tokens directly, entirely
bypassing the interning step.
§Computing the Diff
A diff of two sequences is represented by the Diff
struct and computed by
Diff::compute
/ Diff::compute_with
. Here also an algorithm can be choosen.
In most situations Algorithm::Histogram
is a good choice, refer to the docs
of Algorithm
for more details. After the initial computation the diff can be
postprocessed. If the diff is shown to a human you in some way (even indirectly) you
always want to use this. However, when only counting the number of changed tokens
quickly this can be skipped. The postprocessing allows you to provide your own
heuristic for selecting a slider position. An indentation-based heuristic is provided
which is a good fit for all text-based line-diffs. The internals of the heuristic are
public so a tweaked heuristic can be built on top.
let before = "abc\ndef";
let after = "abc\ndefg";
let input = InternedInput::new(before, after);
let mut diff = Diff::compute(Algorithm::Histogram, &input);
diff.postprocess_lines(&input);
assert!(!diff.is_removed(0) && !diff.is_added(0));
assert!(diff.is_removed(1) && diff.is_added(1));
§Accessing results
Diff
allows to query wether a particular position was removed/added on either
side of the diff with Diff::is_removed
/ Diff::is_added
. The number
of addtions/removeals can be quickly counted with Diff::count_removals
/
Diff::count_additions
. The most powerful/useful interface is the hunk iterator
Diff::hunks
which will return a list of additions/removals/modifications in the
order that they appear in the input.
Finnally if the unified_diff
feature is enabled a diff can be printed with
Diff::unified_diff
to print a unified diff/patch as shown by git diff
or diff -u
. Note that while the unified diff has a decent amount of flexability it is fairly
simplistic and not every formatting may be possible. It’s meant to cover common
situations but not cover every advanced usecase. Instead if you need more advanced
printing built your own printer on top of the Diff::hunks
iterator, for that you can
take inspiration from the builtin printer.
let before = r#"fn foo() -> Bar {
let mut foo = 2;
foo *= 50;
println!("hello world")
}
"#;
let after = r#"// lorem ipsum
fn foo() -> Bar {
let mut foo = 2;
foo *= 50;
println!("hello world");
println!("{foo}");
}
// foo
"#;
let input = InternedInput::new(before, after);
let mut diff = Diff::compute(Algorithm::Histogram, &input);
diff.postprocess_lines(&input);
assert_eq!(
diff.unified_diff(
&BasicLineDiffPrinter(&input.interner),
UnifiedDiffConfig::default(),
&input,
)
.to_string(),
r#"@@ -1,5 +1,8 @@
+// lorem ipsum
fn foo() -> Bar {
let mut foo = 2;
foo *= 50;
- println!("hello world")
+ println!("hello world");
+ println!("{foo}");
}
+// foo
"#
);
Modules§
Structs§
- Basic
Line Diff Printer - Diff
- Hunk
- A single change in a
Diff
that represents a range of tokens (before
) in the first sequence that were replaced by a diffrerent range of tokens in the second sequence (after
). - Hunk
Iter - Yields all
Hunk
s in a file in montonically increasing order. Montonically increasing means here that the following holds for any two consequtiveHunk
sx
andy
: - Indent
Heuristic - Indent
Level - Interned
Input - Two lists of interned tokens that a
Diff
can be computed from. - Interner
- An interner that allows for fast access of tokens produced by a
TokenSource
. - NoSlider
Heuristic - Token
- A token represented as an interned integer.
- Unified
Diff - Unified
Diff Config
Enums§
- Algorithm
imara-diff
supports multiple different algorithms for computing an edit sequence. These algorithms have different performance and all produce different output.