Expand description
ffuzzy: ssdeep-compatible Fuzzy Hashing Library in pure Rust
ssdeep is a program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length.
This crate is the port of ssdeep (libfuzzy) to the Rust language, created by a ssdeep maintainer, Tsukasa OI.
This crate is designed to be a replacement to the original ssdeep library, libfuzzy. So, it implements some “easy” functions for daily use cases.
Some interface originates from ffuzzy++, a C++ port of libfuzzy written by Tsukasa OI with additional features. They enabled more efficient handling of fuzzy hashes on large scale clustering.
If you understand both the property of fuzzy hashes and this crate well, you can cluster the fuzzy hashes over 5 times faster than libfuzzy.
License (GNU GPL v2 or later)
This crate (as a whole library) is licensed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
However, some portions are licensed under more permissive licenses (see the source code for details).
Performance
While ffuzzy++ performed well in large scale clustering, some use cases were slower than libfuzzy. In contrast, this crate expects (at least) comparable performance to libfuzzy even if only “easy” functions are used and no unsafe features are enabled.
If we unlock the performance by the unsafe feature, it’s generally faster than
libfuzzy and even comparable to ffuzzy++ (depends on various conditions, though).
*_unchecked functions will be useful when you use this crate as a part of
specialized large scale clustering applications.
Features New in this Crate
Dual fuzzy hash object
While the fuzzy hash generator normally produces fuzzy hashes without normalization but comparing two fuzzy hashes requires two normalized ones. It enforced users to preserve both normalized and raw fuzzy hashes to collerate the original (raw) fuzzy hash and the comparison-friendly (normalized) one.
In this crate, DualFuzzyHash and LongDualFuzzyHash allows storing both forms
efficiently, achieving the compression ratio of about 5 / 8.
Usage: Basic
Hashing a File
// Required Features: "std" and "easy-functions"
fn main() -> Result<(), ssdeep::GeneratorOrIOError> {
let fuzzy_hash = ssdeep::hash_file("data/examples/hello.txt")?;
let fuzzy_hash_str = fuzzy_hash.to_string();
assert_eq!(fuzzy_hash_str, "3:aaX8v:aV");
Ok(())
}Comparing Two Fuzzy Hashes
// Required Feature: "easy-functions"
let score = ssdeep::compare(
"6:3ll7QzDkmJmMHkQoO/llSZEnEuLszmbMAWn:VqDk5QtLbW",
"6:3ll7QzDkmQjmMoDHglHOxPWT0lT0lT0lB:VqDk+n"
).unwrap();
assert_eq!(score, 46);Usage: Advanced
Hashing a Buffer
use ssdeep::{Generator, RawFuzzyHash};
let mut generator = Generator::new();
let buf: &[u8] = b"Hello, World!";
// Optional but supplying the *total* input size first improves the performance.
// `buf.len` for `update` and `1` for `update_by_iter` (see below).
generator.set_fixed_input_size_in_usize(buf.len() + 1).unwrap();
// Update the internal state of the generator.
// Of course, you can call `update`-family functions multiple times.
generator.update(buf);
generator.update_by_iter(core::iter::repeat(b'\n').take(1)); // append one '\n'
// Retrieve the fuzzy hash and convert to the string.
let hash: RawFuzzyHash = generator.finalize().unwrap();
assert_eq!(hash.to_string(), "3:aaX8v:aV");Comparing Fuzzy Hashes
use core::str::FromStr;
use ssdeep::{FuzzyHash, FuzzyHashCompareTarget};
// Those fuzzy hash strings are "normalized" so that easier to compare.
let str1 = "12288:+ySwl5P+C5IxJ845HYV5sxOH/cccccccei:+Klhav84a5sxJ";
let str2 = "12288:+yUwldx+C5IxJ845HYV5sxOH/cccccccex:+glvav84a5sxK";
let hash1: FuzzyHash = FuzzyHash::from_str(str1).unwrap();
let hash2: FuzzyHash = FuzzyHash::from_str(str2).unwrap();
// Note that converting the (normalized) fuzzy hash object back to the string
// may not preserve the original string. To preserve the original fuzzy hash
// string too, consider using dual fuzzy hashes (such like DualFuzzyHash) that
// preserves the original string in the compressed format.
// * str1: "12288:+ySwl5P+C5IxJ845HYV5sxOH/cccccccei:+Klhav84a5sxJ"
// * hash1: "12288:+ySwl5P+C5IxJ845HYV5sxOH/cccei:+Klhav84a5sxJ"
assert_ne!(hash1.to_string(), str1);
// If we have number of fuzzy hashes and a hash is compared more than once,
// storing those hashes as FuzzyHash objects is faster.
assert_eq!(hash1.compare(&hash2), 88);
// But there's another way of comparison.
// If you compare "a fuzzy hash" with "other many fuzzy hashes", this method
// (using FuzzyHashCompareTarget as "a fuzzy hash") is much, much faster.
let mut target: FuzzyHashCompareTarget = FuzzyHashCompareTarget::new();
target.init_from(&hash1);
assert_eq!(target.compare(&hash2), 88);Features
alloc,std(default)
This crate supportsno_std(by disabling both of them) andallocandstdare built on the minimumno_stdimplementation. Those features enable implementations that depend onallocandstd, respectively.easy-functions(default)
It provides easy-to-use high-level functions.unsafe(fast but unsafe)
This crate is optionally unsafe. By default, this crate is built with 100% safe Rust (this default might change before version 1.0 but safe Rust code will be preserved). Enabling this feature enables unsafe Rust code (although unsafe/safe code share the most using macros).nightly
This feature enables some features specific to the Nightly Rust. Note that this feature heavily depends on the version ofrustcand should not be considered stable (don’t expect SemVer-compatible semantics).opt-reduce-fnv-table(not recommended to enable this)
ssdeep uses partial (the lowest 6 bits of) FNV hash. While default table lookup instead of full FNV hash computation is faster on most cases, it will not affect the performance much on some configurations. Enabling this option will turn off using precomputed FNV hash table (4KiB). Note that it’s not recommended to enable this feature for memory footprint since a generator is about 2KiB in size and a temporary object used for fuzzy hash comparison is about 1KiB in size (so that reducing 4KiB does not benefit well).tests-slowandtests-very-slow
They will enable “slow” (may take seconds) and “very slow” (may take minutes) tests, respectively.
History and Main Contributors of ssdeep
Andrew Tridgell made the program called “spamsum” to detect a mail similar to a known spam.
Jesse Kornblum authored the program “ssdeep” based on spamsum by adding solid engine to Andrew’s work. Jesse continued working to improve ssdeep for years.
Helmut Grohne authored his re-written and optimized, streaming fuzzy hashing engine that enabled multi-threaded runs and a capability to process files without seeking.
Tsukasa OI, first helped resolving the license issue on the edit distance code (which was not open source), further optimized the engine and introduced bit-parallel string processing functions. He wrote ssdeep compatible engines multiple times, including ffuzzy++.
References
- Jesse Kornblum (2006) “Identifying almost identical files using context triggered piecewise hashing” (doi:10.1016/j.diin.2006.06.015)
For Developers
Modules
- Utility (constants) related to block hash part of the fuzzy hash.
- Utility related to block size part of the fuzzy hash.
- Module containing certain constraints about fuzzy hash data.
- Module containing internal efficient block hash implementation.
- Module containing internal hash functions.
Structs
- An efficient position array-based fuzzy hash comparison target.
- An efficient fixed size fuzzy hash representation.
- An efficient compressed fuzzy hash representation, containing both normalized and raw block hash contents.
- Fuzzy hash generator.
- The error type for parse operations of
FuzzyHashData. - ParseErrorEither
easy-functionsThe error type representing a parse error for one of the operands specified to thecomparefunction.
Enums
- A type to represent relation between two block sizes.
- An enumeration representing a cause of a generic fuzzy hash error.
- The error type representing an invalid or an unsupported operation of the generator.
- GeneratorOrIOError
stdandeasy-functionsThe error type describing either a generator error or an I/O error. - An enumeration representing a cause of a fuzzy hash parse error.
- A part which (possibly) caused a fuzzy hash parse error.
- ParseErrorSide
easy-functionsThe operand (side) which caused a parse error
Traits
- The trait implementing a
FuzzyHashDataparse error.
Functions
- compare
easy-functionsCompare two fuzzy hashes. - hash_buf
easy-functionsGenerates a fuzzy hash from a given buffer. - hash_file
stdandeasy-functionsGenerates a fuzzy hash from a given file. - hash_stream
stdandeasy-functionsGenerates a fuzzy hash from a given reader stream.
Type Definitions
- Regular (truncated) dual fuzzy hash type which contains both normalized and raw contents.
- Regular (truncated) normalized fuzzy hash type.
- Long (non-truncated) dual fuzzy hash type which contains both normalized and raw contents.
- Long (non-truncated) normalized fuzzy hash type.
- Long (non-truncated) raw fuzzy hash type.
- Regular (truncated) raw fuzzy hash type.