FuzzyHashCompareTarget

Struct FuzzyHashCompareTarget 

Source
pub struct FuzzyHashCompareTarget { /* private fields */ }
Expand description

An efficient position array-based fuzzy hash comparison target.

It can be built from a normalized FuzzyHashData object and represents the normalized contents of two block hashes as two position arrays.

Although that this structure is large, it is particularly useful if you compare many of fuzzy hashes and you can fix one of the operands (this is usually over 10 times faster than batched fuzzy_compare calls in ssdeep 2.13). Even if we generate this object each time we compare two fuzzy hashes, it’s usually faster than fuzzy_compare in ssdeep 2.13.

In fact, if you just compare two fuzzy hashes in this crate, a temporary FuzzyHashCompareTarget object containing two block hashes or corresponding half-baked object containing one block hash, will be created from either side of the comparison.

See also:

§Examples

// Requires the global allocator to use `Vec` (default on std).
use ssdeep::{FuzzyHash, FuzzyHashCompareTarget};

// Brute force comparison
let hashes: Vec<FuzzyHash> = Vec::new();
/* ... add fuzzy hashes to `hashes` ... */

let mut target: FuzzyHashCompareTarget = FuzzyHashCompareTarget::new();
for hash1 in &hashes {
    target.init_from(hash1);
    for hash2 in &hashes {
        let score = target.compare(hash2);
        /* ... */
    }
}

Implementations§

Source§

impl FuzzyHashCompareTarget

Source

pub const LOG_BLOCK_SIZE_CAPPING_BORDER: u8 = 4u8

The lower bound (inclusive) of the base-2 logarithm form of the block size in which the score capping is no longer required.

If log_block_size is equal to or larger than this value and len1 and len2 are at least block_hash::MIN_LCS_FOR_COMPARISON in size, Self::score_cap_on_block_hash_comparison(log_block_size, len1, len2) is guaranteed to be 100 or greater.

The score “cap” is computed as (1 << log_block_size) * min(len1, len2). If this always guaranteed to be 100 or greater, capping the score is not longer required.

See also: “Fuzzy Hash Comparison” section of FuzzyHashData

§Backgrounds
§Theorem

For all positive integers a, b and c, a <= b * c iff (a + b - 1) / b <= c (where ceil(a/b) == (a + b - 1) / b).

This is proven by Z3 and Coq in the source code:

  • Z3 + Python:
    dev/prover/compare/blocksize_capping_theorem.py
  • Coq:
    dev/prover/compare/blocksize_capping_theorem.v
§The Minimum Score Cap

This is expressed as (1 << log_block_size) * MIN_LCS_FOR_COMPARISON because both block hashes must at least as long as block_hash::MIN_LCS_FOR_COMPARISON to perform edit distance-based scoring.

§Computing the Constant

Applying the theorem above, 100 <= (1 << log_block_size) * MIN_LCS_FOR_COMPARISON is equivalent to (100 + MIN_LCS_FOR_COMPARISON - 1) / MIN_LCS_FOR_COMPARISON <= (1 << log_block_size).

This leads to the expression to define this constant.

Source

pub fn new() -> Self

Creates a new FuzzyHashCompareTarget object with empty contents.

This is equivalent to the fuzzy hash string 3::.

Source

pub fn log_block_size(&self) -> u8

The base-2 logarithm form of the comparison target’s block size.

See also: “Block Size” section of FuzzyHashData

Source

pub fn block_size(&self) -> u32

The block size of the comparison target.

Source

pub fn block_hash_1( &self, ) -> impl '_ + BlockHashPositionArrayImpl + BlockHashPositionArrayImplUnchecked

Position array-based representation of the block hash 1.

This method provides raw access to the internal efficient block hash representation and fast bit-parallel string functions.

You are not recommended to use this unless you know the internal details deeply.

The result has the same lifetime as this object and implements following traits:

Source

pub fn block_hash_2( &self, ) -> impl '_ + BlockHashPositionArrayImpl + BlockHashPositionArrayImplUnchecked

Position array-based representation of the block hash 2.

See also: block_hash_1()

Source

pub fn full_eq(&self, other: &Self) -> bool

Performs full equality checking of the internal structure.

This type intentionally lacks the implementation of PartialEq because of its large size. However, there’s a case where we need to compare two comparison targets.

The primary purpose of this is debugging and it compares all internal members inside the structure (just like auto-generated PartialEq::eq()).

Note that, despite that it is only relevant to users when the unchecked feature is enabled but made public without any features because this method is not unsafe or unchecked in any way.

Source

pub fn init_from<const S1: usize, const S2: usize>( &mut self, hash: impl AsRef<FuzzyHashData<S1, S2, true>>, )

Initialize the object from a given fuzzy hash.

Source

pub fn is_equiv<const S1: usize, const S2: usize>( &self, hash: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Compare whether two fuzzy hashes are equivalent.

Source

pub unsafe fn raw_score_by_edit_distance_unchecked( len_block_hash_lhs: u8, len_block_hash_rhs: u8, edit_distance: u32, ) -> u32

Available on crate feature unchecked only.

Returns the raw score (without capping) based on the lengths of block hashes and the edit distance between them.

This method assumes that following constraints are satisfied.

§Compatibility Note

The types of len_block_hash_lhs and len_block_hash_rhs were changed from u32 to u8 on the version 0.3.

§Safety
  • Both len_block_hash_lhs and len_block_hash_rhs must satisfy:

  • edit_distance must be equal to or less than len_block_hash_lhs + len_block_hash_rhs - 2 * MIN_LCS_FOR_COMPARISON.

    This constraint comes from the constraints of the lengths and the fact that we shall have a common substring of the length MIN_LCS_FOR_COMPARISON to perform an edit distance-based comparison (reducing maximum possible edit distance by 2 * MIN_LCS_FOR_COMPARISON).

If they are not satisfied, it will at least return a meaningless score and will cause a division by zero on the worst case.

Source

pub fn raw_score_by_edit_distance( len_block_hash_lhs: u8, len_block_hash_rhs: u8, edit_distance: u32, ) -> u32

Returns the raw score (without capping) based on the lengths of block hashes and the edit distance between them.

This method scales the edit distance to the 0..=100 score familiar to humans (100 means a perfect match, smaller the score, lower the similarity).

Note that it doesn’t perform any score capping (that should be performed on smaller block sizes).

See also: “Fuzzy Hash Comparison” section of FuzzyHashData

§Compatibility Note

The types of len_block_hash_lhs and len_block_hash_rhs were changed from u32 to u8 on the version 0.3.

§Usage Constraints
  • Both len_block_hash_lhs and len_block_hash_rhs must satisfy:

  • edit_distance must be equal to or less than len_block_hash_lhs + len_block_hash_rhs - 2 * MIN_LCS_FOR_COMPARISON.

    This constraint comes from the constraints of the lengths and the fact that we shall have a common substring of the length MIN_LCS_FOR_COMPARISON to perform an edit distance-based comparison (reducing maximum possible edit distance by 2 * MIN_LCS_FOR_COMPARISON).

§Useful Property

If all arguments are valid, the return value (the raw score) is guaranteed to be greater than zero. Along with the property of the score capping, it means that we should have a non-zero score if we can perform an edit distance-based comparison.

Source

pub unsafe fn score_cap_on_block_hash_comparison_unchecked( log_block_size: u8, len_block_hash_lhs: u8, len_block_hash_rhs: u8, ) -> u32

Available on crate feature unchecked only.

Returns the “score cap” for a given block size and two block hash lengths, assuming that block size and block hash lengths are small enough so that no arithmetic overflow will occur.

§Safety

Otherwise, it may cause an arithmetic overflow and return an useless value.

Source

pub fn score_cap_on_block_hash_comparison( log_block_size: u8, len_block_hash_lhs: u8, len_block_hash_rhs: u8, ) -> u32

Returns the “score cap” for a given block size and two block hash lengths.

The internal block hash comparison method “caps” the score to prevent exaggerating the matches that are not meaningful enough. This behavior depends on the block size (the “cap” gets higher as the block size gets higher) and the minimum of block hash lengths.

The result is not always guaranteed to be in 0..=100 but 100 or higher means that we don’t need any score capping.

If at least one of the arguments len_block_hash_lhs and len_block_hash_rhs are less than block_hash::MIN_LCS_FOR_COMPARISON, the result is implementation-defined.

If all arguments are valid and log_block_size is large enough, 100 or greater will be returned, meaning that the score capping is no longer required.

See also: “Fuzzy Hash Comparison” section of FuzzyHashData

§Compatibility Note

While this method is completely safe even if semantically-invalid parameters are specified (due to arithmetic properties of internal computation and a safety measure in this method), following semantic constraints may be added on the future versions:

If at least one of the arguments len_block_hash_lhs and len_block_hash_rhs are less than block_hash::MIN_LCS_FOR_COMPARISON, this is semantically-invalid. We haven’t determined whether we need to reject those cases but at least implementation-defined (may cause a panic in the future).

§Useful Property

If all arguments are valid and both len_block_hash_lhs and len_block_hash_rhs are non-zero, the return value (the score cap) is guaranteed to be greater than zero. Along with the property of the raw scoring, it means that we should have a non-zero score if we can perform an edit distance-based comparison.

Source§

impl FuzzyHashCompareTarget

Source

pub fn is_valid(&self) -> bool

Performs full validity checking of the internal structure.

The primary purpose of this is debugging and it should always return true unless…

  • There is a bug in this crate, corrupting this structure,
  • A memory corruption is occurred somewhere else or
  • An unsafe function to construct this object is misused.

Because of its purpose, this method is not designed to be fast.

Note that, despite that it is only relevant to users when the unchecked feature is enabled but made public without any features because this method is not unsafe or unchecked in any way.

Source

pub unsafe fn compare_unequal_near_eq_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Available on crate feature unchecked only.

Compare two fuzzy hashes assuming both are different and their block sizes have a relation of BlockSizeRelation::NearEq.

§Safety
  • Both fuzzy hashes must be different.
  • Both fuzzy hashes (self and other) must have block size relation of BlockSizeRelation::NearEq.

If they are not satisfied, it will return a meaningless score.

Source

pub fn compare_unequal_near_eq<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Slow: Compare two fuzzy hashes assuming both are different and their block sizes have a relation of BlockSizeRelation::NearEq.

§Usage Constraints
  • Both fuzzy hashes must be different.
  • Both fuzzy hashes (self and other) must have block size relation of BlockSizeRelation::NearEq.
§Performance Consideration

This method’s performance is not good enough (because of constraint checking).

Use those instead:

Source

pub unsafe fn compare_near_eq_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Available on crate feature unchecked only.

Compare two fuzzy hashes assuming their block sizes have a relation of BlockSizeRelation::NearEq.

§Safety

If the condition above is not satisfied, it will return a meaningless score.

Source

pub fn compare_near_eq<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Compare two fuzzy hashes assuming their block sizes have a relation of BlockSizeRelation::NearEq.

§Usage Constraints
Source

pub unsafe fn compare_unequal_near_lt_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Available on crate feature unchecked only.

Compare two fuzzy hashes assuming both are different and their block sizes have a relation of BlockSizeRelation::NearLt.

§Safety

If they are not satisfied, it will return a meaningless score.

Source

pub fn compare_unequal_near_lt<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Compare two fuzzy hashes assuming both are different and their block sizes have a relation of BlockSizeRelation::NearLt.

§Usage Constraints
Source

pub unsafe fn compare_unequal_near_gt_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Available on crate feature unchecked only.

Compare two fuzzy hashes assuming both are different and their block sizes have a relation of BlockSizeRelation::NearGt.

§Safety

If they are not satisfied, it will return a meaningless score.

Source

pub fn compare_unequal_near_gt<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Compare two fuzzy hashes assuming both are different and their block sizes have a relation of BlockSizeRelation::NearGt.

§Usage Constraints
Source

pub unsafe fn compare_unequal_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Available on crate feature unchecked only.

Compare two normalized fuzzy hashes assuming both are different.

§Safety
  • Both fuzzy hashes (self and other) must be different.

If the condition above is not satisfied, it will return a meaningless score.

Source

pub fn compare_unequal<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Slow: Compare two normalized fuzzy hashes assuming both are different.

§Usage Constraints
  • Both fuzzy hashes (self and other) must be different.
§Performance Consideration

This method’s performance is not good enough (because of the constraint checking).

Use those instead:

Source

pub fn compare<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> u32

Compares two normalized fuzzy hashes.

Source

pub unsafe fn is_comparison_candidate_near_eq_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Available on crate feature unchecked only.

Tests whether other is a candidate for edit distance-based comparison assuming that their block sizes have a relation of BlockSizeRelation::NearEq.

See also: is_comparison_candidate()

§Safety

If the condition above is not satisfied, it will return a meaningless value.

Source

pub fn is_comparison_candidate_near_eq<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Tests whether other is a candidate for edit distance-based comparison assuming that their block sizes have a relation of BlockSizeRelation::NearEq.

See also: is_comparison_candidate()

§Usage Constraints
Source

pub unsafe fn is_comparison_candidate_near_lt_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Available on crate feature unchecked only.

Tests whether other is a candidate for edit distance-based comparison assuming that their block sizes have a relation of BlockSizeRelation::NearLt.

See also: is_comparison_candidate()

§Safety

If the condition above is not satisfied, it will return a meaningless value.

Source

pub fn is_comparison_candidate_near_lt<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Tests whether other is a candidate for edit distance-based comparison assuming that their block sizes have a relation of BlockSizeRelation::NearLt.

See also: is_comparison_candidate()

§Usage Constraints
Source

pub unsafe fn is_comparison_candidate_near_gt_unchecked<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Available on crate feature unchecked only.

Tests whether other is a candidate for edit distance-based comparison assuming that their block sizes have a relation of BlockSizeRelation::NearGt.

See also: is_comparison_candidate()

§Safety

If the condition above is not satisfied, it will return a meaningless value.

Source

pub fn is_comparison_candidate_near_gt<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Tests whether other is a candidate for edit distance-based comparison assuming that their block sizes have a relation of BlockSizeRelation::NearGt.

See also: is_comparison_candidate()

§Usage Constraints
Source

pub fn is_comparison_candidate<const S1: usize, const S2: usize>( &self, other: impl AsRef<FuzzyHashData<S1, S2, true>>, ) -> bool

Tests whether other is a candidate for edit distance-based comparison.

If this function returns false and self and other are not equivalent, their similarity will be calculated to 0.

§Use Case (Example)

This operation is useful to divide a set of unique (normalized) fuzzy hashes into smaller distinct sets. The similarity score can be non-zero if and only if two fuzzy hashes belong to the same set.

§Safety (Warning)

This function (and its variants) can return false if self and other are equivalent (the base fuzzy hash object of self and other are the same). In this case, their similarity score is 100.

Because of this, we have to use a set of unique fuzzy hash values on the use case above to prevent false-negative matches.

See “Fuzzy Hash Comparison” section of FuzzyHashData for the reason why we need to care about those cases.

§Useful Property

If two fuzzy hashes are correctly provided and this method (or its family) returns true, the similarity score is guaranteed to be greater than zero.

This property can be used to simplify clustering since we are able to prove that the similarity score of two different fuzzy hashes is non-zero if this method (or its family) returns true (i.e. no actual comparison is required to split clusters on single-linkage clustering).

§Advanced Topic: Implementing Equivalents

While this method family can be important to preprocessing on single-linkage clustering, it can be inefficient as the number of fuzzy hash increases. On such cases, precomputing useful information to compute “comparison candidate” relations will help.

FuzzyHashData::block_hash_1_windows() and family methods provide window access to block hash windows to enable implementing functionality equivalent to this method family.

Trait Implementations§

Source§

impl Clone for FuzzyHashCompareTarget

Source§

fn clone(&self) -> FuzzyHashCompareTarget

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for FuzzyHashCompareTarget

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl Default for FuzzyHashCompareTarget

Source§

fn default() -> Self

Returns the “default value” for a type. Read more
Source§

impl<const S1: usize, const S2: usize> From<&FuzzyHashData<S1, S2, true>> for FuzzyHashCompareTarget

Source§

fn from(value: &FuzzyHashData<S1, S2, true>) -> Self

Converts to this type from the input type.
Source§

impl<const S1: usize, const S2: usize, const C1: usize, const C2: usize> From<&FuzzyHashDualData<S1, S2, C1, C2>> for FuzzyHashCompareTarget
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes, ReconstructionBlockSize<S1, C1>: ConstrainedReconstructionBlockSize, ReconstructionBlockSize<S2, C2>: ConstrainedReconstructionBlockSize,

Source§

fn from(value: &FuzzyHashDualData<S1, S2, C1, C2>) -> Self

Converts to this type from the input type.
Source§

impl<const S1: usize, const S2: usize> From<FuzzyHashData<S1, S2, true>> for FuzzyHashCompareTarget

Source§

fn from(value: FuzzyHashData<S1, S2, true>) -> Self

Converts to this type from the input type.
Source§

impl<const S1: usize, const S2: usize, const C1: usize, const C2: usize> From<FuzzyHashDualData<S1, S2, C1, C2>> for FuzzyHashCompareTarget
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes, ReconstructionBlockSize<S1, C1>: ConstrainedReconstructionBlockSize, ReconstructionBlockSize<S2, C2>: ConstrainedReconstructionBlockSize,

Source§

fn from(value: FuzzyHashDualData<S1, S2, C1, C2>) -> Self

Converts to this type from the input type.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.