Struct FuzzyHashData

#[repr(align(8))]
pub struct FuzzyHashData<const S1: usize, const S2: usize, const NORM: bool>
where
    BlockHashSize<S1>: ConstrainedBlockHashSize,
    BlockHashSize<S2>: ConstrainedBlockHashSize,
    BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,
{ /* private fields */ }

An efficient fixed-size fuzzy hash representation.

§Fuzzy Hash Internals

A fuzzy hash consists of four parts:

1. Block size (reciprocal of the average piece-splitting probability per byte on block hash 1).
2. Block hash 1. A 6-bit hash (a character of the block hash alphabet) per “piece”, variable-length up to block_hash::FULL_SIZE.
   The average piece-splitting probability is given as 1/block_size.
3. Block hash 2. A 6-bit hash (a character of the block hash alphabet) per “piece”, variable-length up to either
   - block_hash::HALF_SIZE (truncated / short / regular) or
   - block_hash::FULL_SIZE (non-truncated / long).
   The average piece-splitting probability is given as 1/block_size/2.
4. (optional) The input file name, which is ignored by the parser on this type.

This struct stores the first three parts of a fuzzy hash.

The following figure shows an example:

196608:DfiQF5UWAC2qctjBemsqz7yHlHr4bMCE2J8Y:jBp/Fqz7mlHZCE2J8Y,"/usr/local/bin/rustc"
\____/|\__________________________________/|\________________/|\____________________/
 |    |            Block hash 1            |   Block hash 2   | File name (optional)
 |    |                                    |                  |
 |    +-- (sep:colon)                      +-- (sep:colon)    +-- (sep,comma (optional))
 |
 +-- Block size
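
The layout in the figure can be taken apart with plain string operations. The following is a minimal sketch using only the standard library, not this crate's parser; the `Parts` struct and `split_fuzzy_hash` function are hypothetical names introduced for illustration, and no validation of the block size or the block hash alphabet is attempted.

```rust
/// The four parts of a fuzzy hash string (hypothetical helper type,
/// not part of this crate).
#[derive(Debug, PartialEq)]
pub struct Parts<'a> {
    pub block_size: u32,
    pub block_hash_1: &'a str,
    pub block_hash_2: &'a str,
    pub file_name: Option<&'a str>,
}

/// Splits `size:hash1:hash2[,name]` into its parts.
/// Returns `None` if a colon separator is missing or the block size
/// is not a decimal `u32`.
pub fn split_fuzzy_hash(s: &str) -> Option<Parts<'_>> {
    let (size, rest) = s.split_once(':')?;
    let (block_hash_1, rest) = rest.split_once(':')?;
    // Block hash characters never include a comma, so the first comma
    // (if any) starts the optional file name part.
    let (block_hash_2, file_name) = match rest.split_once(',') {
        Some((bh2, name)) => (bh2, Some(name)),
        None => (rest, None),
    };
    Some(Parts {
        block_size: size.parse::<u32>().ok()?,
        block_hash_1,
        block_hash_2,
        file_name,
    })
}
```

The file name is returned as written (including any surrounding quotes), since this type ignores it anyway.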

§Block Size

In the example above, 1 / 196 608 is the average probability for piece-splitting per byte on the block hash 1. On the block hash 2, the probability is 1 / 393 216 per byte, half of the probability on the block hash 1.

Since ssdeep uses a 32-bit hash function to decide whether to perform piece-splitting, this probability becomes inaccurate as the block size gets larger.

There is an important property of the block size: all valid block sizes can be represented as block_size::MIN * 2ⁿ (a power of two where n ≧ 0).

In this crate, the block size is stored as n (the base-2 logarithm form of the block size) for higher efficiency. The log_block_size() method returns this raw representation. If you need the actual block size as used in the string representation, use block_size() instead.
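
The relation between the two forms can be sketched as follows. This is an illustration only, not the crate's implementation: `MIN_BLOCK_SIZE = 3` is an assumption inferred from the examples in this document (6144 = 3 * 2¹¹), the function names are hypothetical, and any upper bound the crate places on n is not modeled.

```rust
/// Assumed value of `block_size::MIN`, inferred from the examples
/// in this document (196608 = 3 * 2^16, 6144 = 3 * 2^11).
const MIN_BLOCK_SIZE: u32 = 3;

/// Converts the base-2 logarithm form `n` into the block size `MIN * 2^n`.
/// Returns `None` if the result does not fit in `u32`.
pub fn block_size_from_log(n: u8) -> Option<u32> {
    1u32.checked_shl(u32::from(n))?.checked_mul(MIN_BLOCK_SIZE)
}

/// Converts a block size into its base-2 logarithm form.
/// Returns `None` if the value is not of the form `MIN * 2^n`.
pub fn log_from_block_size(block_size: u32) -> Option<u8> {
    if block_size == 0 || block_size % MIN_BLOCK_SIZE != 0 {
        return None;
    }
    let quotient = block_size / MIN_BLOCK_SIZE;
    // A power of two has exactly one bit set; its exponent equals
    // the number of trailing zero bits.
    if quotient.is_power_of_two() {
        Some(quotient.trailing_zeros() as u8)
    } else {
        None
    }
}
```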

§Block Hashes

A fuzzy hash has two block hashes (1 and 2).

They are variable-length fields that store an array of 6-bit “piece” hash values (represented as Base64 characters in the string representation and internally stored as Base64 indices).
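
As a rough illustration of the two representations, the sketch below converts between Base64 characters and 6-bit indices. It assumes the standard Base64 alphabet (`A-Z`, `a-z`, `0-9`, `+`, `/`); the function names are hypothetical and are not part of this crate.

```rust
/// The standard Base64 alphabet (assumed to be the block hash alphabet).
const BASE64_ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Converts the string form of a block hash into 6-bit indices (0..=63).
/// Returns `None` if any character is outside the alphabet.
pub fn to_indices(block_hash: &str) -> Option<Vec<u8>> {
    block_hash
        .bytes()
        .map(|ch| {
            BASE64_ALPHABET
                .iter()
                .position(|&c| c == ch)
                .map(|i| i as u8)
        })
        .collect()
}

/// Converts 6-bit indices back into the string form.
/// Returns `None` if any index is 64 or greater.
pub fn to_string_form(indices: &[u8]) -> Option<String> {
    indices
        .iter()
        .map(|&i| BASE64_ALPHABET.get(usize::from(i)).map(|&c| char::from(c)))
        .collect()
}
```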

§Relations with Block Size

A fuzzy hash has two block hashes so that fuzzy hashes with similar (not too distant) block sizes can be compared.

In principle, we can directly compare only block hashes with the same effective block size. Consider the following fuzzy hash, for example:

6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
     \____________________________________________________/ \_______________________/
      Block hash 1                                                      Block hash 2
      (effective block size: 6144)                      (effective block size: 12288)
                                                               [*] 12288 == 6144 * 2

You can easily compare it with another fuzzy hash with the same block size (but the actual block hash similarity scoring only occurs after checking for a common substring).

Unaligned:
[A] 6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
[B] 6144:SAsMYod+X3oI+YEWnnsMYod+X3oI+Y5sMYod+X3oI+YLsMYod+X3oI+YQ:H5d+X36WnL5d+X3v5d+X315d+X3+

Aligned:
[A] 6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ   :Z5d+X395d+X3X5d+X315d+X3+
[B] 6144:SAsMYod+X3oI+YEWnnsMYod+X3oI+Y5sMYod+X3oI+YLsMYod+X3oI+YQ:H5d+X36WnL5d+X3v5d+X315d+X3+
         \_______________________________________________________/ \__________________________/
                               Comparison 1                                Comparison 2
                      (score([A1], [B1], 6144) = 94)            (score([A2], [B2], 12288) = 85)

score_final([A], [B], 6144) = max(94, 85) = 94

The final similarity score is the maximum of two block hash comparisons (note that the score will be capped on small effective block sizes to prevent exaggeration of matches).

If two fuzzy hashes have different block sizes that are near enough, we can still perform a block hash comparison.

Unaligned:
[A] 3072:S+IiyfkMY+BES09JXAnyrZalI+YuyfkMY+BES09JXAnyrZalI+YQ:S+InsMYod+X3oI+YLsMYod+X3oI+YQ
[B] 6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
[C] 12288:Z5d+X3pz5d+X3985d+X3X5d+X315d+X3+:1+Jr+d++H+5+e

Aligned:
[A] 3072 :S+IiyfkMY+BES09JXAnyrZalI+YuyfkMY+BES09JXAnyrZalI+YQ:S+InsMYod+X3oI+YLsMYod+X3oI+YQ
[B] 6144 :                                                     SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
[C] 12288:                                                                                                            Z5d+X3pz5d+X3985d+X3X5d+X315d+X3+:1+Jr+d++H+5+e
          \__________________________________________________/ \____________________________________________________/ \_______________________________/ \___________/
           Eff.B.S.=3072                                        Eff.B.S.=6144                                          Eff.B.S.=12288                    Eff.B.S.=24576
                                                                Comparison between [A2] and [B1]                       Comparison between [B2] and [C1]
                                                                (score([A2], [B1], 6144) = 72)                         (score([B2], [C1], 12288) = 88)

score_final([A], [B], 3072) = score([A2], [B1],  6144) = 72
score_final([B], [C], 6144) = score([B2], [C1], 12288) = 88
score_final([A], [C], 3072) = 0 (since there are no block hashes to compare)

Such cases are handled with BlockSizeRelation and block_size utility functions. We can outline the relation in the table below. Note that each (effective) block size is denoted as “Actual raw block size (block size in base-2 logarithm)”.

Left (`lhs`)	Right (`rhs`)	Relation
3072 (10)	6144 (11)	`NearLt`
6144 (11)	3072 (10)	`NearGt`
6144 (11)	6144 (11)	`NearEq`
6144 (11)	12288 (12)	`NearLt`
12288 (12)	6144 (11)	`NearGt`
3072 (10)	12288 (12)	`Far`
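
The table can be reproduced from the base-2 logarithm form alone. The sketch below is an illustration under that assumption; the `Relation` enum is a hypothetical stand-in that only borrows the variant names shown in the table, and it is not the crate's BlockSizeRelation type.

```rust
/// Hypothetical stand-in for the relation between two block sizes,
/// using the variant names from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Relation {
    Far,
    NearLt,
    NearEq,
    NearGt,
}

/// Classifies two block sizes given in base-2 logarithm form.
pub fn relation(lhs_log: u8, rhs_log: u8) -> Relation {
    // Widen to i16 so the subtraction cannot overflow.
    match i16::from(rhs_log) - i16::from(lhs_log) {
        // Same block size: compare hash 1 with hash 1 and hash 2 with hash 2.
        0 => Relation::NearEq,
        // rhs is twice as large: lhs block hash 2 pairs with rhs block hash 1.
        1 => Relation::NearLt,
        // lhs is twice as large: lhs block hash 1 pairs with rhs block hash 2.
        -1 => Relation::NearGt,
        // No pair of block hashes shares an effective block size.
        _ => Relation::Far,
    }
}
```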

In highly optimized clustering applications, being aware of the block size relation is crucial.

§Normalization

To prevent exaggerating the comparison score from repeating patterns, ssdeep processes each block hash before comparison so that a sequence consisting of the same character longer than block_hash::MAX_SEQUENCE_SIZE cannot exist.

For instance, after processing a block hash 122333444455555 before comparison, it is converted to 122333444555 (four 4s and five 5s are shortened into three 4s and three 5s because block_hash::MAX_SEQUENCE_SIZE is defined to be three (3)).

In this crate, this process is called normalization.
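
The rule described above can be sketched on the string form of a block hash. This is a minimal illustration, not the crate's implementation (which works on the internal index representation); the local `MAX_SEQUENCE_SIZE` constant mirrors the value three (3) stated above, and the function name is hypothetical.

```rust
/// Mirrors `block_hash::MAX_SEQUENCE_SIZE` as described above.
const MAX_SEQUENCE_SIZE: usize = 3;

/// Shortens every run of identical characters to at most
/// `MAX_SEQUENCE_SIZE` characters.
pub fn normalize_block_hash(block_hash: &str) -> String {
    let mut out = String::with_capacity(block_hash.len());
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for ch in block_hash.chars() {
        if prev == Some(ch) {
            run += 1;
        } else {
            prev = Some(ch);
            run = 1;
        }
        // Keep only the first MAX_SEQUENCE_SIZE characters of each run.
        if run <= MAX_SEQUENCE_SIZE {
            out.push(ch);
        }
    }
    out
}
```

Applying the function to an already normalized block hash returns it unchanged, so the operation is idempotent.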

ssdeep (as well as Generator) normally generates raw fuzzy hashes that are not normalized. So, making a distinction between the normalized and raw forms is important.

§The Strict Parser

If the strict-parser feature is enabled, parsers for fuzzy hash types reject strings that would cause an error on the raw variant but would be accepted by the normalized variant under the default parser (i.e. when this feature is disabled).

Enabling this feature comes at a cost in performance, but it makes the parser less confusing (if either variant accepts a string, the other does too).

§Truncation

ssdeep (as well as Generator) normally generates truncated fuzzy hashes. In a truncated fuzzy hash, the length of block hash 2 is limited to block_hash::HALF_SIZE, half of the maximum length of block hash 1 (block_hash::FULL_SIZE).

While libfuzzy allows generating non-truncated, long fuzzy hashes, they are typically useless. So, most operations are performed on short, truncated fuzzy hashes by default. Short variants of FuzzyHashData are smaller than long variants, so they can be used to reduce the memory footprint.

§Warning: Truncation is not just “Truncation”

Truncated (regular) fuzzy hashes are not literally “truncated” from the long, non-truncated fuzzy hashes (but individually generated).

For instance (/usr/libexec/geoclue on Ubuntu 23.10):

                                                                                                                                  v
Non-truncated (long):        6144:M5/qVhAWFfzlpxdJ/YQINNbZ2cQpn77+Ptn+7ADOeb8Gj+OK8o4u1TzxwBf71C3O:M5/qzAWFfzlpxdJ/YQINNbZ2cQpn77+Ptn+7ADOeb8Gj+OK8o4u1TzxwBf71ETfJ
Truncated (short / regular): 6144:M5/qVhAWFfzlpxdJ/YQINNbZ2cQpn77+Ptn+7ADOeb8Gj+OK8o4u1TzxwBf71C3O:M5/qzAWFfzlpxdJ/YQINNbZ2cQpn77+i

Note that the 32nd characters of block hash 2 differ (P and i). This is because the last character of a block hash may encode the input that follows all the other, individually stored pieces.

§Fuzzy Hash Comparison

For the basic concept of the comparison, see the “Relations with Block Size” section.

In this section, we describe the full comparison algorithm.

1. If two normalized hashes A and B are completely the same, the similarity score is 100 (a perfect match) no matter what.

   This case is not subject to the edit distance-based scoring. For instance, FuzzyHashCompareTarget::is_comparison_candidate() may return false in such cases.

   So, this case must be handled separately.
2. For each block hash pair (in which the effective block sizes match), compute the sub-similarity score (between bhA and bhB) as follows:
   1. Search for a common substring of length block_hash::MIN_LCS_FOR_COMPARISON or longer.

      If no such substring is found, the sub-similarity score is 0 and no edit distance-based scoring is performed.

      Note: if one is found (i.e. edit distance-based comparison can be performed), the sub-similarity score (and the final score) is guaranteed to be greater than zero. That means we won’t need to split a cluster (in single-linkage clustering) if all unique elements in the cluster are directly or indirectly connected by “candidate of edit distance-based comparison” relations.
   2. Compute the edit distance between the two block hashes and scale it
      - from 0..=(bhA.len()+bhB.len()) (0 is the perfect match)
      - to 0..=100 (100 is the perfect match).

      Note: this scaling takes multiple steps (for historical reasons); see the source code for the exact behavior (including rounding).
   3. For small effective block sizes, cap the score to prevent exaggerating the matches.
3. Take the maximum of the sub-similarity scores (0 if there are no sub-similarity scores, i.e. the block sizes are far).
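
Two parts of this procedure are simple enough to sketch without the crate: the common substring search and the final maximum. The code below is an illustration only. It assumes that block_hash::MIN_LCS_FOR_COMPARISON is 7 (the value is not stated in this section), uses hypothetical function names, and leaves out the edit distance scaling and the score cap, whose exact behavior is defined by the source code.

```rust
use std::collections::HashSet;

/// Assumed value of `block_hash::MIN_LCS_FOR_COMPARISON`
/// (not stated in this section).
const MIN_LCS_FOR_COMPARISON: usize = 7;

/// Returns true if `a` and `b` share a substring of at least
/// `MIN_LCS_FOR_COMPARISON` bytes. Any longer common substring contains
/// one of exactly that length, so fixed-size windows are sufficient.
pub fn has_common_substring(a: &[u8], b: &[u8]) -> bool {
    // `windows` yields nothing when the slice is shorter than the window.
    let seen: HashSet<&[u8]> = a.windows(MIN_LCS_FOR_COMPARISON).collect();
    b.windows(MIN_LCS_FOR_COMPARISON).any(|w| seen.contains(w))
}

/// Final score: the maximum of the sub-similarity scores,
/// or 0 if no block hash pair could be compared.
pub fn final_score(sub_scores: &[u32]) -> u32 {
    sub_scores.iter().copied().max().unwrap_or(0)
}
```

With the scores from the “Relations with Block Size” section, `final_score(&[94, 85])` gives 94 and an empty slice (far block sizes) gives 0.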

For actual comparison, a FuzzyHashCompareTarget object or corresponding half-baked object is used.

Implementations§

impl<const S1: usize, const S2: usize> FuzzyHashData<S1, S2, true>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

pub fn compare(&self, other: impl AsRef<Self>) -> u32

pub unsafe fn compare_unequal_unchecked(&self, other: impl AsRef<Self>) -> u32

§Safety

pub fn compare_unequal(&self, other: impl AsRef<Self>) -> u32

§Usage Constraints

§Performance Consideration

impl<const S1: usize, const S2: usize, const NORM: bool> FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

pub const MAX_BLOCK_HASH_SIZE_1: usize = S1

pub const MAX_BLOCK_HASH_SIZE_2: usize = S2

pub const IS_NORMALIZED_FORM: bool = NORM

pub const IS_LONG_FORM: bool

pub const MAX_LEN_IN_STR: usize

pub fn new() -> Self

pub unsafe fn init_from_internals_raw_unchecked( &mut self, log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8, )

§Safety

pub fn init_from_internals_raw( &mut self, log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8, )

§Usage Constraints

pub unsafe fn new_from_internals_raw_unchecked( log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8, ) -> Self

§Safety

pub fn new_from_internals_raw( log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8, ) -> Self

§Usage Constraints

pub unsafe fn new_from_internals_near_raw_unchecked( log_block_size: u8, block_hash_1: &[u8], block_hash_2: &[u8], ) -> Self

§Safety

pub fn new_from_internals_near_raw( log_block_size: u8, block_hash_1: &[u8], block_hash_2: &[u8], ) -> Self

§Usage Constraints

pub unsafe fn new_from_internals_unchecked( block_size: u32, block_hash_1: &[u8], block_hash_2: &[u8], ) -> Self

§Safety

pub fn new_from_internals( block_size: u32, block_hash_1: &[u8], block_hash_2: &[u8], ) -> Self

§Usage Constraints

pub fn log_block_size(&self) -> u8

pub fn block_size(&self) -> u32

pub fn block_hash_1(&self) -> &[u8]

§Safety

pub fn block_hash_1_as_array(&self) -> &[u8; S1]

pub fn block_hash_1_len(&self) -> usize

pub fn block_hash_2(&self) -> &[u8]

§Safety

pub fn block_hash_2_as_array(&self) -> &[u8; S2]

pub fn block_hash_2_len(&self) -> usize

pub fn len_in_str(&self) -> usize

pub fn to_string(&self) -> String

pub fn store_into_bytes( &self, buffer: &mut [u8], ) -> Result<usize, FuzzyHashOperationError>

§Compatibility Note

pub fn from_bytes_with_last_index( str: &[u8], index: &mut usize, ) -> Result<Self, ParseError>

pub fn from_bytes(str: &[u8]) -> Result<Self, ParseError>

pub fn is_normalized(&self) -> bool

pub fn normalize_in_place(&mut self)

pub fn normalize(&self) -> FuzzyHashData<S1, S2, true>

pub fn clone_normalized(&self) -> Self

pub fn is_valid(&self) -> bool

§Safety: No Panic Guarantee

pub fn full_eq(&self, other: &Self) -> bool

§Safety: No Panic Guarantee

pub fn compare_block_sizes( lhs: impl AsRef<Self>, rhs: impl AsRef<Self>, ) -> BlockSizeRelation

pub fn is_block_sizes_near(lhs: impl AsRef<Self>, rhs: impl AsRef<Self>) -> bool

§Compatibility Notice

pub fn is_block_sizes_near_eq( lhs: impl AsRef<Self>, rhs: impl AsRef<Self>, ) -> bool

§Compatibility Notice

pub fn is_block_sizes_near_lt( lhs: impl AsRef<Self>, rhs: impl AsRef<Self>, ) -> bool

§Compatibility Notice

pub fn is_block_sizes_near_gt( lhs: impl AsRef<Self>, rhs: impl AsRef<Self>, ) -> bool

§Compatibility Notice

pub fn cmp_by_block_size(&self, other: &Self) -> Ordering

impl<const S1: usize, const S2: usize> FuzzyHashData<S1, S2, true>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

pub fn block_hash_1_windows(&self) -> Windows<'_, u8>

§Example (pseudo code)

pub fn block_hash_1_numeric_windows(&self) -> NumericWindows<'_>

pub fn block_hash_1_index_windows(&self) -> IndexWindows<'_>

impl<const S1: usize, const S2: usize> FuzzyHashData<S1, S2, false>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> AsRef<FuzzyHashData<S1, S2, NORM>> for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> Clone for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> Debug for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> Default for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> Display for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize> From<&FuzzyHashData<S1, S2, true>> for FuzzyHashCompareTarget
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> From<FuzzyHashData<S1, S2, NORM>> for String
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize> From<FuzzyHashData<S1, S2, false>> for FuzzyHashData<S1, S2, true>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize> From<FuzzyHashData<S1, S2, true>> for FuzzyHashCompareTarget
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize> From<FuzzyHashData<S1, S2, true>> for FuzzyHashData<S1, S2, false>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> FromStr for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

impl<const S1: usize, const S2: usize, const NORM: bool> Hash for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn hash_slice<H>(data: &[Self], state: &mut H)
where H: Hasher, Self: Sized,

impl<const S1: usize, const S2: usize, const NORM: bool> Ord for FuzzyHashData<S1, S2, NORM>
where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn max(self, other: Self) -> Self
where Self: Sized,