Struct ssdeep::FuzzyHashData


#[repr(align(8))]
pub struct FuzzyHashData<const S1: usize, const S2: usize, const NORM: bool>
where
    BlockHashSize<S1>: ConstrainedBlockHashSize,
    BlockHashSize<S2>: ConstrainedBlockHashSize,
    BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,
{ /* private fields */ }


An efficient fixed size fuzzy hash representation.

Fuzzy Hash Internals

A fuzzy hash consists of four parts:

1. Block size (reciprocal of the average piece-splitting probability per byte).
2. Block hash 1: a 6-bit hash per “piece”, variable-length up to BlockHash::FULL_SIZE.
   The average piece-splitting probability is given as 1/block_size.
3. Block hash 2: a 6-bit hash per “piece”, variable-length up to either
   - BlockHash::HALF_SIZE (truncated / short / regular) or
   - BlockHash::FULL_SIZE (non-truncated / long).
   The average piece-splitting probability is given as 1/(block_size*2).
4. (Optional) The input file name, which is ignored by the parser on this type.

This struct stores the first three parts of a fuzzy hash.

The following figure shows an example:

196608:DfiQF5UWAC2qctjBemsqz7yHlHr4bMCE2J8Y:jBp/Fqz7mlHZCE2J8Y,"/usr/local/bin/rustc"
\____/|\__________________________________/|\________________/|\____________________/
 |    |            Block hash 1            |   Block hash 2   | File name (optional)
 |    |                                    |                  |
 |    +-- (sep:colon)                      +-- (sep:colon)    +-- (sep,comma (optional))
 |
 +-- Block size

In the example above, 1 / 196 608 is the average probability for piece-splitting per byte on the block hash 1. On the block hash 2, the probability is 1 / 393 216 per byte, half of the probability on the block hash 1.

Since ssdeep uses a 32-bit hash function to decide whether to split a piece, this probability becomes less accurate as the block size gets larger.
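The string layout in the figure above can be illustrated with a small standalone splitter. This is only a sketch of the layout, not the parser used by this crate: the `Parts` struct and `split_fuzzy_hash` function are hypothetical names, and no validation of the block size or of the Base64 alphabet is performed.

```rust
/// Hypothetical container for the four textual parts of a fuzzy hash.
#[derive(Debug, PartialEq, Eq)]
struct Parts<'a> {
    block_size: u32,
    block_hash_1: &'a str,
    block_hash_2: &'a str,
    /// Everything after the first comma, kept verbatim (quotes included).
    file_name: Option<&'a str>,
}

/// Splits `block_size:block_hash_1:block_hash_2[,file_name]`.
/// Returns `None` if a colon is missing or the block size is not a number.
fn split_fuzzy_hash(s: &str) -> Option<Parts<'_>> {
    let (block_size, rest) = s.split_once(':')?;
    let (block_hash_1, rest) = rest.split_once(':')?;
    // The comma and the file name are optional.
    let (block_hash_2, file_name) = match rest.split_once(',') {
        Some((hash, name)) => (hash, Some(name)),
        None => (rest, None),
    };
    Some(Parts {
        block_size: block_size.parse().ok()?,
        block_hash_1,
        block_hash_2,
        file_name,
    })
}
```

Because the Base64 alphabet contains neither a colon nor a comma, splitting on the first occurrences of these separators is enough for well-formed input.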

There is an important property of the block size: all valid block sizes can be represented as BlockSize::MIN * 2ⁿ (n ≧ 0).

In this crate, the block size is stored as n (the base-2 logarithm form of the block size) for higher efficiency. The log_block_size() method returns this raw representation. If you need the actual block size as used in the string representation, use block_size() instead.
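The relation between the two forms can be sketched as follows. This is a standalone illustration rather than the crate's implementation: the function names are hypothetical, and the minimum block size is assumed to be 3, which is consistent with the pairing "3072 (10)" used later on this page (3072 = 3 * 2¹⁰).

```rust
/// Assumed value of the minimum block size (3072 = 3 * 2^10).
const MIN_BLOCK_SIZE: u32 = 3;

/// Converts the base-2 logarithm form `n` into the actual block size
/// `MIN_BLOCK_SIZE * 2^n`, or `None` if the result does not fit in `u32`.
fn block_size_from_log(log_block_size: u8) -> Option<u32> {
    1u32.checked_shl(u32::from(log_block_size))?
        .checked_mul(MIN_BLOCK_SIZE)
}

/// Converts an actual block size back into its logarithm form,
/// or `None` if it is not of the form `MIN_BLOCK_SIZE * 2^n`.
fn log_from_block_size(block_size: u32) -> Option<u8> {
    if block_size % MIN_BLOCK_SIZE != 0 {
        return None;
    }
    let quotient = block_size / MIN_BLOCK_SIZE;
    if quotient.is_power_of_two() {
        // For a power of two, the exponent equals the number of trailing zeros.
        Some(quotient.trailing_zeros() as u8)
    } else {
        None
    }
}
```

Under this assumption, the block size 196608 from the first figure corresponds to n = 16.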

Block Hashes

A fuzzy hash has two block hashes (1 and 2).

They are variable-length fields that store an array of 6-bit “piece” hash values (represented as Base64 characters in the string representation and internally stored as Base64 indices).
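As an illustration of the two representations, the sketch below maps Base64 characters to their 6-bit indices. It assumes the standard Base64 alphabet (`A`-`Z`, `a`-`z`, `0`-`9`, `+`, `/`); the function names are hypothetical and not part of this crate.

```rust
/// The standard Base64 alphabet (assumed to be the one used here).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Returns the 6-bit index (0..=63) of a Base64 character, if valid.
fn base64_index(ch: u8) -> Option<u8> {
    ALPHABET.iter().position(|&c| c == ch).map(|i| i as u8)
}

/// Converts a block hash from its string form into a list of indices.
/// Returns `None` if any character is outside the alphabet.
fn to_indices(block_hash: &str) -> Option<Vec<u8>> {
    block_hash.bytes().map(base64_index).collect()
}
```

A linear search keeps the sketch short; a 256-entry lookup table would be the usual choice where speed matters.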

Relations with Block Size

A fuzzy hash has two block hashes so that fuzzy hashes with similar (but not too distant) block sizes can be compared.

In principle, we can only compare block hashes with the same block size directly. Consider the following fuzzy hash, for example:

6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
     \____________________________________________________/ \_______________________/
      Block hash 1                                                      Block hash 2
      (effective block size: 6144)                      (effective block size: 12288)
                                                               [*] 12288 == 6144 * 2

You can easily compare it with another fuzzy hash with the same block size (although the actual block hash similarity scoring only occurs after a common-substring check).

Unaligned:
[A] 6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
[B] 6144:SAsMYod+X3oI+YEWnnsMYod+X3oI+Y5sMYod+X3oI+YLsMYod+X3oI+YQ:H5d+X36WnL5d+X3v5d+X315d+X3+

Aligned:
[A] 6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ   :Z5d+X395d+X3X5d+X315d+X3+
[B] 6144:SAsMYod+X3oI+YEWnnsMYod+X3oI+Y5sMYod+X3oI+YLsMYod+X3oI+YQ:H5d+X36WnL5d+X3v5d+X315d+X3+
         \_______________________________________________________/ \__________________________/
                               Comparison 1                                Comparison 2
                      (score([A1], [B1], 6144) = 94)            (score([A2], [B2], 12288) = 85)

score_final([A], [B], 6144) = max(94, 85) = 94

The final similarity score is the maximum of two block hash comparisons (note that the score function on small block sizes will cap the score to prevent exaggeration of matches).

If two fuzzy hashes have different block sizes that are near enough (one is twice the other), we can still perform a block hash comparison.

Unaligned:
[A] 3072:S+IiyfkMY+BES09JXAnyrZalI+YuyfkMY+BES09JXAnyrZalI+YQ:S+InsMYod+X3oI+YLsMYod+X3oI+YQ
[B] 6144:SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
[C] 12288:Z5d+X3pz5d+X3985d+X3X5d+X315d+X3+:1+Jr+d++H+5+e

Aligned:
[A] 3072 :S+IiyfkMY+BES09JXAnyrZalI+YuyfkMY+BES09JXAnyrZalI+YQ:S+InsMYod+X3oI+YLsMYod+X3oI+YQ
[B] 6144 :                                                     SIsMYod+X3oI+YnsMYod+X3oI+YZsMYod+X3oI+YLsMYod+X3oI+YQ:Z5d+X395d+X3X5d+X315d+X3+
[C] 12288:                                                                                                            Z5d+X3pz5d+X3985d+X3X5d+X315d+X3+:1+Jr+d++H+5+e
          \__________________________________________________/ \____________________________________________________/ \_______________________________/ \___________/
           Eff.B.S.=3072                                        Eff.B.S.=6144                                          Eff.B.S.=12288                    Eff.B.S.=24576
                                                                Comparison between [A2] and [B1]                       Comparison between [B2] and [C1]
                                                                (score([A2], [B1], 6144) = 72)                         (score([B2], [C1], 12288) = 88)

score_final([A], [B], 3072) = score([A2], [B1],  6144) = 72
score_final([B], [C], 6144) = score([B2], [C1], 12288) = 88
score_final([A], [C], 3072) = 0 (since there are no block hashes to compare)
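The selection of block hashes in the examples above can be sketched as a small function over block sizes in base-2 logarithm form. This is an illustration of the pairing rule only, not this crate's comparison code; `comparable_pairs` is a hypothetical name, and each returned pair reads as (block hash number on the left, block hash number on the right).

```rust
/// Returns which block hashes (1 or 2) of the left and right fuzzy hashes
/// share an effective block size and can therefore be compared.
fn comparable_pairs(lhs_log: u8, rhs_log: u8) -> Vec<(u8, u8)> {
    // Widen to u16 so that `+ 1` cannot overflow.
    let (l, r) = (u16::from(lhs_log), u16::from(rhs_log));
    if l == r {
        // Same block size: both pairs are compared and the maximum is taken.
        vec![(1, 1), (2, 2)]
    } else if l + 1 == r {
        // Left block hash 2 and right block hash 1 share a block size.
        vec![(2, 1)]
    } else if r + 1 == l {
        // Left block hash 1 and right block hash 2 share a block size.
        vec![(1, 2)]
    } else {
        // Too far apart: nothing to compare, so the score is 0.
        Vec::new()
    }
}
```

With 3072, 6144 and 12288 written as 10, 11 and 12, the pairs for [A] and [B] are `[(2, 1)]`, and [A] and [C] yield an empty list.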

Such cases are handled with BlockSizeRelation and BlockSize utility functions. We can outline the relation in the table below. Note that each block size is denoted as “Actual block size (block size in base-2 logarithm)”.

Left (`lhs`)	Right (`rhs`)	Relation
3072 (10)	6144 (11)	`NearLt`
6144 (11)	3072 (10)	`NearGt`
6144 (11)	6144 (11)	`NearEq`
6144 (11)	12288 (12)	`NearLt`
12288 (12)	6144 (11)	`NearGt`
3072 (10)	12288 (12)	`Far`

In highly optimized clustering applications, being aware of the block size relation is crucial.
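The table above can be reproduced with a few lines of standalone code. The enum below only mirrors the variant names shown in the table; it is a hypothetical stand-in for BlockSizeRelation, and `relation` is not a function of this crate.

```rust
/// Hypothetical stand-in mirroring the relation names in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Relation {
    Far,
    NearLt,
    NearEq,
    NearGt,
}

/// Classifies two block sizes given in base-2 logarithm form.
fn relation(lhs_log: u8, rhs_log: u8) -> Relation {
    // Use a signed, wider type so the difference cannot overflow.
    match i16::from(rhs_log) - i16::from(lhs_log) {
        0 => Relation::NearEq,
        1 => Relation::NearLt,  // lhs is half of rhs
        -1 => Relation::NearGt, // lhs is twice rhs
        _ => Relation::Far,
    }
}
```

A clustering application could, for instance, sort fuzzy hashes by this logarithm form and only attempt comparisons between neighbors whose relation is not `Far`.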

Normalization

To prevent exaggerating the comparison score from repeating patterns, ssdeep processes each block hash before comparison so that a sequence consisting of the same character longer than BlockHash::MAX_SEQUENCE_SIZE cannot exist.

For instance, after processing a block hash 122333444455555, it is converted to 122333444555 (four 4s and five 5s are shortened into three 4s and three 5s because BlockHash::MAX_SEQUENCE_SIZE is defined to be three (3)).

In this crate, this process is called normalization.
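A minimal sketch of this process on a byte slice is shown below. It is a standalone illustration, not this crate's implementation: `normalize_block_hash` is a hypothetical name, and the constant is set to 3 as stated above.

```rust
/// Maximum length of a run of identical characters (three, as stated above).
const MAX_SEQUENCE_SIZE: usize = 3;

/// Returns a copy of `block_hash` in which every run of identical
/// values is shortened to at most `MAX_SEQUENCE_SIZE` elements.
fn normalize_block_hash(block_hash: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(block_hash.len());
    let mut run = 0usize; // length of the current run in the input
    for &value in block_hash {
        if out.last() == Some(&value) {
            run += 1;
        } else {
            run = 1;
        }
        // Keep only the first MAX_SEQUENCE_SIZE elements of each run.
        if run <= MAX_SEQUENCE_SIZE {
            out.push(value);
        }
    }
    out
}
```

Applied to `b"122333444455555"`, this returns the bytes of `122333444555`, matching the example above.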

ssdeep (as well as Generator) normally generates raw fuzzy hashes that are not normalized, so distinguishing between the normalized and raw forms is important.

Truncation

ssdeep (as well as Generator) normally generates truncated fuzzy hashes. In a truncated fuzzy hash, the length of block hash 2 is limited to BlockHash::HALF_SIZE, half of the maximum length of block hash 1 (BlockHash::FULL_SIZE).

While libfuzzy allows generating non-truncated, long fuzzy hashes, they are typically useless. So, most operations are performed on short, truncated fuzzy hashes by default. Short variants of FuzzyHashData are smaller than long variants, so they can be used to reduce the memory footprint.
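The length constraint that separates the short and long forms can be sketched as follows. The concrete value 64 for the full size is an assumption made for this illustration (this page only states that the half size is half of the full size), and `fits_short_form` is a hypothetical helper, not a function of this crate.

```rust
/// Assumed maximum length of block hash 1 (not stated on this page).
const FULL_SIZE: usize = 64;
/// Half of the full size, the limit for block hash 2 in the short form.
const HALF_SIZE: usize = FULL_SIZE / 2;

/// Returns true if block hashes with the given lengths fit in the short
/// (truncated) form, i.e. block hash 2 is no longer than `HALF_SIZE`.
fn fits_short_form(block_hash_1_len: usize, block_hash_2_len: usize) -> bool {
    block_hash_1_len <= FULL_SIZE && block_hash_2_len <= HALF_SIZE
}
```

A check of this kind is the reason a conversion from the long form to the short form can fail, while the opposite direction always succeeds.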


Implementations§

impl<const S1: usize, const S2: usize> FuzzyHashData<S1, S2, true> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

pub fn compare(&self, other: impl AsRef<Self>) -> u32

pub unsafe fn compare_unequal_unchecked(&self, other: impl AsRef<Self>) -> u32

pub fn compare_unequal(&self, other: impl AsRef<Self>) -> u32

impl<const S1: usize, const S2: usize, const NORM: bool> FuzzyHashData<S1, S2, NORM> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

pub const MAX_BLOCK_HASH_SIZE_1: usize = S1

pub const MAX_BLOCK_HASH_SIZE_2: usize = S2

pub const IS_NORMALIZED_FORM: bool = NORM

pub const IS_LONG_FORM: bool = _

pub fn new() -> Self

pub unsafe fn init_from_internals_raw_unchecked( &mut self, log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8 )

pub fn init_from_internals_raw( &mut self, log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8 )

pub unsafe fn new_from_internals_raw_unchecked( log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8 ) -> Self

pub fn new_from_internals_raw( log_block_size: u8, block_hash_1: &[u8; S1], block_hash_2: &[u8; S2], block_hash_1_len: u8, block_hash_2_len: u8 ) -> Self

pub unsafe fn new_from_internals_unchecked( block_size: u32, block_hash_1: &[u8], block_hash_2: &[u8] ) -> Self

pub fn new_from_internals( block_size: u32, block_hash_1: &[u8], block_hash_2: &[u8] ) -> Self

pub fn log_block_size(&self) -> u8

pub fn block_size(&self) -> u32

pub fn block_hash_1(&self) -> &[u8]

pub fn block_hash_1_as_array(&self) -> &[u8; S1]

pub fn block_hash_1_len(&self) -> usize

pub fn block_hash_2(&self) -> &[u8]

pub fn block_hash_2_as_array(&self) -> &[u8; S2]

pub fn block_hash_2_len(&self) -> usize

pub fn len_in_str(&self) -> usize

pub const MAX_LEN_IN_STR: usize = _

pub fn to_string(&self) -> String

pub fn store_into_bytes( &self, buffer: &mut [u8] ) -> Result<(), FuzzyHashOperationError>

pub fn from_bytes(str: &[u8]) -> Result<Self, ParseError>

pub fn normalize_in_place(&mut self)

pub fn is_valid(&self) -> bool

pub fn full_eq(&self, other: &Self) -> bool

pub fn compare_block_sizes( lhs: impl AsRef<Self>, rhs: impl AsRef<Self> ) -> BlockSizeRelation

pub fn is_block_sizes_near(lhs: impl AsRef<Self>, rhs: impl AsRef<Self>) -> bool

pub fn is_block_sizes_near_eq( lhs: impl AsRef<Self>, rhs: impl AsRef<Self> ) -> bool

pub fn is_block_sizes_near_lt( lhs: impl AsRef<Self>, rhs: impl AsRef<Self> ) -> bool

pub fn is_block_sizes_near_gt( lhs: impl AsRef<Self>, rhs: impl AsRef<Self> ) -> bool

pub fn cmp_by_block_size(&self, other: &Self) -> Ordering

impl<const S1: usize, const S2: usize> FuzzyHashData<S1, S2, { _ }> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

pub fn from_raw_form(source: &FuzzyHashData<S1, S2, { _ }>) -> Self

pub fn to_raw_form(&self) -> FuzzyHashData<S1, S2, { _ }>

pub fn into_mut_raw_form(&self, dest: &mut FuzzyHashData<S1, S2, { _ }>)

pub fn normalize(&self) -> FuzzyHashData<S1, S2, { _ }>

pub fn clone_normalized(&self) -> Self

pub fn is_normalized(&self) -> bool

impl<const S1: usize, const S2: usize> FuzzyHashData<S1, S2, { _ }> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

pub fn from_normalized(source: &FuzzyHashData<S1, S2, { _ }>) -> Self

pub fn normalize(&self) -> FuzzyHashData<S1, S2, { _ }>

pub fn clone_normalized(&self) -> Self

pub fn is_normalized(&self) -> bool

impl<const NORM: bool> FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::HALF_SIZE }, NORM>

pub fn to_long_form( &self ) -> FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::FULL_SIZE }, NORM>

pub fn into_mut_long_form( &self, dest: &mut FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::FULL_SIZE }, NORM> )

impl<const NORM: bool> FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::FULL_SIZE }, NORM>

pub fn from_short_form( source: &FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::HALF_SIZE }, NORM> ) -> Self

pub fn try_into_mut_short( &self, dest: &mut FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::HALF_SIZE }, NORM> ) -> Result<(), FuzzyHashOperationError>

Trait Implementations§

impl<const S1: usize, const S2: usize, const NORM: bool> AsRef<FuzzyHashData<S1, S2, NORM>> for FuzzyHashData<S1, S2, NORM> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn as_ref(&self) -> &FuzzyHashData<S1, S2, NORM>

fn as_ref(&self) -> &FuzzyHashData<S1, S2, true>

impl<const S1: usize, const S2: usize, const NORM: bool> Clone for FuzzyHashData<S1, S2, NORM> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn clone(&self) -> FuzzyHashData<S1, S2, NORM>

fn clone_from(&mut self, source: &Self)

impl<const S1: usize, const S2: usize, const NORM: bool> Debug for FuzzyHashData<S1, S2, NORM> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl<const S1: usize, const S2: usize, const NORM: bool> Default for FuzzyHashData<S1, S2, NORM> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn default() -> Self

impl<const S1: usize, const S2: usize, const NORM: bool> Display for FuzzyHashData<S1, S2, NORM> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl<const S1: usize, const S2: usize> From<&FuzzyHashData<S1, S2, true>> for FuzzyHashCompareTarget where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn from(value: &FuzzyHashData<S1, S2, true>) -> Self

impl<const NORM: bool> From<FuzzyHashData<{BlockHash::FULL_SIZE}, {BlockHash::HALF_SIZE}, NORM>> for FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::FULL_SIZE }, NORM>

fn from( value: FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::HALF_SIZE }, NORM> ) -> Self

impl From<FuzzyHashData<{BlockHash::FULL_SIZE}, {BlockHash::HALF_SIZE}, true>> for FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::FULL_SIZE }, false>

fn from( value: FuzzyHashData<{ BlockHash::FULL_SIZE }, { BlockHash::HALF_SIZE }, true> ) -> Self

impl<const S1: usize, const S2: usize, const NORM: bool> From<FuzzyHashData<S1, S2, NORM>> for String where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,

fn from(value: FuzzyHashData<S1, S2, NORM>) -> Self

impl<const S1: usize, const S2: usize> From<FuzzyHashData<S1, S2, false>> for FuzzyHashData<S1, S2, { _ }> where BlockHashSize<S1>: ConstrainedBlockHashSize, BlockHashSize<S2>: ConstrainedBlockHashSize, BlockHashSizes<S1, S2>: ConstrainedBlockHashSizes,