pub struct BlockHashPositionArray { /* private fields */ }
Expand description

A position array-based block hash except its length.

Each element of the position array indicates which positions in the corresponding block hash have the given alphabet (note that array index is of the alphabet).

For instance, if representation()[5] == 0x81, it means the block hash contains the alphabet index 5 in the positions 0 and 7 (block hash glob: E??????E*).

This is because bit 0 0x01 at index 5 means that the position 0 is the alphabet index 5 (E). Likewise, bit 7 (0x80) at index 5 corresponds to the fact that position 7 is the alphabet index 5 (E).

This representation makes possible to make certain dynamic programming algorithms bit-parallel. In other words, some table updates of certain top-down dynamic programming algorithms can be represented as logical expressions (with some arithmetic ones to enable e.g. horizontal propagation). This is particularly effective on ssdeep because each block hash has a maximum size of BlockHash::FULL_SIZE (64; many 64-bit machines would handle that efficiently and even 32-bit machines can benefit from).

This is so fast so that the bit-parallel approach is still faster even if we don’t use any batching.

For an example of such algorithms, see Bitap algorithm.

Important Note: Length not included

Note that this struct does not include its length. The length must be given from outside.

Implementations§

source§

impl BlockHashPositionArray

source

pub fn new() -> Self

Creates empty position array-based block hash object without length.

Because this object doesn’t have any characters, it is only valid on length zero.

source

pub fn representation(&self) -> [u64; 64]

Returns raw representation of the block hash position array.

source

pub fn clear(&mut self)

Clears current representation of the block hash.

source

pub fn init_from(&mut self, blockhash: &[u8])

Clear and initialize (encode) the object from given slice.

Usage Constraints
source

pub unsafe fn is_equiv_unchecked(&self, len: u8, other: &[u8]) -> bool

Available on crate feature unsafe only.

Compare whether two block hashes are equivalent.

Safety
  • This object must be valid for given length len.
  • The length of other must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.

If those conditions are not satisfied, the behavior is not defined.

source

pub fn is_equiv(&self, len: u8, other: &[u8]) -> bool

Compare whether two block hashes are equivalent.

Usage Constraints
  • This object must be valid for given length len.
  • The length of other must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.
source

pub fn is_valid(&self, len: u8) -> bool

Performs full validity checking of a position array considering given length.

source

pub unsafe fn has_common_substring_unchecked( &self, len: u8, other: &[u8] ) -> bool

Available on crate feature unsafe only.

Checks whether given two strings have common substrings with length of MIN_LCS_FOR_BLOCKHASH.

Algorithm Implemented

This function implements Boyer–Moore-like bit-parallel algorithm to find fixed length common substring. The original algorithm is the Backward Shift-Add algorithm for k-LCF problem [Hirvola, 2016] (which searches longest common substring with up to k errors under the Hamming distance).

This algorithm is modified:

  • to search only perfect matches (up to 0 errors),
  • to return as soon as possible if it finds a common substring and
  • to share position array representation with edit_distance (the original algorithm used reverse “incidence matrix”).
Safety
  • This object must be valid for given length len.
  • len (the length of this object) must be 64 or less.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.

If those conditions are not satisfied, the behavior is not defined.

source

pub fn has_common_substring(&self, len: u8, other: &[u8]) -> bool

Checks whether given two strings have common substrings with length of MIN_LCS_FOR_BLOCKHASH.

Algorithm Implemented

This function implements Boyer–Moore-like bit-parallel algorithm to find fixed length common substring. The original algorithm is the Backward Shift-Add algorithm for k-LCF problem [Hirvola, 2016] (which searches longest common substring with up to k errors under the Hamming distance).

This algorithm is modified:

  • to search only perfect matches (up to 0 errors),
  • to return as soon as possible if it finds a common substring and
  • to share position array representation with edit_distance (the original algorithm used reverse “incidence matrix”).
Usage Constraints
  • This object must be valid for given length len.
  • len (the length of this object) must be 64 or less.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.
source

pub unsafe fn edit_distance_unchecked(&self, len: u8, other: &[u8]) -> u32

Available on crate feature unsafe only.

Computes the edit distance between two given strings

In specific, it computes the Longest Common Subsequence (LCS) distance, allowing character insertion and deletion as two primitive operations (in cost 1).

While this method requires that this object is not empty, its user, score_strings ensures that both this object and other are at least MIN_LCS_FOR_BLOCKHASH in length.

Algorithm Implemented

[Hyyrö et al., 2005] (doi:10.1007/11427186_33) presented a way to compute so called Indel-Distance using a bit-parallel approach and this method is based on it.

This algorithm is needed to be modified for our purpose because the purpose of the original algorithm is to find a “substring” which is similar to a pattern string.

Safety
  • This object must be valid for given length len.
  • len (the length of this object) must be 64 or less.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.

If those conditions are not satisfied, the behavior is not defined.

source

pub fn edit_distance(&self, len: u8, other: &[u8]) -> u32

Computes the edit distance between two given strings

In specific, it computes the Longest Common Subsequence (LCS) distance, allowing character insertion and deletion as two primitive operations (in cost 1).

While this method requires that this object is not empty, its user, score_strings ensures that both this object and other are at least MIN_LCS_FOR_BLOCKHASH in length.

Algorithm Implemented

[Hyyrö et al., 2005] (doi:10.1007/11427186_33) presented a way to compute so called Indel-Distance using a bit-parallel approach and this method is based on it.

This algorithm is needed to be modified for our purpose because the purpose of the original algorithm is to find a “substring” which is similar to a pattern string.

Usage Constraints
  • This object must be valid for given length len.
  • len (the length of this object) must be 64 or less.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.
source

pub unsafe fn score_strings_raw_unchecked(&self, len: u8, other: &[u8]) -> u32

Available on crate feature unsafe only.

Compare two block hashes and computes the similarity score.

This method does not “cap” the score to prevent exaggregating the matches that are not meaningful enough, making this function block size independent.

Safety

If those conditions are not satisfied, the behavior is not defined.

source

pub fn score_strings_raw(&self, len: u8, other: &[u8]) -> u32

Compare two block hashes and computes the similarity score.

This method does not “cap” the score to prevent exaggregating the matches that are not meaningful enough, making this function block size independent.

Usage Constraints
source

pub unsafe fn score_strings_unchecked( &self, len: u8, other: &[u8], log_block_size: u8 ) -> u32

Available on crate feature unsafe only.

Compare two block hashes and computes the similarity score.

This method “caps” the score to prevent exaggregating the matches that are not meaningful enough. This behavior depends on the block size (score cap gets higher when block size gets higher).

Safety

If those conditions are not satisfied, the behavior is not defined.

source

pub fn score_strings(&self, len: u8, other: &[u8], log_block_size: u8) -> u32

Compare two block hashes and computes the similarity score.

This method “caps” the score to prevent exaggregating the matches that are not meaningful enough. This behavior depends on the block size (score cap gets higher when block size gets higher).

Usage Constraints
source

pub const fn element_has_sequences(pa_elem: u64, len: u32) -> bool

Checks whether given position array entry has a sequence of the given length (or longer).

Performance Analysis

This function expects many constant foldings assuming constant len. has_sequences_const forces to do that.

source

pub const fn element_has_sequences_const<const LEN: u32>(pa_elem: u64) -> bool

The same function as element_has_sequences but made to be a generic function.

It improves the performance by intensive constant foldings.

Trait Implementations§

source§

impl Clone for BlockHashPositionArray

source§

fn clone(&self) -> BlockHashPositionArray

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for BlockHashPositionArray

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Default for BlockHashPositionArray

source§

fn default() -> Self

Returns the “default value” for a type. Read more
source§

impl PartialEq<BlockHashPositionArray> for BlockHashPositionArray

source§

fn eq(&self, other: &BlockHashPositionArray) -> bool

This method tests for self and other values to be equal, and is used by ==.
1.0.0 · source§

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
source§

impl Copy for BlockHashPositionArray

source§

impl Eq for BlockHashPositionArray

source§

impl StructuralEq for BlockHashPositionArray

source§

impl StructuralPartialEq for BlockHashPositionArray

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for Twhere T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for Twhere T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for Twhere T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for Twhere U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for Twhere T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.