pub struct BlockHashPositionArray { /* private fields */ }
Expand description

A position array-based block hash except its length.

Each element of the position array indicates which positions in the corresponding block hash has the given alphabet (note that the array index is of the alphabet).

For instance, if representation()[5] == 0x81, it means the block hash contains the alphabet index 5 in the positions 0 and 7 (block hash glob: E??????E*).

This is because the bit 0 (0x01) at the index 5 means that position 0 has the alphabet with index 5 (E). Likewise, the bit 7 (0x80) at the index 5 corresponds to the fact that position 7 has the alphabet with index 5 (E).

This representation makes it possible to make some dynamic programming algorithms bit-parallel. In other words, some table updates of certain top-down dynamic programming algorithms can be represented as logical expressions (with some arithmetic ones to enable, for instance, horizontal propagation). This is particularly effective on ssdeep because each block hash has a maximum size of BlockHash::FULL_SIZE (64; many 64-bit machines would handle that efficiently and even 32-bit machines can benefit from).

This is so fast that the bit-parallel approach is still faster even if we don’t use any batching.

For an example of such algorithms, see Bitap algorithm.

Important Note: Length not included

Note that this struct does not contain its length inside. The length must be given from outside each time you call the methods.

Incompatibility Notice

In v0.2.0, this struct is completely incompatible from v0.1.x.

Implementations§

source§

impl BlockHashPositionArray

source

pub fn new() -> Self

Creates empty position array-based block hash object without length.

Because the resulting object doesn’t contain any characters, it is only valid on length zero.

source

pub fn representation(&self) -> [u64; 64]

Returns the raw representation of the block hash position array.

source

pub fn clear(&mut self)

Clears the current representation of the block hash.

source

pub fn init_from(&mut self, blockhash: &[u8])

Clear and initialize (encode) the object from a given slice.

Usage Constraints
source

pub unsafe fn is_equiv_unchecked(&self, len: u8, other: &[u8]) -> bool

Available on crate feature unsafe only.

Compare whether two block hashes are equivalent.

Safety
  • This object must be valid on a given length len.
  • The length of other must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.

If they are not satisfied, it will return a meaningless value.

source

pub fn is_equiv(&self, len: u8, other: &[u8]) -> bool

Compare whether two block hashes are equivalent.

Usage Constraints
  • This object must be valid on a given length len.
  • The length of other must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.
source

pub fn is_valid(&self, len: u8) -> bool

Performs full validity checking of a position array considering a given length.

source

pub unsafe fn has_common_substring_unchecked( &self, len: u8, other: &[u8] ) -> bool

Available on crate feature unsafe only.

Checks whether two given strings have common substrings with a length of MIN_LCS_FOR_BLOCKHASH.

Algorithm Implemented

This function implements a Boyer–Moore-like bit-parallel algorithm to find a fixed-length common substring. The original algorithm is the Backward Shift-Add algorithm for the k-LCF problem [Hirvola, 2016] (which searches the longest common substring with up to k errors under the Hamming distance).

This algorithm is modified:

  • to search only perfect matches (up to 0 errors),
  • to return as soon as possible if it finds a common substring and
  • to share the position array representation with edit_distance (the original algorithm used reverse “incidence matrix”).
Safety
  • This object must be valid on a given length len.
  • len (the length of this object) must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.

If they are not satisfied, it will return a meaningless value.

source

pub fn has_common_substring(&self, len: u8, other: &[u8]) -> bool

Checks whether two given strings have common substrings with a length of MIN_LCS_FOR_BLOCKHASH.

Algorithm Implemented

This function implements a Boyer–Moore-like bit-parallel algorithm to find a fixed-length common substring. The original algorithm is the Backward Shift-Add algorithm for the k-LCF problem [Hirvola, 2016] (which searches the longest common substring with up to k errors under the Hamming distance).

This algorithm is modified:

  • to search only perfect matches (up to 0 errors),
  • to return as soon as possible if it finds a common substring and
  • to share the position array representation with edit_distance (the original algorithm used reverse “incidence matrix”).
Usage Constraints
  • This object must be valid on a given length len.
  • len (the length of this object) must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.
source

pub unsafe fn edit_distance_unchecked(&self, len: u8, other: &[u8]) -> u32

Available on crate feature unsafe only.

Computes the edit distance between two given strings

Specifically, it computes the Longest Common Subsequence (LCS) distance, allowing character insertion and deletion as two primitive operations (in cost 1).

Algorithm Implemented

[Hyyrö et al., 2005] (doi:10.1007/11427186_33) presented a way to compute so called Indel-Distance using a bit-parallel approach and this method is based on it.

This algorithm is needed to be modified for our purpose because the purpose of the original algorithm is to find a “substring” similar to a pattern string.

Safety
  • This object must be valid on a given length len.
  • len (the length of this object) must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.

If they are not satisfied, it will return a meaningless distance.

source

pub fn edit_distance(&self, len: u8, other: &[u8]) -> u32

Computes the edit distance between two given strings

Specifically, it computes the Longest Common Subsequence (LCS) distance, allowing character insertion and deletion as two primitive operations (in cost 1).

Algorithm Implemented

[Hyyrö et al., 2005] (doi:10.1007/11427186_33) presented a way to compute so called Indel-Distance using a bit-parallel approach and this method is based on it.

This algorithm is needed to be modified for our purpose because the purpose of the original algorithm is to find a “substring” similar to a pattern string.

Usage Constraints
  • This object must be valid on a given length len.
  • len (the length of this object) must not exceed 64.
  • All elements in other must be less than BlockHash::ALPHABET_SIZE.
source

pub unsafe fn score_strings_raw_unchecked(&self, len: u8, other: &[u8]) -> u32

Available on crate feature unsafe only.

Compare two block hashes and computes the similarity score without capping.

This method does not “cap” the score to prevent exaggregating the matches that are not meaningful enough, making this function block size independent.

Safety

If they are not satisfied, it will return a meaningless score.

source

pub fn score_strings_raw(&self, len: u8, other: &[u8]) -> u32

Compare two block hashes and computes the similarity score without capping.

This method does not “cap” the score to prevent exaggregating the matches that are not meaningful enough, making this function block size independent.

Usage Constraints
source

pub unsafe fn score_strings_unchecked( &self, len: u8, other: &[u8], log_block_size: u8 ) -> u32

Available on crate feature unsafe only.

Compare two block hashes and computes the similarity score.

This method “caps” the score to prevent exaggregating the matches that are not meaningful enough. This behavior depends on the block size (score cap gets higher when block size gets higher).

Safety

If they are not satisfied, it will return a meaningless score.

source

pub fn score_strings(&self, len: u8, other: &[u8], log_block_size: u8) -> u32

Compare two block hashes and computes the similarity score.

This method “caps” the score to prevent exaggregating the matches that are not meaningful enough. This behavior depends on the block size (score cap gets higher when block size gets higher).

Usage Constraints
source

pub const fn element_has_sequences(pa_elem: u64, len: u32) -> bool

Checks whether a given position array entry has a sequence of the given length (or longer).

Performance Analysis

This function expects many constant foldings assuming constant len. has_sequences_const forces to do that.

source

pub const fn element_has_sequences_const<const LEN: u32>(pa_elem: u64) -> bool

The generic variant of element_has_sequences.

It improves the performance by intensive constant foldings.

Trait Implementations§

source§

impl Clone for BlockHashPositionArray

source§

fn clone(&self) -> BlockHashPositionArray

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for BlockHashPositionArray

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Default for BlockHashPositionArray

source§

fn default() -> Self

Returns the “default value” for a type. Read more
source§

impl PartialEq<BlockHashPositionArray> for BlockHashPositionArray

source§

fn eq(&self, other: &BlockHashPositionArray) -> bool

This method tests for self and other values to be equal, and is used by ==.
1.0.0 · source§

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
source§

impl Copy for BlockHashPositionArray

source§

impl Eq for BlockHashPositionArray

source§

impl StructuralEq for BlockHashPositionArray

source§

impl StructuralPartialEq for BlockHashPositionArray

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for Twhere T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for Twhere T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for Twhere T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for Twhere U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for Twhere T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.