pub struct BlockHashPositionArray { /* private fields */ }Expand description
A position array-based block hash except its length.
Each element of the position array indicates which positions in the corresponding block hash have the given alphabet (note that array index is of the alphabet).
For instance, if representation()[5] == 0x81, it means the block hash
contains the alphabet index 5 in the positions 0 and 7
(block hash glob: E??????E*).
This is because bit 0 0x01 at index 5 means that the position 0 is the
alphabet index 5 (E). Likewise, bit 7 (0x80) at index 5
corresponds to the fact that position 7 is the alphabet index 5 (E).
This representation makes possible to make certain dynamic programming
algorithms bit-parallel. In other words, some table updates of
certain top-down dynamic programming algorithms can be
represented as logical expressions (with some arithmetic ones
to enable e.g. horizontal propagation). This is particularly
effective on ssdeep because each block hash has a maximum size of
BlockHash::FULL_SIZE (64; many 64-bit machines would handle that
efficiently and even 32-bit machines can benefit from).
This is so fast so that the bit-parallel approach is still faster even if we don’t use any batching.
For an example of such algorithms, see Bitap algorithm.
Important Note: Length not included
Note that this struct does not include its length. The length must be given from outside.
Implementations§
source§impl BlockHashPositionArray
impl BlockHashPositionArray
sourcepub fn new() -> Self
pub fn new() -> Self
Creates empty position array-based block hash object without length.
Because this object doesn’t have any characters, it is only valid on length zero.
sourcepub fn representation(&self) -> [u64; 64]
pub fn representation(&self) -> [u64; 64]
Returns raw representation of the block hash position array.
sourcepub fn init_from(&mut self, blockhash: &[u8])
pub fn init_from(&mut self, blockhash: &[u8])
Clear and initialize (encode) the object from given slice.
Usage Constraints
- The length of
blockhashmust not exceed 64. - All elements in
blockhashmust be less thanBlockHash::ALPHABET_SIZE.
sourcepub fn is_equiv(&self, len: u8, other: &[u8]) -> bool
pub fn is_equiv(&self, len: u8, other: &[u8]) -> bool
Compare whether two block hashes are equivalent.
Usage Constraints
- This object must be valid for given length
len. - The length of
othermust not exceed 64. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub fn is_valid(&self, len: u8) -> bool
pub fn is_valid(&self, len: u8) -> bool
Performs full validity checking of a position array considering given length.
sourcepub fn has_common_substring(&self, len: u8, other: &[u8]) -> bool
pub fn has_common_substring(&self, len: u8, other: &[u8]) -> bool
Checks whether given two strings have common substrings with length
of MIN_LCS_FOR_BLOCKHASH.
Algorithm Implemented
This function implements Boyer–Moore-like bit-parallel algorithm to find fixed length common substring. The original algorithm is the Backward Shift-Add algorithm for k-LCF problem [Hirvola, 2016] (which searches longest common substring with up to k errors under the Hamming distance).
This algorithm is modified:
- to search only perfect matches (up to 0 errors),
- to return as soon as possible if it finds a common substring and
- to share position array representation with
edit_distance(the original algorithm used reverse “incidence matrix”).
Usage Constraints
- This object must be valid for given length
len. len(the length of this object) must be64or less.- All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub fn edit_distance(&self, len: u8, other: &[u8]) -> u32
pub fn edit_distance(&self, len: u8, other: &[u8]) -> u32
Computes the edit distance between two given strings
In specific, it computes the Longest Common Subsequence (LCS) distance, allowing character insertion and deletion as two primitive operations (in cost 1).
While this method requires that this object is not empty,
its user, score_strings ensures that both this
object and other are at least
MIN_LCS_FOR_BLOCKHASH
in length.
Algorithm Implemented
[Hyyrö et al., 2005] (doi:10.1007/11427186_33) presented a way to compute so called Indel-Distance using a bit-parallel approach and this method is based on it.
This algorithm is needed to be modified for our purpose because the purpose of the original algorithm is to find a “substring” which is similar to a pattern string.
Usage Constraints
- This object must be valid for given length
len. len(the length of this object) must be64or less.- All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub fn score_strings_raw(&self, len: u8, other: &[u8]) -> u32
pub fn score_strings_raw(&self, len: u8, other: &[u8]) -> u32
Compare two block hashes and computes the similarity score.
This method does not “cap” the score to prevent exaggregating the matches that are not meaningful enough, making this function block size independent.
Usage Constraints
- This object must be valid for given length
len. len(the length of this) must not exceedBlockHash::FULL_SIZE.- The length of
othermust not exceedBlockHash::FULL_SIZE. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub fn score_strings(&self, len: u8, other: &[u8], log_block_size: u8) -> u32
pub fn score_strings(&self, len: u8, other: &[u8], log_block_size: u8) -> u32
Compare two block hashes and computes the similarity score.
This method “caps” the score to prevent exaggregating the matches that are not meaningful enough. This behavior depends on the block size (score cap gets higher when block size gets higher).
Usage Constraints
- This object must be valid for given length
len. len(the length of this) must not exceedBlockHash::FULL_SIZE.- The length of
othermust not exceedBlockHash::FULL_SIZE. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE. log_block_sizemust be valid.
sourcepub const fn element_has_sequences(pa_elem: u64, len: u32) -> bool
pub const fn element_has_sequences(pa_elem: u64, len: u32) -> bool
Checks whether given position array entry has a sequence of the given length (or longer).
Performance Analysis
This function expects many constant foldings assuming constant len.
has_sequences_const forces
to do that.
sourcepub const fn element_has_sequences_const<const LEN: u32>(pa_elem: u64) -> bool
pub const fn element_has_sequences_const<const LEN: u32>(pa_elem: u64) -> bool
The same function as element_has_sequences
but made to be a generic function.
It improves the performance by intensive constant foldings.
Trait Implementations§
source§impl Clone for BlockHashPositionArray
impl Clone for BlockHashPositionArray
source§fn clone(&self) -> BlockHashPositionArray
fn clone(&self) -> BlockHashPositionArray
1.0.0 · source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source. Read moresource§impl Debug for BlockHashPositionArray
impl Debug for BlockHashPositionArray
source§impl Default for BlockHashPositionArray
impl Default for BlockHashPositionArray
source§impl PartialEq<BlockHashPositionArray> for BlockHashPositionArray
impl PartialEq<BlockHashPositionArray> for BlockHashPositionArray
source§fn eq(&self, other: &BlockHashPositionArray) -> bool
fn eq(&self, other: &BlockHashPositionArray) -> bool
self and other values to be equal, and is used
by ==.