pub struct BlockHashPositionArray { /* private fields */ }Expand description
A position array-based block hash except its length.
Each element of the position array indicates which positions in the corresponding block hash has the given alphabet (note that the array index is of the alphabet).
For instance, if representation()[5] == 0x81, it means the block hash
contains the alphabet index 5 in the positions 0 and 7
(block hash glob: E??????E*).
This is because the bit 0 (0x01) at the index 5 means that position 0 has
the alphabet with index 5 (E). Likewise, the bit 7 (0x80) at the
index 5 corresponds to the fact that position 7 has the alphabet with
index 5 (E).
This representation makes it possible to make some dynamic programming
algorithms bit-parallel. In other words, some table updates of
certain top-down dynamic programming algorithms can be
represented as logical expressions (with some arithmetic ones
to enable, for instance, horizontal propagation). This is particularly
effective on ssdeep because each block hash has a maximum size of
BlockHash::FULL_SIZE (64; many 64-bit machines would handle that
efficiently and even 32-bit machines can benefit from).
This is so fast that the bit-parallel approach is still faster even if we don’t use any batching.
For an example of such algorithms, see Bitap algorithm.
Important Note: Length not included
Note that this struct does not contain its length inside. The length must be given from outside each time you call the methods.
Incompatibility Notice
In v0.2.0, this struct is completely incompatible from v0.1.x.
Implementations§
source§impl BlockHashPositionArray
impl BlockHashPositionArray
sourcepub fn new() -> Self
pub fn new() -> Self
Creates empty position array-based block hash object without length.
Because the resulting object doesn’t contain any characters, it is only valid on length zero.
sourcepub fn representation(&self) -> [u64; 64]
pub fn representation(&self) -> [u64; 64]
Returns the raw representation of the block hash position array.
sourcepub fn init_from(&mut self, blockhash: &[u8])
pub fn init_from(&mut self, blockhash: &[u8])
Clear and initialize (encode) the object from a given slice.
Usage Constraints
- The length of
blockhashmust not exceed 64. - All elements in
blockhashmust be less thanBlockHash::ALPHABET_SIZE.
sourcepub unsafe fn is_equiv_unchecked(&self, len: u8, other: &[u8]) -> bool
Available on crate feature unsafe only.
pub unsafe fn is_equiv_unchecked(&self, len: u8, other: &[u8]) -> bool
unsafe only.Compare whether two block hashes are equivalent.
Safety
- This object must be valid on a given length
len. - The length of
othermust not exceed 64. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
If they are not satisfied, it will return a meaningless value.
sourcepub fn is_equiv(&self, len: u8, other: &[u8]) -> bool
pub fn is_equiv(&self, len: u8, other: &[u8]) -> bool
Compare whether two block hashes are equivalent.
Usage Constraints
- This object must be valid on a given length
len. - The length of
othermust not exceed 64. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub fn is_valid(&self, len: u8) -> bool
pub fn is_valid(&self, len: u8) -> bool
Performs full validity checking of a position array considering a given length.
sourcepub unsafe fn has_common_substring_unchecked(
&self,
len: u8,
other: &[u8]
) -> bool
Available on crate feature unsafe only.
pub unsafe fn has_common_substring_unchecked( &self, len: u8, other: &[u8] ) -> bool
unsafe only.Checks whether two given strings have common substrings with a length
of MIN_LCS_FOR_BLOCKHASH.
Algorithm Implemented
This function implements a Boyer–Moore-like bit-parallel algorithm to find a fixed-length common substring. The original algorithm is the Backward Shift-Add algorithm for the k-LCF problem [Hirvola, 2016] (which searches the longest common substring with up to k errors under the Hamming distance).
This algorithm is modified:
- to search only perfect matches (up to 0 errors),
- to return as soon as possible if it finds a common substring and
- to share the position array representation with
edit_distance(the original algorithm used reverse “incidence matrix”).
Safety
- This object must be valid on a given length
len. len(the length of this object) must not exceed 64.- All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
If they are not satisfied, it will return a meaningless value.
sourcepub fn has_common_substring(&self, len: u8, other: &[u8]) -> bool
pub fn has_common_substring(&self, len: u8, other: &[u8]) -> bool
Checks whether two given strings have common substrings with a length
of MIN_LCS_FOR_BLOCKHASH.
Algorithm Implemented
This function implements a Boyer–Moore-like bit-parallel algorithm to find a fixed-length common substring. The original algorithm is the Backward Shift-Add algorithm for the k-LCF problem [Hirvola, 2016] (which searches the longest common substring with up to k errors under the Hamming distance).
This algorithm is modified:
- to search only perfect matches (up to 0 errors),
- to return as soon as possible if it finds a common substring and
- to share the position array representation with
edit_distance(the original algorithm used reverse “incidence matrix”).
Usage Constraints
- This object must be valid on a given length
len. len(the length of this object) must not exceed 64.- All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub unsafe fn edit_distance_unchecked(&self, len: u8, other: &[u8]) -> u32
Available on crate feature unsafe only.
pub unsafe fn edit_distance_unchecked(&self, len: u8, other: &[u8]) -> u32
unsafe only.Computes the edit distance between two given strings
Specifically, it computes the Longest Common Subsequence (LCS) distance, allowing character insertion and deletion as two primitive operations (in cost 1).
Algorithm Implemented
[Hyyrö et al., 2005] (doi:10.1007/11427186_33) presented a way to compute so called Indel-Distance using a bit-parallel approach and this method is based on it.
This algorithm is needed to be modified for our purpose because the purpose of the original algorithm is to find a “substring” similar to a pattern string.
Safety
- This object must be valid on a given length
len. len(the length of this object) must not exceed 64.- All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
If they are not satisfied, it will return a meaningless distance.
sourcepub fn edit_distance(&self, len: u8, other: &[u8]) -> u32
pub fn edit_distance(&self, len: u8, other: &[u8]) -> u32
Computes the edit distance between two given strings
Specifically, it computes the Longest Common Subsequence (LCS) distance, allowing character insertion and deletion as two primitive operations (in cost 1).
Algorithm Implemented
[Hyyrö et al., 2005] (doi:10.1007/11427186_33) presented a way to compute so called Indel-Distance using a bit-parallel approach and this method is based on it.
This algorithm is needed to be modified for our purpose because the purpose of the original algorithm is to find a “substring” similar to a pattern string.
Usage Constraints
- This object must be valid on a given length
len. len(the length of this object) must not exceed 64.- All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub unsafe fn score_strings_raw_unchecked(&self, len: u8, other: &[u8]) -> u32
Available on crate feature unsafe only.
pub unsafe fn score_strings_raw_unchecked(&self, len: u8, other: &[u8]) -> u32
unsafe only.Compare two block hashes and computes the similarity score without capping.
This method does not “cap” the score to prevent exaggregating the matches that are not meaningful enough, making this function block size independent.
Safety
- This object must be valid on a given length
len. len(the length of this object) must not exceedBlockHash::FULL_SIZE.- The length of
othermust not exceedBlockHash::FULL_SIZE. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
If they are not satisfied, it will return a meaningless score.
sourcepub fn score_strings_raw(&self, len: u8, other: &[u8]) -> u32
pub fn score_strings_raw(&self, len: u8, other: &[u8]) -> u32
Compare two block hashes and computes the similarity score without capping.
This method does not “cap” the score to prevent exaggregating the matches that are not meaningful enough, making this function block size independent.
Usage Constraints
- This object must be valid on a given length
len. len(the length of this object) must not exceedBlockHash::FULL_SIZE.- The length of
othermust not exceedBlockHash::FULL_SIZE. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE.
sourcepub unsafe fn score_strings_unchecked(
&self,
len: u8,
other: &[u8],
log_block_size: u8
) -> u32
Available on crate feature unsafe only.
pub unsafe fn score_strings_unchecked( &self, len: u8, other: &[u8], log_block_size: u8 ) -> u32
unsafe only.Compare two block hashes and computes the similarity score.
This method “caps” the score to prevent exaggregating the matches that are not meaningful enough. This behavior depends on the block size (score cap gets higher when block size gets higher).
Safety
- This object must be valid on a given length
len. len(the length of this object) must not exceedBlockHash::FULL_SIZE.- The length of
othermust not exceedBlockHash::FULL_SIZE. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE. log_block_sizemust be valid.
If they are not satisfied, it will return a meaningless score.
sourcepub fn score_strings(&self, len: u8, other: &[u8], log_block_size: u8) -> u32
pub fn score_strings(&self, len: u8, other: &[u8], log_block_size: u8) -> u32
Compare two block hashes and computes the similarity score.
This method “caps” the score to prevent exaggregating the matches that are not meaningful enough. This behavior depends on the block size (score cap gets higher when block size gets higher).
Usage Constraints
- This object must be valid on a given length
len. len(the length of this object) must not exceedBlockHash::FULL_SIZE.- The length of
othermust not exceedBlockHash::FULL_SIZE. - All elements in
othermust be less thanBlockHash::ALPHABET_SIZE. log_block_sizemust be valid.
sourcepub const fn element_has_sequences(pa_elem: u64, len: u32) -> bool
pub const fn element_has_sequences(pa_elem: u64, len: u32) -> bool
Checks whether a given position array entry has a sequence of the given length (or longer).
Performance Analysis
This function expects many constant foldings assuming constant len.
has_sequences_const forces
to do that.
sourcepub const fn element_has_sequences_const<const LEN: u32>(pa_elem: u64) -> bool
pub const fn element_has_sequences_const<const LEN: u32>(pa_elem: u64) -> bool
The generic variant of element_has_sequences.
It improves the performance by intensive constant foldings.
Trait Implementations§
source§impl Clone for BlockHashPositionArray
impl Clone for BlockHashPositionArray
source§fn clone(&self) -> BlockHashPositionArray
fn clone(&self) -> BlockHashPositionArray
1.0.0 · source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source. Read moresource§impl Debug for BlockHashPositionArray
impl Debug for BlockHashPositionArray
source§impl Default for BlockHashPositionArray
impl Default for BlockHashPositionArray
source§impl PartialEq<BlockHashPositionArray> for BlockHashPositionArray
impl PartialEq<BlockHashPositionArray> for BlockHashPositionArray
source§fn eq(&self, other: &BlockHashPositionArray) -> bool
fn eq(&self, other: &BlockHashPositionArray) -> bool
self and other values to be equal, and is used
by ==.