Crate neardup

Source
Expand description

§neardup

neardup is a library for finding near-duplicate spans in a document. It provides functions to compute n-grams of a text, calculate weighted jaccard similarity, and check whether the document contains spans whose similarity to the query is above a threshold.

§Method

§Fast Near-duplicate Matching

  • Input: Suffix $s$, document $d$, and $n$ of $n$-gram
  • Output: Whether $d$ has a span near-duplicate to $s$
§Pseudo-code in Python
def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
    l_s = len(s)
    l_d = len(d)
    H = set(ngram(s, n))
    for i in range(max(l_d - l_s, 0)):
        if d[i:i+n] in H:
            for j in range(max(i - l_s + n, 0), i):
                t = d[j:j+l_s]
                if Jaccard_W(s, t) >= threshold:
                    return True
    return False

You can use fast hash functions like fxhash or rolling hash. When the size of $n$ of $n$-gram is small, fxhash is faster than rolling hash. However, when the size of $n$ is large, rolling hash is faster than fxhash because the rolling hash can calculate the hash value of the next $n$-gram in $O(1)$ time.

Structs§

RollingHash
A struct for rolling hash.

Functions§

has_doc_duplicate
Check whether the document contains spans whose similarity to the query is above a threshold using rabin-karp method with fxhash.
has_doc_duplicate_naive
Check whether the document contains spans whose similarity to the query is above a threshold using naive method.
has_doc_duplicate_rolling
Check whether the document contains spans whose similarity to the query is above a threshold using rabin-karp method with rolling hash.
ngram
Compute n-grams of a text using fxhash.
ngram_rolling
Compute n-grams of a text using rolling hash.
weighted_jaccard
Compute weighted jaccard similarity between two texts.