Expand description
§neardup
neardup
is a library for finding near-duplicate spans in a document.
It provides functions to compute n-grams of a text, calculate weighted jaccard similarity, and check whether the document contains spans whose similarity to the query is above a threshold.
§Method
§Fast Near-duplicate Matching
- Input: Suffix $s$, document $d$, and $n$ of $n$-gram
- Output: Whether $d$ has a span near-duplicate to $s$
§Pseudo-code in Python
def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
l_s = len(s)
l_d = len(d)
H = set(ngram(s, n))
for i in range(max(l_d - l_s, 0)):
if d[i:i+n] in H:
for j in range(max(i - l_s + n, 0), i):
t = d[j:j+l_s]
if Jaccard_W(s, t) >= threshold:
return True
return False
You can use fast hash functions like fxhash or rolling hash. When the size of $n$ of $n$-gram is small, fxhash is faster than rolling hash. However, when the size of $n$ is large, rolling hash is faster than fxhash because the rolling hash can calculate the hash value of the next $n$-gram in $O(1)$ time.
Structs§
- Rolling
Hash - A struct for rolling hash.
Functions§
- has_
doc_ duplicate - Check whether the document contains spans whose similarity to the query is above a threshold using rabin-karp method with fxhash.
- has_
doc_ duplicate_ naive - Check whether the document contains spans whose similarity to the query is above a threshold using naive method.
- has_
doc_ duplicate_ rolling - Check whether the document contains spans whose similarity to the query is above a threshold using rabin-karp method with rolling hash.
- ngram
- Compute n-grams of a text using fxhash.
- ngram_
rolling - Compute n-grams of a text using rolling hash.
- weighted_
jaccard - Compute weighted jaccard similarity between two texts.