Crate find_simdoc
Time- and memory-efficient all-pairs similarity search over documents. A more detailed description can be found on the project page.
Problem definition
- Input
  - List of documents
  - Distance function
  - Radius threshold
- Output
  - All pairs of similar document ids
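To make the problem definition concrete, the following is a minimal brute-force baseline. It is only a sketch under assumptions: it uses the Jaccard distance over character n-gram sets, and the names `char_ngrams`, `jaccard_distance`, and `all_pairs_brute_force` are hypothetical helpers written for this illustration, not part of this crate's API. It compares every pair directly and therefore takes quadratic time, which is the cost the LSH-based search is meant to avoid.

```rust
use std::collections::HashSet;

/// Set of character n-grams of `text` (hypothetical helper, not the crate's API).
fn char_ngrams(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return HashSet::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
/// Two empty sets are treated as identical (distance 0).
fn jaccard_distance(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    let inter = a.intersection(b).count();
    1.0 - inter as f64 / union as f64
}

/// Returns every pair (i, j) with i < j whose distance is at most `radius`,
/// together with that distance. Quadratic in the number of documents.
fn all_pairs_brute_force(docs: &[&str], n: usize, radius: f64) -> Vec<(usize, usize, f64)> {
    let feats: Vec<HashSet<String>> = docs.iter().map(|d| char_ngrams(d, n)).collect();
    let mut out = Vec::new();
    for i in 0..feats.len() {
        for j in (i + 1)..feats.len() {
            let d = jaccard_distance(&feats[i], &feats[j]);
            if d <= radius {
                out.push((i, j, d));
            }
        }
    }
    out
}
```

For example, calling `all_pairs_brute_force(&["abcd", "abce", "wxyz"], 2, 0.5)` reports only the pair of document ids `(0, 1)`, because those two documents share two of their four distinct bigrams.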
Features
Easy to use
This software supports all essential steps of document similarity search, from feature extraction to the output of similar pairs, so you can run a fast all-pairs similarity search directly on your document files.
Flexible tokenization
You can specify any delimiter for splitting text into words when tokenizing for feature extraction. This is useful for languages in which multiple definitions of a word exist, such as Japanese or Chinese.
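The idea of delimiter-driven tokenization can be sketched as follows. This is an illustration only: `tokenize` and `ngrams` are hypothetical helpers, and the convention that a missing delimiter means character-level tokens is an assumption of this sketch rather than a statement about the crate's interface.

```rust
/// Splits `text` into tokens: on `delimiter` if one is given,
/// otherwise into individual characters (assumed convention).
fn tokenize(text: &str, delimiter: Option<char>) -> Vec<String> {
    match delimiter {
        Some(d) => text
            .split(d)
            .filter(|t| !t.is_empty()) // skip empty tokens from repeated delimiters
            .map(str::to_string)
            .collect(),
        None => text.chars().map(|c| c.to_string()).collect(),
    }
}

/// Joins each window of `n` consecutive tokens into one n-gram feature.
fn ngrams(tokens: &[String], n: usize, joiner: &str) -> Vec<String> {
    if n == 0 || tokens.len() < n {
        return Vec::new();
    }
    tokens.windows(n).map(|w| w.join(joiner)).collect()
}
```

With text that has been pre-segmented by an external tool, such as `"今日 は 晴れ"`, `tokenize(text, Some(' '))` yields three word tokens, and `ngrams(&tokens, 2, " ")` yields the two word bigrams. Changing the segmenter changes the features without touching the rest of the pipeline.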
Time and memory efficiency
The time and memory complexities are linear in the number of input documents and output results, building on the ideas behind locality-sensitive hashing (LSH) and the sketch sorting approach.
Tunable search performance
LSH lets you trade off accuracy, time, and memory through a manually set parameter that specifies the number of search dimensions. You can tune searches to suit your dataset and machine environment.
- Specifying lower dimensions allows for faster and rougher searches with less memory usage.
- Specifying higher dimensions allows for more accurate searches with more memory usage.
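The trade-off can be illustrated with a small sign-random-projection sketch for the cosine similarity, where `dims` plays the role of the dimension parameter: each extra dimension costs one more bit per document and reduces the variance of the estimate. This is a sketch under assumptions, not the crate's implementation. The helpers `mix64`, `simhash`, `hamming_bits`, and `estimate_cosine` are hypothetical, the random signs come from a SplitMix64-style mixer, and using ±1 projections instead of Gaussian ones makes the estimate approximate.

```rust
/// SplitMix64-style mixer, used here as a cheap deterministic hash.
fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Binary sketch of a sparse vector given as (feature id, weight) pairs.
/// Each of the `dims` bits is the sign of a pseudo-random ±1 projection.
fn simhash(features: &[(u64, f64)], dims: usize, seed: u64) -> Vec<bool> {
    (0..dims)
        .map(|bit| {
            let salt = mix64(seed ^ bit as u64);
            let sum: f64 = features
                .iter()
                .map(|&(id, w)| if mix64(id ^ salt) & 1 == 1 { w } else { -w })
                .sum();
            sum >= 0.0
        })
        .collect()
}

/// Number of positions at which two equal-length sketches differ.
fn hamming_bits(a: &[bool], b: &[bool]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Estimates the cosine similarity from the fraction of differing bits,
/// using the relation P[bit differs] ≈ angle / π. Sketches must be non-empty.
fn estimate_cosine(a: &[bool], b: &[bool]) -> f64 {
    let frac = hamming_bits(a, b) as f64 / a.len() as f64;
    (std::f64::consts::PI * frac).cos()
}
```

Two identical vectors always produce identical sketches, so the estimate is 1 regardless of `dims`. For vectors in between, a larger `dims` gives a finer-grained and more stable estimate at the price of longer sketches.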
Search steps
- Extract features from documents
  - Set representation of character or word n-grams
  - TF-IDF-weighted vector representation of character or word n-grams
- Convert the features into binary sketches through locality-sensitive hashing
  - 1-bit minwise hashing for the Jaccard similarity
  - Simplified simhash for the cosine similarity
- Search for similar sketches in the Hamming space using a modified variant of the sketch sorting approach
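The second step for the Jaccard case can be sketched as below: a set of feature ids is turned into a packed binary sketch by 1-bit minwise hashing, and two sketches are then compared by Hamming distance. This is a minimal illustration under assumptions, not the crate's code. The helpers `mix64`, `one_bit_minhash`, `hamming_words`, and `estimate_jaccard_distance` are hypothetical, and the hash family is a simple salted SplitMix64-style mixer. The final search step (sketch sorting) is not shown; a linear comparison stands in for it.

```rust
/// SplitMix64-style mixer, used here as a cheap deterministic hash.
fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// 1-bit minwise hashing: for each of `dims` salted hash functions, keep only
/// the lowest bit of the minimum hash value over the set. Bits are packed
/// into 64-bit words. An empty set yields an all-zero sketch.
fn one_bit_minhash(features: &[u64], dims: usize, seed: u64) -> Vec<u64> {
    let mut sketch = vec![0u64; (dims + 63) / 64];
    for bit in 0..dims {
        let salt = mix64(seed ^ bit as u64);
        let min = features.iter().map(|&f| mix64(f ^ salt)).min().unwrap_or(0);
        sketch[bit / 64] |= (min & 1) << (bit % 64);
    }
    sketch
}

/// Hamming distance between two packed sketches of equal length.
fn hamming_words(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Estimated Jaccard distance. With 1-bit minwise hashing a bit differs with
/// probability (1 - J) / 2, so the distance 1 - J is about 2 * hamming / dims.
/// The result is clamped to 1. `dims` must be greater than zero.
fn estimate_jaccard_distance(a: &[u64], b: &[u64], dims: usize) -> f64 {
    (2.0 * hamming_words(a, b) as f64 / dims as f64).min(1.0)
}
```

Because each bit depends only on the minimum hash over the set, the order of the features does not matter, and identical sets always receive identical sketches. Sets that differ receive sketches whose Hamming distance grows, in expectation, with their Jaccard distance.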
Re-exports
pub use cosine::CosineSearcher;
pub use jaccard::JaccardSearcher;