find_simdoc/
lib.rs

//! Time- and memory-efficient all-pairs similarity search for documents.
//! A more detailed description can be found on the [project page](https://github.com/legalforce-research/find-simdoc).
//!
//! # Problem definition
//!
//! - Input
//!   - List of documents
//!   - Distance function
//!   - Radius threshold
//! - Output
//!   - All pairs of document ids whose distance is within the radius threshold
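//!
//! To make the problem statement concrete, the following is a minimal brute-force
//! sketch of the same input/output contract. It is not how this crate works:
//! the names `jaccard_distance` and `all_pairs` are hypothetical, documents are
//! assumed to be sets of characters, and every pair is compared, which is
//! quadratic in the number of documents.
//!
//! ```rust
//! use std::collections::HashSet;
//!
//! /// Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|.
//! fn jaccard_distance(a: &HashSet<char>, b: &HashSet<char>) -> f64 {
//!     let union = a.union(b).count();
//!     if union == 0 {
//!         return 0.0; // two empty sets are treated as identical
//!     }
//!     let inter = a.intersection(b).count();
//!     1.0 - inter as f64 / union as f64
//! }
//!
//! /// Hypothetical baseline: returns every id pair (i, j), i < j,
//! /// whose distance is at most `radius`.
//! fn all_pairs(docs: &[HashSet<char>], radius: f64) -> Vec<(usize, usize)> {
//!     let mut pairs = Vec::new();
//!     for i in 0..docs.len() {
//!         for j in (i + 1)..docs.len() {
//!             if jaccard_distance(&docs[i], &docs[j]) <= radius {
//!                 pairs.push((i, j));
//!             }
//!         }
//!     }
//!     pairs
//! }
//!
//! fn main() {
//!     let docs: Vec<HashSet<char>> = ["abcd", "abce", "xyz"]
//!         .iter()
//!         .map(|d| d.chars().collect())
//!         .collect();
//!     // Documents 0 and 1 share 3 of 5 distinct characters: distance 1 - 3/5 = 0.4.
//!     assert_eq!(all_pairs(&docs, 0.5), vec![(0, 1)]);
//! }
//! ```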
//!
//! # Features
//!
//! ## Easy to use
//!
//! This software supports all essential steps of document similarity search,
//! from feature extraction to output of similar pairs.
//! You can therefore run a fast all-pairs similarity search directly on your document files.
//!
//! ## Flexible tokenization
//!
//! You can specify any delimiter for splitting text into words during feature extraction.
//! This is useful for languages such as Japanese or Chinese, where word boundaries can be defined in more than one way.
//!
//! ## Time and memory efficiency
//!
//! The time and memory complexities are *linear* in the number of input documents and output results,
//! building on locality-sensitive hashing (LSH) and the [sketch sorting approach](https://proceedings.mlr.press/v13/tabei10a.html).
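//!
//! The block-matching (pigeonhole) idea that block-based searches in the Hamming
//! space rely on can be illustrated as follows. This is a simplified sketch under
//! stated assumptions, not this crate's implementation: sketches are assumed to be
//! single `u64` values, buckets are built with a hash map rather than by sorting,
//! and the function name `similar_pairs` is hypothetical.
//!
//! ```rust
//! use std::collections::{BTreeSet, HashMap};
//!
//! /// Finds all pairs of 64-bit sketches within Hamming distance `radius`
//! /// by bucketing on blocks instead of comparing every pair.
//! fn similar_pairs(sketches: &[u64], radius: u32) -> BTreeSet<(usize, usize)> {
//!     // With radius + 1 blocks, two sketches that differ in at most `radius`
//!     // bits must be identical in at least one block.
//!     let blocks = radius as usize + 1;
//!     assert!(blocks <= 64, "radius must be smaller than the sketch length");
//!     let mut found = BTreeSet::new();
//!     for b in 0..blocks {
//!         let lo = 64 * b / blocks;
//!         let hi = 64 * (b + 1) / blocks;
//!         let width = hi - lo;
//!         let mask = if width == 64 {
//!             u64::MAX
//!         } else {
//!             ((1u64 << width) - 1) << lo
//!         };
//!         // Group sketch ids by the value of the current block.
//!         let mut buckets: HashMap<u64, Vec<usize>> = HashMap::new();
//!         for (id, &s) in sketches.iter().enumerate() {
//!             buckets.entry(s & mask).or_default().push(id);
//!         }
//!         // Verify only the candidates that share a bucket.
//!         for ids in buckets.values() {
//!             for (k, &i) in ids.iter().enumerate() {
//!                 for &j in &ids[k + 1..] {
//!                     if (sketches[i] ^ sketches[j]).count_ones() <= radius {
//!                         found.insert((i, j));
//!                     }
//!                 }
//!             }
//!         }
//!     }
//!     found
//! }
//!
//! fn main() {
//!     let sketches = [0b0000u64, 0b0011, u64::MAX];
//!     let pairs: Vec<_> = similar_pairs(&sketches, 2).into_iter().collect();
//!     // Only sketches 0 and 1 are within Hamming distance 2.
//!     assert_eq!(pairs, vec![(0, 1)]);
//! }
//! ```
//!
//! Pairs that share no block are never compared, which is what keeps the
//! number of verifications small when most documents are dissimilar.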
//!
//! ## Tunable search performance
//!
//! LSH lets you trade off accuracy, time, and memory through a user-specified parameter: the number of search dimensions.
//! You can tune searches to your dataset and machine environment.
//!   - Specifying lower dimensions allows for faster and rougher searches with less memory usage.
//!   - Specifying higher dimensions allows for more accurate searches with more memory usage.
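//!
//! The accuracy side of this trade-off can be illustrated with a small calculation.
//! The sketch below assumes 1-bit minwise hashing with independent bits, where two
//! sets with Jaccard similarity `j` are expected to differ in a fraction
//! `(1 - j) / 2` of their sketch bits. The function names are hypothetical and are
//! not part of this crate.
//!
//! ```rust
//! /// Expected fraction of differing sketch bits for Jaccard similarity `j`.
//! fn expected_bit_difference(j: f64) -> f64 {
//!     (1.0 - j) / 2.0
//! }
//!
//! /// Standard error of the observed fraction when `dims` bits are used
//! /// (binomial model, assuming independent bits).
//! fn std_error(j: f64, dims: usize) -> f64 {
//!     let p = expected_bit_difference(j);
//!     (p * (1.0 - p) / dims as f64).sqrt()
//! }
//!
//! fn main() {
//!     let j = 0.8;
//!     for &dims in &[64usize, 256, 1024] {
//!         println!("dims = {:4}, std. error = {:.4}", dims, std_error(j, dims));
//!     }
//!     // Quadrupling the number of dimensions halves the standard error.
//!     assert!((2.0 * std_error(j, 256) - std_error(j, 64)).abs() < 1e-12);
//! }
//! ```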
//!
//! # Search steps
//!
//! 1. Extract features from documents
//!    - Set representation of character or word n-grams
//!    - TF-IDF-weighted vector representation of character or word n-grams
//! 2. Convert the features into binary sketches through locality-sensitive hashing
//!    - [1-bit minwise hashing](https://dl.acm.org/doi/abs/10.1145/1772690.1772759) for the Jaccard similarity
//!    - [Simplified simhash](https://dl.acm.org/doi/10.1145/1242572.1242592) for the cosine similarity
//! 3. Search for similar sketches in the Hamming space using a modified variant of the [sketch sorting approach](https://proceedings.mlr.press/v13/tabei10a.html)
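//!
//! The steps above can be sketched end to end for the Jaccard case as follows.
//! This is a minimal illustration using only the standard library, not this crate's
//! API: `char_ngrams`, `seeded_hash`, `minhash_sketch`, and `hamming` are
//! hypothetical helpers, the sketch length is fixed at 64 bits, `DefaultHasher`
//! stands in for the hash family, and step 3 is replaced by a plain pairwise
//! comparison.
//!
//! ```rust
//! use std::collections::hash_map::DefaultHasher;
//! use std::collections::HashSet;
//! use std::hash::{Hash, Hasher};
//!
//! /// Step 1: the set of character n-grams of `text` (`n` must be positive).
//! fn char_ngrams(text: &str, n: usize) -> HashSet<String> {
//!     let chars: Vec<char> = text.chars().collect();
//!     chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
//! }
//!
//! /// One member of a hash family, selected by `seed`.
//! fn seeded_hash(seed: u64, item: &str) -> u64 {
//!     let mut hasher = DefaultHasher::new();
//!     seed.hash(&mut hasher);
//!     item.hash(&mut hasher);
//!     hasher.finish()
//! }
//!
//! /// Step 2: 1-bit minwise hashing. Bit `i` of the sketch is the lowest bit
//! /// of the minimum hash value under the `i`-th hash function.
//! fn minhash_sketch(features: &HashSet<String>) -> u64 {
//!     let mut sketch = 0u64;
//!     for i in 0..64u64 {
//!         let min = features.iter().map(|f| seeded_hash(i, f)).min().unwrap_or(0);
//!         sketch |= (min & 1) << i;
//!     }
//!     sketch
//! }
//!
//! /// Step 3 (stand-in): Hamming distance between two sketches.
//! fn hamming(a: u64, b: u64) -> u32 {
//!     (a ^ b).count_ones()
//! }
//!
//! fn main() {
//!     let docs = ["the quick brown fox", "the quick brown fix", "lorem ipsum dolor"];
//!     let sketches: Vec<u64> = docs
//!         .iter()
//!         .map(|d| minhash_sketch(&char_ngrams(d, 3)))
//!         .collect();
//!     for i in 0..docs.len() {
//!         for j in (i + 1)..docs.len() {
//!             let h = hamming(sketches[i], sketches[j]);
//!             // Under 1-bit minwise hashing, 2 * h / 64 estimates the Jaccard distance.
//!             println!("({}, {}): estimated distance {:.2}", i, j, 2.0 * h as f64 / 64.0);
//!         }
//!     }
//! }
//! ```
//!
//! The printed estimates depend on the hash function, so no particular values
//! should be expected. Identical documents always produce identical sketches.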
#![deny(missing_docs)]

pub mod cosine;
pub mod errors;
pub mod feature;
pub mod jaccard;
pub mod lsh;
pub mod tfidf;

mod shingling;

pub use cosine::CosineSearcher;
pub use jaccard::JaccardSearcher;