Struct JaccardSearcher

pub struct JaccardSearcher { /* private fields */ }

Searcher for all pairs of similar documents in the Jaccard space.

§Approach

The search consists of the following steps:

  1. Extract features from documents, where a feature is a set representation of character or word n-grams.
  2. Convert the features into binary sketches through 1-bit minwise hashing.
  3. Search for similar sketches in the Hamming space using ChunkedJoiner.
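
The three steps above can be illustrated with a minimal, standard-library-only sketch. It is not the crate's implementation: the helper names (`char_ngrams`, `seeded_hash`, `one_bit_minhash`, `hamming`), the use of `DefaultHasher`, and the bit layout are all assumptions made for illustration, and the final comparison is a plain Hamming distance rather than ChunkedJoiner.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Step 1 (illustrative): the set of character n-grams of `text`.
/// Assumes `n > 0`; a text shorter than `n` yields an empty set.
fn char_ngrams(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// Hashes one feature under the hash function identified by `seed`.
/// `DefaultHasher` is an assumption, not the crate's hash function.
fn seeded_hash(feature: &str, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    feature.hash(&mut hasher);
    hasher.finish()
}

/// Step 2 (illustrative): 1-bit minwise hashing. For each of the
/// `num_chunks * 64` hash functions, keep the lowest bit of the minimum hash.
fn one_bit_minhash(features: &HashSet<String>, num_chunks: usize, seed: u64) -> Vec<u64> {
    (0..num_chunks)
        .map(|chunk| {
            (0..64u64).fold(0u64, |word, bit| {
                let fn_id = seed.wrapping_add(chunk as u64 * 64 + bit);
                let min = features.iter().map(|f| seeded_hash(f, fn_id)).min().unwrap_or(0);
                word | ((min & 1) << bit)
            })
        })
        .collect()
}

/// Step 3 building block: Hamming distance between two sketches.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a = char_ngrams("Welcome to Jimbocho, the town of books and curry!", 3);
    let b = char_ngrams("Welcome to Jimbocho, the city of books and curry!", 3);
    let (sa, sb) = (one_bit_minhash(&a, 20, 42), one_bit_minhash(&b, 20, 42));
    let rate = f64::from(hamming(&sa, &sb)) / (20.0 * 64.0);
    println!("normalized Hamming distance: {rate}");
}
```

Because the hash functions differ from the crate's, the distance printed here should not be expected to match the values in the examples on this page.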

§Examples

use find_simdoc::JaccardSearcher;

let documents = vec![
    "Welcome to Jimbocho, the town of books and curry!",
    "Welcome to Jimbocho, the city of books and curry!",
    "We welcome you to Jimbocho, the town of books and curry.",
    "Welcome to the town of books and curry, Jimbocho!",
];

// Creates a searcher for character trigrams (with random seed value 42).
let searcher = JaccardSearcher::new(3, None, Some(42))
    .unwrap()
    // Builds the database of binary sketches converted from input documents,
    // where binary sketches are in the Hamming space of 20*64 dimensions.
    .build_sketches_in_parallel(documents.iter(), 20)
    .unwrap();

// Searches all similar pairs within radius 0.25.
let results = searcher.search_similar_pairs(0.25);
assert_eq!(results, vec![(0, 1, 0.19375), (0, 2, 0.2125), (0, 3, 0.2328125)]);

Implementations§

impl JaccardSearcher

pub fn new( window_size: usize, delimiter: Option<char>, seed: Option<u64>, ) -> Result<Self>

Creates an instance.

§Arguments
  • window_size - Window size for w-shingling in feature extraction (must be greater than 0).
  • delimiter - Delimiter for recognizing words as tokens in feature extraction. If None, characters are used as tokens.
  • seed - Seed value for random values.
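
To make the roles of window_size and delimiter concrete, here is a small sketch of w-shingling over either characters or delimiter-separated words. The `shingles` helper is hypothetical and only mirrors the argument descriptions above; how the crate tokenizes and stores features internally is not shown on this page.

```rust
use std::collections::HashSet;

/// Hypothetical helper: the set of w-shingles of `text`.
/// With `Some(d)`, tokens are the pieces of `text` split at `d` (words);
/// with `None`, tokens are single characters.
fn shingles(text: &str, window_size: usize, delimiter: Option<char>) -> HashSet<String> {
    assert!(window_size > 0, "window_size must be greater than 0");
    let (tokens, sep): (Vec<String>, String) = match delimiter {
        Some(d) => (text.split(d).map(str::to_string).collect(), d.to_string()),
        None => (text.chars().map(String::from).collect(), String::new()),
    };
    // Each window of `window_size` consecutive tokens becomes one feature.
    tokens.windows(window_size).map(|w| w.join(sep.as_str())).collect()
}

fn main() {
    let chars = shingles("abcd", 3, None); // character trigrams
    let words = shingles("to be or not to be", 2, Some(' ')); // word bigrams
    println!("{chars:?}");
    println!("{words:?}");
}
```

Since features form a set, a shingle that occurs twice in a document is counted once.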
Examples found in repository
examples/find_jaccard.rs (line 12)
use find_simdoc::JaccardSearcher;

fn main() {
    let documents = vec![
        "Welcome to Jimbocho, the town of books and curry!",
        "Welcome to Jimbocho, the city of books and curry!",
        "We welcome you to Jimbocho, the town of books and curry.",
        "Welcome to the town of books and curry, Jimbocho!",
    ];

    // Creates a searcher for character trigrams (with random seed value 42).
    let searcher = JaccardSearcher::new(3, None, Some(42))
        .unwrap()
        // Builds the database of binary sketches converted from input documents,
        // where binary sketches are in the Hamming space of 20*64 dimensions.
        .build_sketches_in_parallel(documents.iter(), 20)
        .unwrap();

    // Searches all similar pairs within radius 0.25.
    let results = searcher.search_similar_pairs(0.25);
    assert_eq!(results, vec![(0, 1, 0.1875), (0, 3, 0.2296875)]);
}

pub const fn shows_progress(self, yes: bool) -> Self

Sets whether progress is shown via the standard error output.

pub fn build_sketches<I, D>( self, documents: I, num_chunks: usize, ) -> Result<Self>
where I: IntoIterator<Item = D>, D: AsRef<str>,

Builds the database of sketches from input documents.

§Arguments
  • documents - List of documents (must not include an empty string).
  • num_chunks - Number of chunks of sketches, indicating that the number of dimensions in the Hamming space is num_chunks*64.
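
As a quick check of the num_chunks*64 relationship, the following sketch computes the dimensionality and the raw size of one sketch, assuming each chunk is stored as one packed 64-bit word. The function names are hypothetical, and the figure covers only the sketch payload, not whatever memory_in_bytes reports for the whole database.

```rust
/// Number of Hamming-space dimensions (bits) per sketch.
const fn sketch_dimensions(num_chunks: usize) -> usize {
    num_chunks * 64
}

/// Bytes needed for one sketch, assuming one `u64` word per chunk.
const fn sketch_bytes(num_chunks: usize) -> usize {
    num_chunks * std::mem::size_of::<u64>()
}

fn main() {
    let num_chunks = 20;
    println!(
        "{} chunks -> {} dimensions, {} bytes per sketch",
        num_chunks,
        sketch_dimensions(num_chunks),
        sketch_bytes(num_chunks)
    );
}
```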

pub fn build_sketches_in_parallel<I, D>( self, documents: I, num_chunks: usize, ) -> Result<Self>
where I: Iterator<Item = D> + Send, D: AsRef<str> + Send,

Builds the database of sketches from input documents in parallel.

§Arguments
  • documents - List of documents (must not include an empty string).
  • num_chunks - Number of chunks of sketches, indicating that the number of dimensions in the Hamming space is num_chunks*64.
§Notes

Progress is not printed even if shows_progress(true) has been set.
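
For intuition about building per-document results in parallel while keeping them in input order, here is a sketch using scoped threads from the standard library. This page does not say how the crate parallelizes the work, so `map_in_parallel`, the chunking strategy, and the use of `std::thread::scope` are assumptions; the closure passed in `main` is a placeholder for real sketch construction.

```rust
use std::thread;

/// Applies `f` to every document on up to `num_threads` scoped threads
/// and returns the results in input order.
fn map_in_parallel<T, F>(docs: &[&str], num_threads: usize, f: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    let num_threads = num_threads.max(1);
    // Ceiling division, so every document lands in exactly one chunk.
    let chunk_size = ((docs.len() + num_threads - 1) / num_threads).max(1);
    let f = &f;
    thread::scope(|scope| {
        // Spawn all workers first so that they run concurrently.
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|&d| f(d)).collect::<Vec<T>>()))
            .collect();
        // Joining in spawn order preserves the input order of the results.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<T>>()
    })
}

fn main() {
    let docs = ["Welcome to Jimbocho", "books and curry", "town of books"];
    // Placeholder work: count characters instead of building a sketch.
    let lengths = map_in_parallel(&docs, 2, |d| d.chars().count());
    println!("{lengths:?}");
}
```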

pub fn search_similar_pairs(&self, radius: f64) -> Vec<(usize, usize, f64)>

Searches for all pairs of similar documents within an input radius, returning triplets of the left-side id, the right-side id, and their distance.
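
The shape of the returned triplets can be illustrated with a brute-force stand-in that compares every pair of sketches. This is deliberately not ChunkedJoiner: the quadratic loop, the `hamming_rate` and `similar_pairs` names, the inclusive `<=` comparison, and the use of the normalized Hamming distance as the reported distance are all assumptions for illustration.

```rust
/// Normalized Hamming distance between two equal-length, non-empty sketches.
fn hamming_rate(a: &[u64], b: &[u64]) -> f64 {
    let diff: u32 = a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum();
    f64::from(diff) / (a.len() * 64) as f64
}

/// Brute-force stand-in: all pairs (i, j) with i < j whose distance is
/// within `radius`, as (left id, right id, distance) triplets.
fn similar_pairs(sketches: &[Vec<u64>], radius: f64) -> Vec<(usize, usize, f64)> {
    let mut results = Vec::new();
    for i in 0..sketches.len() {
        for j in i + 1..sketches.len() {
            let dist = hamming_rate(&sketches[i], &sketches[j]);
            if dist <= radius {
                results.push((i, j, dist));
            }
        }
    }
    results
}

fn main() {
    // Three toy one-chunk sketches; only the first two are close.
    let sketches = vec![vec![0u64], vec![1u64], vec![u64::MAX]];
    for (i, j, dist) in similar_pairs(&sketches, 0.25) {
        println!("documents {i} and {j} are within radius: distance {dist}");
    }
}
```

A brute-force scan costs one comparison per pair, which is the cost that an index over sketch chunks is meant to avoid on large collections.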

pub fn len(&self) -> usize

Gets the number of input documents.

pub fn is_empty(&self) -> bool

Checks if the database is empty.

pub fn memory_in_bytes(&self) -> usize

Gets the memory usage in bytes.

pub const fn config(&self) -> &FeatureConfig

Gets the configuration of feature extraction.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V