Module tique::topterms

Extract keywords and search for similar documents based on the contents of your index.

This module implements the same idea as Lucene's MoreLikeThis. You can read more about it in Lucene's documentation, but here is the gist of how it works:

  1. Counts the words (Terms) in an arbitrary input, which may be a string or the address of a document you have already indexed; then

  2. Ranks each word using the frequencies from step 1 and information from the index (how often the word appears in the corpus and how many documents contain it)

The result is a set of terms that best represent your input relative to your current index. In other words, it finds words that are important and distinctive enough to describe your input.
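The two steps can be illustrated with a minimal, standard-library-only sketch. It is not this module's implementation: the names `count_terms` and `top_terms`, the whitespace-style tokenizer, and the TF-IDF formula are all assumptions chosen for illustration, and the scoring used by TopTerms may differ.

```rust
use std::collections::HashMap;

/// Step 1: count how often each word occurs in the input.
/// Splitting on non-alphanumeric characters stands in for a real tokenizer.
fn count_terms(input: &str) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for word in input
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Step 2: rank each word by combining its frequency in the input with
/// how rare it is in the corpus (a hypothetical TF-IDF score).
/// `doc_freq` maps a word to the number of documents that contain it.
fn top_terms(
    input: &str,
    doc_freq: &HashMap<String, u64>,
    num_docs: u64,
    limit: usize,
) -> Vec<(String, f64)> {
    let mut scored: Vec<(String, f64)> = count_terms(input)
        .into_iter()
        .filter_map(|(word, tf)| {
            // Words unknown to the corpus cannot be used to search it.
            let df = *doc_freq.get(&word)?;
            if df == 0 {
                return None;
            }
            // Rare words get a higher inverse document frequency.
            let idf = (num_docs as f64 / df as f64).ln() + 1.0;
            Some((word, tf as f64 * idf))
        })
        .collect();
    // Highest score first; ties are broken alphabetically.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.truncate(limit);
    scored
}
```

With this scoring, a word that is frequent in the input but present in nearly every document (such as "the") ranks below a word that is both frequent in the input and rare in the corpus.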

Examples

Finding Similar Documents

 let topterms = TopTerms::new(&index, vec![body, title])?;
 let keywords = topterms.extract_from_doc(10, doc_address);

 let nearest_neighbors =
      searcher.search(&keywords.into_query(), &TopDocs::with_limit(10))?;

Tuning the Keyword Extraction

Depending on how your fields are indexed, you might find that the results of the keyword extraction are not very good: they may include words that are too uncommon, too short, or otherwise unhelpful. You can change how TopTerms selects keywords by supplying a custom KeywordAcceptor to the extract_filtered and extract_filtered_from_doc methods:

 let topterms = TopTerms::new(&index, vec![fulltext])?;

 let keywords = topterms.extract_filtered(
      10,
      input,
      |term: &Term, term_freq, doc_freq, num_docs| {
          // Only words longer than 4 characters and that appear
          // in at least 10 documents
          term.text().chars().count() > 4 && doc_freq >= 10
      }
 );
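A filter can also combine all four inputs. The following stand-alone sketch shows one possible policy; the function name `accept_keyword`, the integer types, and the use of a plain `&str` in place of a `Term` are assumptions made so the example compiles on its own, and the thresholds are arbitrary.

```rust
/// Hypothetical stand-alone acceptor. The argument types are assumptions,
/// and `text` stands in for the text of the `Term` a real closure receives.
fn accept_keyword(text: &str, term_freq: u32, doc_freq: u64, num_docs: u64) -> bool {
    // Skip very short words, which are often noise.
    let long_enough = text.chars().count() > 4;
    // Require the word to occur at least twice in the input.
    let frequent_in_input = term_freq >= 2;
    // Skip words found in more than half of the corpus (too common).
    let not_too_common = doc_freq.saturating_mul(2) <= num_docs;
    // Skip words found in fewer than 10 documents (too rare).
    let not_too_rare = doc_freq >= 10;
    long_enough && frequent_in_input && not_too_common && not_too_rare
}
```

The same body could be written inline as the closure passed to extract_filtered, reading the word with `term.text()` as in the example above.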

Structs

Keywords

Keywords is a collection of Term objects found via TopTerms

TopTerms

TopTerms extracts the most relevant Keywords from your index

Traits

KeywordAcceptor

Allows tuning the algorithm to pick the top keywords