Module tique::topterms
Extract keywords and search for similar documents based on the contents of your index.
This module implements the same idea as Lucene's MoreLikeThis. You can read more about the idea in the original's documentation, but here's a gist of how it works:
1. Counts the words (Terms) from an arbitrary input, which may be a string or the address of a document you already indexed; then
2. Ranks each word using the frequencies from step 1 and information from the index (how often it appears in the corpus, how many documents have it).
The result is a set of terms that are most relevant to represent your input in relation to your current index. In other words, it finds words that are important and unique enough to describe your input.
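The two steps can be illustrated with a minimal, standard-library-only sketch. It does not use tique or tantivy: `count_terms`, `top_terms`, the `doc_freqs` map, and the tf-idf formula are all assumptions made for illustration, and the scoring this module actually applies may differ.

```rust
use std::collections::HashMap;

/// Step 1: count how often each word occurs in the input.
/// (Hypothetical helper; splits on whitespace and lowercases.)
fn count_terms(input: &str) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for word in input.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Step 2: score each counted word and keep the best `limit`.
/// `doc_freqs` maps a word to the number of documents containing it and
/// `num_docs` is the corpus size; together they stand in for a real index.
/// The score is a common tf-idf variant, assumed here for illustration.
fn top_terms(
    counts: &HashMap<String, u32>,
    doc_freqs: &HashMap<String, u64>,
    num_docs: u64,
    limit: usize,
) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = counts
        .iter()
        .filter_map(|(word, &tf)| {
            // Words the "index" has never seen cannot be ranked against it.
            let df = *doc_freqs.get(word)?;
            if df == 0 {
                return None;
            }
            let idf = (num_docs as f32 / df as f32).ln() + 1.0;
            Some((word.clone(), tf as f32 * idf))
        })
        .collect();
    // Highest score first; ties are broken alphabetically for determinism.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.truncate(limit);
    scored
}
```

Frequent but ubiquitous words (such as "the") receive a low score because their document frequency is close to the corpus size, while words that are repeated in the input and rare in the corpus rise to the top.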
Examples
Finding Similar Documents
let topterms = TopTerms::new(&index, vec![body, title])?;
let keywords = topterms.extract_from_doc(10, doc_address);
let nearest_neighbors = searcher.search(&keywords.into_query(), &TopDocs::with_limit(10))?;
Tuning the Keywords Extraction
Depending on how your fields are indexed, you might find that the results of the keyword extraction are not very good: they may include words that are too uncommon, too short, or otherwise unhelpful. You can modify how TopTerms works via a custom KeywordAcceptor, which you can supply through the extract_filtered and extract_filtered_from_doc methods:
let topterms = TopTerms::new(&index, vec![fulltext])?;
let keywords = topterms.extract_filtered(
    10,
    input,
    |term: &Term, term_freq, doc_freq, num_docs| {
        // Only words longer than 4 characters and that appear
        // in at least 10 documents
        term.text().chars().count() > 4 && doc_freq >= 10
    }
);
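The closure receives four arguments, so a filter can also use the corpus size. The sketch below is a stand-alone predicate rather than a KeywordAcceptor implementation: the function name `accept_keyword`, the plain `&str` in place of a Term, and the integer types are assumptions for illustration. On top of the two rules above, it rejects words that appear in more than half of all documents.

```rust
/// Hypothetical stand-alone predicate mirroring the closure arguments
/// (term text, term frequency, document frequency, corpus size).
/// The argument types are assumptions, not the library's signature.
fn accept_keyword(text: &str, _term_freq: u32, doc_freq: u64, num_docs: u64) -> bool {
    // Same rules as the closure above: longer than 4 characters
    // and present in at least 10 documents.
    let long_enough = text.chars().count() > 4;
    let common_enough = doc_freq >= 10;
    // Extra rule: reject words found in more than half of the corpus.
    let not_ubiquitous = doc_freq * 2 <= num_docs;
    long_enough && common_enough && not_ubiquitous
}
```

Comparing `doc_freq` against `num_docs` keeps the threshold proportional to the index, so the same filter behaves sensibly as the corpus grows.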
Structs
Keywords | Keywords is a collection of Term objects found via TopTerms |
TopTerms | TopTerms extracts the most relevant Keywords from your index |
Traits
KeywordAcceptor | Allows tuning the algorithm to pick the top keywords |