Crate tfidf

Library to calculate TF-IDF (Term Frequency - Inverse Document Frequency) for generic documents. The library provides strategies to act on objects that implement certain document traits (NaiveDocument, ProcessedDocument, ExpandableDocument).

For more information on the strategies that are implemented, see the Wikipedia article on tf-idf.

§Document Types

A document is defined as a collection of terms. The documents don’t make assumptions about the term types (the terms are not normalized in any way).

These document types are of my own design. The terminology isn’t standard, but the types are fairly straightforward to understand.

  • NaiveDocument - A document is ‘naive’ if it only knows whether a term is contained within it, but does not know how many instances of the term it contains.

  • ProcessedDocument - A document is ‘processed’ if it knows how many instances of each term are contained within it.

  • ExpandableDocument - A document is ‘expandable’ if it provides a way to access each term contained within it.
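
To make the distinction between the three kinds concrete, here is a minimal self-contained sketch. The traits Naive, Processed and Expandable, the types Counted and Tokens, and the function process are all hypothetical stand-ins written for this illustration; they are not the crate's actual trait definitions, whose signatures may differ.

```rust
use std::collections::HashMap;

// Hypothetical traits for illustration only; the crate's real traits may differ.

/// 'Naive': can only answer whether a term is present.
trait Naive {
    fn contains(&self, term: &str) -> bool;
}

/// 'Processed': additionally knows how often each term occurs.
trait Processed: Naive {
    fn count(&self, term: &str) -> usize;
}

/// 'Expandable': can list every term occurrence it holds.
trait Expandable {
    fn terms(&self) -> Vec<&str>;
}

/// A processed document stored as a term -> count map.
struct Counted(HashMap<String, usize>);

impl Naive for Counted {
    fn contains(&self, term: &str) -> bool {
        self.0.contains_key(term)
    }
}

impl Processed for Counted {
    fn count(&self, term: &str) -> usize {
        self.0.get(term).copied().unwrap_or(0)
    }
}

/// An expandable document stored as a raw list of tokens.
struct Tokens(Vec<String>);

impl Expandable for Tokens {
    fn terms(&self) -> Vec<&str> {
        self.0.iter().map(|s| s.as_str()).collect()
    }
}

/// Expanding a token list and counting its terms yields a processed document.
fn process(doc: &Tokens) -> Counted {
    let mut counts = HashMap::new();
    for t in doc.terms() {
        *counts.entry(t.to_string()).or_insert(0) += 1;
    }
    Counted(counts)
}
```

In this sketch a processed document is also naive (a count greater than zero implies presence), and an expandable document can be turned into a processed one by counting its terms.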

§Example

The simplest way to calculate the tf-idf of a term in a document is with the default implementation. Note that the library provides an implementation of ProcessedDocument for Vec<(T, usize)>.

use tfidf::{TfIdf, TfIdfDefault};

let mut docs = Vec::new();
let doc1 = vec![("a", 3), ("b", 2), ("c", 4)];
let doc2 = vec![("a", 2), ("d", 5)];

docs.push(doc1);
docs.push(doc2);

assert_eq!(0f64, TfIdfDefault::tfidf("a", &docs[0], docs.iter()));
assert!(TfIdfDefault::tfidf("c", &docs[0], docs.iter()) > 0.5);
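
The zero score for "a" is what an unsmoothed inverse document frequency would produce: a term that occurs in every document has idf = ln(N / N) = 0, so its tf-idf is 0 whatever its term frequency. The sketch below checks this by hand. It assumes the textbook formula ln(N / n_t) from Wikipedia; the helper name plain_idf is hypothetical, and this page does not state which weighting TfIdfDefault actually uses.

```rust
// Textbook inverse document frequency: ln(N / n_t), where N is the number of
// documents and n_t is the number of documents that contain the term.
// Assumption: this is the Wikipedia formula, not code taken from the crate.
// If no document contains the term, the division yields +infinity.
fn plain_idf(term: &str, docs: &[Vec<(&str, usize)>]) -> f64 {
    let total = docs.len() as f64;
    let containing = docs
        .iter()
        .filter(|doc| doc.iter().any(|&(t, _)| t == term))
        .count() as f64;
    (total / containing).ln()
}
```

With the two documents above, "a" appears in both, so plain_idf gives ln(2 / 2) = 0, while "c" appears in only one and gives ln(2).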

You can also define your own tf-idf strategy by combining the tf and idf strategies included in the library.

use tfidf::{TfIdf, ProcessedDocument};
use tfidf::tf::{RawFrequencyTf};
use tfidf::idf::{InverseFrequencySmoothIdf};

#[derive(Copy, Clone)] struct MyTfIdfStrategy;

impl<T> TfIdf<T> for MyTfIdfStrategy where T : ProcessedDocument {
  type Tf = RawFrequencyTf;
  type Idf = InverseFrequencySmoothIdf;
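}

// Same corpus as in the first example.
let mut docs = Vec::new();
let doc1 = vec![("a", 3), ("b", 2), ("c", 4)];
let doc2 = vec![("a", 2), ("d", 5)];

docs.push(doc1);
docs.push(doc2);
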
assert!(MyTfIdfStrategy::tfidf("a", &docs[0], docs.iter()) > 0f64);
assert!(MyTfIdfStrategy::tfidf("c", &docs[0], docs.iter()) > 0f64);
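
Both scores are positive here because a smoothed idf never reaches zero, even for a term found in every document. The following hand-rolled sketch illustrates this under two assumptions: that raw frequency means the stored count, and that smoothing follows one common variant, ln(1 + N / n_t). The function names raw_tf, smooth_idf and tfidf_sketch are hypothetical, and the crate's RawFrequencyTf and InverseFrequencySmoothIdf may use a different formula.

```rust
// Raw term frequency: the stored count of the term, or 0 if it is absent.
fn raw_tf(term: &str, doc: &[(&str, usize)]) -> f64 {
    doc.iter()
        .find(|&&(t, _)| t == term)
        .map(|&(_, n)| n as f64)
        .unwrap_or(0.0)
}

// Smoothed idf, assumed variant: ln(1 + N / n_t).
// N is the number of documents, n_t the number containing the term.
fn smooth_idf(term: &str, docs: &[Vec<(&str, usize)>]) -> f64 {
    let total = docs.len() as f64;
    let containing = docs
        .iter()
        .filter(|doc| doc.iter().any(|&(t, _)| t == term))
        .count() as f64;
    (1.0 + total / containing).ln()
}

// tf-idf as the product of the two factors above.
fn tfidf_sketch(term: &str, doc: &[(&str, usize)], docs: &[Vec<(&str, usize)>]) -> f64 {
    raw_tf(term, doc) * smooth_idf(term, docs)
}
```

Under these assumptions, "a" in the first document scores 3 * ln(1 + 2/2) = 3 ln 2 and "c" scores 4 * ln(1 + 2/1) = 4 ln 3, both greater than zero.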

Modules§

  • idf - Implementations of different weighting schemes for inverse document frequency (idf). For more information about which ones are implemented, check the Wikipedia link in the crate description.
  • tf - Implementations of different weighting schemes for term frequency (tf). For more information about which ones are implemented, check the Wikipedia link in the crate description.

Structs§

Traits§

  • A body of terms.
  • A document that can be expanded to a collection of terms.
  • A strategy to calculate a weighted or unweighted inverse document frequency (idf) for a single term within a corpus of documents.
  • A naive document with a simple function stating whether or not a term exists in the document. The document is naive, which means the frequency of each term has yet to be determined. This type of document is useful for only some tf weighting schemes.
  • A strategy that uses a normalization factor.
  • A document where the frequency of each term is already calculated.
  • A strategy that uses a smoothing factor.
  • A strategy to calculate a weighted or unweighted term frequency (tf) score of a term from a document.
  • Trait to create a strategy to calculate a tf-idf.