ferret 1.1.1

A trigram-based tool for detecting similarity in groups of text documents or program code.

Ferret: Copy-Detection in Text and Code

Ferret is a copy-detection tool that locates duplicate text or code across multiple text documents or source files. It is designed to detect copying (collusion) within a given set of files.

As a library, Ferret can be used to split program code or natural-language text into trigrams, and to compare pairs of documents for similarity.
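To make the idea of a trigram concrete, here is a minimal standalone sketch that does not use the Ferret API. It assumes that tokens are lowercase, whitespace-separated words; Ferret's own tokeniser, especially for program code, may split text differently. The function name `word_trigrams` is hypothetical.

```rust
/// Split text into lowercase word tokens and return every run of three
/// consecutive tokens. Whitespace tokenisation is an assumption made for
/// this sketch; it is not Ferret's own tokeniser.
fn word_trigrams(text: &str) -> Vec<String> {
    let tokens: Vec<String> = text
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();
    // `windows(3)` yields each overlapping slice of three tokens;
    // texts with fewer than three tokens produce no trigrams.
    tokens.windows(3).map(|w| w.join(" ")).collect()
}

fn main() {
    for trigram in word_trigrams("The cat sat on the mat") {
        println!("{trigram}");
    }
}
```

A six-word sentence yields four overlapping trigrams, so even a small edit to a copied passage leaves most of its trigrams intact.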

Features:

  • compares text documents containing natural language or computer language
  • computes a similarity measure based on the trigrams found within pairs of documents
  • many major programming languages are recognised and tokenised appropriately
  • outputs for analysis include:
    • pairwise comparisons ordered by similarity, including trigram counts
    • counts of unique trigrams within each file / group
    • reverse index from trigrams to list of documents they are found in
    • XML detailed comparison of a pair of documents
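The pairwise comparison described above can be sketched as follows. This is an illustration only, not Ferret's implementation: it assumes the similarity measure is a Jaccard-style ratio (trigrams shared by both documents divided by the distinct trigrams found in either), and it tokenises on whitespace. The names `trigram_set`, `similarity`, and `ranked_pairs` are hypothetical.

```rust
use std::collections::HashSet;

/// Collect the distinct word trigrams of a text.
/// Whitespace tokenisation is an assumption of this sketch.
fn trigram_set(text: &str) -> HashSet<String> {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    tokens.windows(3).map(|w| w.join(" ")).collect()
}

/// Jaccard-style ratio: shared trigrams / distinct trigrams in either set.
/// Returns 0.0 when neither set contains a trigram.
fn similarity(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let shared = a.intersection(b).count();
    let total = a.union(b).count();
    if total == 0 {
        0.0
    } else {
        shared as f64 / total as f64
    }
}

/// Compare every pair of (name, text) documents, most similar first.
fn ranked_pairs(docs: &[(&str, &str)]) -> Vec<(String, String, f64)> {
    let sets: Vec<HashSet<String>> =
        docs.iter().map(|(_, text)| trigram_set(text)).collect();
    let mut pairs = Vec::new();
    for i in 0..docs.len() {
        for j in (i + 1)..docs.len() {
            let score = similarity(&sets[i], &sets[j]);
            pairs.push((docs[i].0.to_string(), docs[j].0.to_string(), score));
        }
    }
    // Sort in descending order of similarity.
    pairs.sort_by(|x, y| y.2.total_cmp(&x.2));
    pairs
}

fn main() {
    let docs = [
        ("a.txt", "the cat sat on the mat"),
        ("b.txt", "the cat sat on the rug"),
        ("c.txt", "a completely different sentence here"),
    ];
    for (first, second, score) in ranked_pairs(&docs) {
        println!("{first} {second} {score:.3}");
    }
}
```

Working with sets of distinct trigrams means a repeated phrase counts once, so the score reflects how much of the vocabulary of trigrams two documents share rather than how long they are.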

Command line use

$ ferret --help
Usage: ferret [-ghluvx] filename [filenames...]
 -g, --group       Use subdirectory names to group files
 -h, --help        Show help information
 -l, --list-trigrams
                   Output list of trigrams found
 -u, --unique-counts
                   Output counts of unique trigrams
 -v, --version     Version number
 -x, --xml-report  filename1 filename2 outfile : Create XML report

Library use

Take some files and find the two most similar:

use ferret::documents::Documents;

fn main() {
    // Paths of the documents to compare.
    let files = ["txt1.txt".to_string(), "txt2.txt".to_string(), "txt3.txt".to_string()];
    let docs = Documents::new(&files[..]);
    // Pairwise comparisons, ordered by similarity.
    let results = docs.sorted_results(false);
    println!("Most similar pair: {}", results[0]);
}

Take a file, and read it trigram-by-trigram:

use ferret::trigram_reader::TrigramReader;
use std::path::PathBuf;

fn main() {
    let path = PathBuf::from(r"test.rb");
    let mut reader = TrigramReader::new(&path);

    // Advance one trigram at a time until the file is exhausted.
    while reader.read_trigram() {
        println!("Trigram {}", reader.last_trigram());
    }
}