Ferret: Copy-Detection in Text and Code
Ferret is a copy-detection tool that locates duplicate text or code across multiple text documents or source files. It is designed to detect copying (collusion) within a given set of files.
As a library, Ferret can be used to analyse program code or natural language texts into trigrams, and compare pairs of documents for similarity.
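To make the idea of trigram-based comparison concrete, here is a minimal standalone sketch that does not use Ferret's own API. It assumes word-level trigrams, a simple tokeniser that splits on non-alphanumeric characters, and the Jaccard coefficient (shared trigrams divided by all distinct trigrams) as the similarity measure. The function names `tokenise`, `trigrams` and `similarity` are hypothetical, and Ferret's actual tokenisation and scoring may differ.

```rust
use std::collections::HashSet;

/// Split text into lowercase word tokens.
/// Assumption: a deliberately simple tokeniser, not Ferret's own.
fn tokenise(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Collect the set of distinct word trigrams in a text.
fn trigrams(text: &str) -> HashSet<(String, String, String)> {
    let tokens = tokenise(text);
    tokens
        .windows(3) // every run of three consecutive tokens
        .map(|w| (w[0].clone(), w[1].clone(), w[2].clone()))
        .collect()
}

/// Jaccard similarity of two texts over their trigram sets:
/// |A ∩ B| / |A ∪ B|, or 0.0 when neither text has any trigram.
/// Assumption: chosen for illustration; Ferret's measure may differ.
fn similarity(a: &str, b: &str) -> f64 {
    let ta = trigrams(a);
    let tb = trigrams(b);
    let shared = ta.intersection(&tb).count();
    let total = ta.union(&tb).count();
    if total == 0 {
        0.0
    } else {
        shared as f64 / total as f64
    }
}
```

Calling `similarity("the cat sat on the mat", "the cat sat on a mat")` compares four trigrams from each text, two of which are shared, so the score under these assumptions is 2/6. Using sets rather than counts means a trigram repeated many times in one document contributes only once.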
Features:
- compares text documents containing natural language or computer language
- computes a similarity measure based on the trigrams found within pairs of documents
- many major programming languages are recognised and tokenised appropriately
- outputs for analysis include:
  - pairwise comparisons ordered by similarity, including trigram counts
  - counts of unique trigrams within each file / group
  - reverse index from trigrams to list of documents they are found in
  - XML detailed comparison of a pair of documents
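The reverse index and unique-trigram counts listed above can be sketched in a few lines. This is an illustration only and not Ferret's implementation: it assumes whitespace-separated word trigrams, takes "unique" to mean a trigram that occurs in exactly one document, and uses the hypothetical names `word_trigrams`, `reverse_index` and `unique_counts`.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Trigram = (String, String, String);

/// Distinct word trigrams of a text (assumption: whitespace tokenisation).
fn word_trigrams(text: &str) -> BTreeSet<Trigram> {
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    words
        .windows(3)
        .map(|w| (w[0].clone(), w[1].clone(), w[2].clone()))
        .collect()
}

/// Reverse index: each trigram maps to the names of the documents containing it.
/// `docs` is a slice of (document name, document text) pairs.
fn reverse_index(docs: &[(&str, &str)]) -> BTreeMap<Trigram, BTreeSet<String>> {
    let mut index: BTreeMap<Trigram, BTreeSet<String>> = BTreeMap::new();
    for (name, text) in docs {
        for trigram in word_trigrams(text) {
            index.entry(trigram).or_default().insert(name.to_string());
        }
    }
    index
}

/// For each document, count the trigrams that appear in no other document.
/// Documents with no such trigram are absent from the result.
fn unique_counts(index: &BTreeMap<Trigram, BTreeSet<String>>) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for names in index.values() {
        if names.len() == 1 {
            let name = names.iter().next().unwrap().clone();
            *counts.entry(name).or_insert(0) += 1;
        }
    }
    counts
}
```

Building the index once makes both outputs cheap: the documents sharing a given trigram are a single lookup, and the unique counts fall out of one pass over the index. Ordered maps are used here only so that the output order is stable.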
Command line use
$ ferret --help
Usage: ferret [-ghluvx] filename [filenames...]
    -g, --group             Use subdirectory names to group files
    -h, --help              Show help information
    -l, --list-trigrams     Output list of trigrams found
    -u, --unique-counts     Output counts of unique trigrams
    -v, --version           Version number
    -x, --xml-report filename1 filename2 outfile : Create XML report
Library use
Take some files and find the two most similar:
use Documents;
Take a file, and read it trigram-by-trigram:
use TrigramReader;
use PathBuf;