perceptual 0.1.0

UCFP perceptual fingerprinting (text shingling, winnowing, MinHash) crate

UCFP Perceptual Fingerprinting

This crate handles the "what does it look like" part of fingerprinting. Given canonical tokens, it produces a compact signature that preserves similarity, so near-duplicate inputs yield similar fingerprints even when they are not identical.

What you need to know

  • We only take canonical tokens. Don't send us raw text or ingest metadata.
  • Pure function: the same tokens and config always produce the same fingerprint. No I/O, no network, no randomness.

The pipeline (three stages)

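  1. Shingling - Break the token stream into overlapping windows of k consecutive tokens and hash each window to a 64-bit value. This captures local structure such as word order.

  2. Winnowing - Slide a window over the shingle hashes and keep the minimum hash from each window. This shrinks the data that the next stage has to process. It is only an optimization, not the actual LSH step.

  3. MinHash - The actual locality-sensitive hashing step. It produces a fixed-size signature; the fraction of positions at which two signatures agree estimates the Jaccard similarity of the underlying hash sets.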
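
To make the three stages concrete, here is a minimal standard-library sketch. It is not this crate's implementation: the function names (hash_with_seed, shingle_hashes, winnow, minhash), the use of DefaultHasher, the winnowing window parameter w, and the seeding scheme are all assumptions chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash a value together with a seed.
/// `DefaultHasher::new()` starts from fixed keys, so results are
/// repeatable for a given Rust toolchain (not guaranteed across releases).
fn hash_with_seed<T: Hash + ?Sized>(value: &T, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

/// Stage 1 (shingling): one 64-bit hash per window of `k` consecutive tokens.
fn shingle_hashes(tokens: &[&str], k: usize) -> Vec<u64> {
    if k == 0 || tokens.len() < k {
        return Vec::new();
    }
    tokens.windows(k).map(|w| hash_with_seed(w, 0)).collect()
}

/// Stage 2 (winnowing): keep the minimum hash of each window of `w`
/// shingle hashes, recording a position only once when consecutive
/// windows select the same element.
fn winnow(hashes: &[u64], w: usize) -> Vec<u64> {
    if w == 0 || hashes.len() < w {
        // Too short to form a window: keep everything (an assumption).
        return hashes.to_vec();
    }
    let mut picked = Vec::new();
    let mut last_pos: Option<usize> = None;
    for (start, window) in hashes.windows(w).enumerate() {
        let (offset, &min) = window
            .iter()
            .enumerate()
            .min_by_key(|&(_, &h)| h)
            .unwrap(); // safe: every window has w >= 1 elements
        let pos = start + offset;
        if last_pos != Some(pos) {
            picked.push(min);
            last_pos = Some(pos);
        }
    }
    picked
}

/// Stage 3 (MinHash): for each of `num_hashes` seeds, re-hash every input
/// value with that seed and keep the smallest result.
fn minhash(hashes: &[u64], num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            hashes
                .iter()
                .map(|h| hash_with_seed(h, seed + 1))
                .min()
                .unwrap_or(u64::MAX) // empty input: sentinel value
        })
        .collect()
}
```

Chaining the calls as minhash(&winnow(&shingle_hashes(&tokens, 3), 4), 16) yields a 16-slot signature. Because every step is a pure function of its arguments, repeating the call with the same tokens gives the same signature.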

Quick example

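use perceptual::{perceptualize_tokens, PerceptualConfig};

// Input must already be canonical tokens, not raw text.
let tokens = vec!["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];

// Shingle over windows of 3 tokens; keep the defaults for everything else.
let config = PerceptualConfig {
    k: 3,
    ..Default::default()
};

// Run the full pipeline: shingling, winnowing, MinHash.
let fingerprint = perceptualize_tokens(&tokens, &config).unwrap();

// The signature is populated and the metadata records the k that was used.
assert!(!fingerprint.minhash.is_empty());
assert_eq!(fingerprint.meta.k, 3);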
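
Once you have two fingerprints, comparing them comes down to counting matching signature slots. The sketch below assumes the minhash field can be viewed as a slice of u64 values and that both signatures were produced with the same config. The estimate_jaccard helper is hypothetical and not part of this crate's API.

```rust
/// Hypothetical helper: estimate Jaccard similarity as the fraction of
/// positions at which two equal-length MinHash signatures agree.
/// Returns `None` when the signatures are empty or differ in length,
/// since they are not comparable in that case.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    Some(matches as f64 / a.len() as f64)
}
```

Identical signatures score 1.0, and signatures that agree in half of their slots score 0.5. The value is an estimate, and its variance shrinks as the signature length grows.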