UCFP Perceptual Fingerprinting
This crate handles the "what does it look like" part of fingerprinting. Given canonical tokens, it produces a compact signature that preserves similarity: near-duplicate inputs yield similar fingerprints even when the inputs are not identical.
What you need to know
- We only take canonical tokens. Don't send us raw text or ingest metadata.
- Pure function: same input = same output. No I/O, no network, no randomness.
The pipeline (three stages)
- Shingling - Break tokens into overlapping windows of k tokens, hash each window to a 64-bit value. Captures local structure.
- Winnowing - Pick the minimum hash from each sliding window. Reduces data size. This is just an optimization, not the actual LSH step.
- MinHash - The real locality-sensitive hashing magic. Produces a fixed-size signature you can compare for Jaccard similarity.
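To make the three stages concrete, here is a minimal sketch using only the standard library. It is not this crate's implementation: the function names (`hash64`, `shingle`, `winnow`, `minhash`, `estimate_jaccard`) and the parameters `k`, `w`, and `num_hashes` are hypothetical, and `DefaultHasher` is a stand-in whose output is not guaranteed to be stable across Rust releases, so a real fingerprint would want a fixed, documented hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash any hashable value to 64 bits, mixing in a seed.
/// Assumption: DefaultHasher stands in for whatever stable hash the crate uses.
fn hash64<T: Hash + ?Sized>(value: &T, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

/// Stage 1 (shingling): hash every window of `k` consecutive tokens.
/// Returns an empty vector when there are fewer than `k` tokens.
fn shingle(tokens: &[&str], k: usize) -> Vec<u64> {
    if k == 0 || tokens.len() < k {
        return Vec::new();
    }
    tokens.windows(k).map(|window| hash64(window, 0)).collect()
}

/// Stage 2 (winnowing): keep the minimum hash of each window of `w`
/// shingle hashes. A position is recorded only once, even when it is
/// the minimum of several consecutive windows. Ties pick the leftmost.
fn winnow(hashes: &[u64], w: usize) -> Vec<u64> {
    if w == 0 || hashes.is_empty() {
        return Vec::new();
    }
    if hashes.len() < w {
        // Too short for a full window: keep the single global minimum.
        return vec![*hashes.iter().min().unwrap()];
    }
    let mut picked = Vec::new();
    let mut last_pos: Option<usize> = None;
    for (start, window) in hashes.windows(w).enumerate() {
        let (offset, &min) = window
            .iter()
            .enumerate()
            .min_by_key(|&(_, h)| *h)
            .unwrap();
        let pos = start + offset;
        if last_pos != Some(pos) {
            picked.push(min);
            last_pos = Some(pos);
        }
    }
    picked
}

/// Stage 3 (MinHash): for each of `num_hashes` seeded hash functions,
/// keep the smallest re-hashed value. The result has a fixed length.
fn minhash(hashes: &[u64], num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            hashes
                .iter()
                .map(|h| hash64(h, seed + 1))
                .min()
                .unwrap_or(u64::MAX) // empty input yields a sentinel value
        })
        .collect()
}

/// Estimate Jaccard similarity as the fraction of matching signature slots.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Chaining the stages as `minhash(&winnow(&shingle(&tokens, k), w), num_hashes)` gives a signature whose length depends only on `num_hashes`, which is what makes signatures from documents of different lengths directly comparable. Every step is a pure function of its arguments, matching the determinism requirement above.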
Quick example
use ;
let tokens = vec!;
let config = PerceptualConfig ;
let fingerprint = perceptualize_tokens.unwrap;
assert!;
assert_eq!;