UCFP Perceptual Fingerprinting
This crate handles the “what does it look like” part of fingerprinting. Given canonical tokens, it produces a compact signature that captures similarity, so near-duplicates get similar fingerprints even when they are not identical.
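To make "similar fingerprints" concrete: two MinHash signatures of equal length are usually compared slot by slot, and the fraction of matching slots estimates the Jaccard similarity of the underlying shingle sets. The sketch below assumes a signature is a plain slice of `u64` values. `estimate_jaccard` is a hypothetical helper written for illustration, not a function this crate is known to export.

```rust
/// Estimate Jaccard similarity as the fraction of signature slots that agree.
/// Hypothetical helper for illustration; not part of the crate's documented API.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None; // signatures must be non-empty and the same length
    }
    let matching = a.iter().zip(b).filter(|(x, y)| x == y).count();
    Some(matching as f64 / a.len() as f64)
}

fn main() {
    // Made-up signatures that differ in one of four slots.
    let sig_a = [11u64, 42, 7, 99];
    let sig_b = [11u64, 42, 8, 99];
    let estimate = estimate_jaccard(&sig_a, &sig_b).unwrap();
    println!("estimated Jaccard: {estimate}"); // prints "estimated Jaccard: 0.75"
}
```

Longer signatures give a tighter estimate, at the cost of more storage per fingerprint.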
§What you need to know
- We only take canonical tokens. Don’t send us raw text or ingest metadata.
- Pure function: same input = same output. No I/O, no network, no randomness.
§The pipeline (three stages)
- Shingling - Break tokens into overlapping windows of k tokens, hash each window to a 64-bit value. Captures local structure.
- Winnowing - Pick the minimum hash from each sliding window. Reduces data size. This is just an optimization, not the actual LSH step.
- MinHash - The real locality-sensitive hashing magic. Produces a fixed-size signature you can compare for Jaccard similarity.
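The three stages can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the crate's implementation: it hashes with `DefaultHasher` (the crate's actual hash function is not specified here), and the winnowing window `w` and signature length `num_hashes` are made-up parameter names. Only `k` appears in the documented config.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a value together with a seed. DefaultHasher is an assumption made for
/// this sketch; its output is not guaranteed stable across Rust releases.
fn hash_with_seed<T: Hash + ?Sized>(value: &T, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

/// Stage 1: hash every overlapping window of `k` tokens to a 64-bit value.
fn shingle(tokens: &[&str], k: usize) -> Vec<u64> {
    if k == 0 || tokens.len() < k {
        return Vec::new();
    }
    tokens.windows(k).map(|w| hash_with_seed(w, 0)).collect()
}

/// Stage 2: keep the minimum hash of each window of `w` consecutive shingles,
/// recording a given position only once.
fn winnow(hashes: &[u64], w: usize) -> Vec<u64> {
    if w == 0 || hashes.len() < w {
        return hashes.to_vec(); // too short to winnow; keep everything
    }
    let mut picked = Vec::new();
    let mut last_pos = None;
    for (start, window) in hashes.windows(w).enumerate() {
        // Smallest hash in this window (leftmost on ties).
        let (offset, &min) = window
            .iter()
            .enumerate()
            .min_by_key(|&(_, h)| *h)
            .expect("window is non-empty");
        let pos = start + offset;
        if last_pos != Some(pos) {
            picked.push(min);
            last_pos = Some(pos);
        }
    }
    picked
}

/// Stage 3: for each seed, record the smallest re-hashed shingle value.
fn minhash(shingles: &[u64], num_hashes: usize) -> Vec<u64> {
    if shingles.is_empty() {
        return Vec::new();
    }
    (0..num_hashes as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| hash_with_seed(s, seed + 1))
                .min()
                .unwrap() // safe: shingles is non-empty
        })
        .collect()
}

fn main() {
    let tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];
    let shingles = shingle(&tokens, 3);
    let winnowed = winnow(&shingles, 4);
    let signature = minhash(&winnowed, 16);
    println!(
        "{} shingles -> {} winnowed -> {} signature slots",
        shingles.len(),
        winnowed.len(),
        signature.len()
    );
}
```

Note that only the last stage fixes the output size: shingling and winnowing both produce output that grows with the input, while the signature always has `num_hashes` slots.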
§Quick example
use perceptual::{perceptualize_tokens, PerceptualConfig};
let tokens = vec!["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];
let config = PerceptualConfig {
    k: 3,
    ..Default::default()
};
let fingerprint = perceptualize_tokens(&tokens, &config).unwrap();
assert!(!fingerprint.minhash.is_empty());
assert_eq!(fingerprint.meta.k, 3);
Re-exports§
pub use crate::config::PerceptualConfig;
pub use crate::config::PerceptualError;
pub use crate::fingerprint::PerceptualFingerprint;
pub use crate::fingerprint::PerceptualMeta;
pub use crate::fingerprint::WinnowedShingle;
Modules§
- config
- Configuration and error types for UCFP perceptual fingerprinting.
- fingerprint
- Fingerprint and metadata types for UCFP perceptual layer.
Constants§
- PERCEPTUAL_ALGORITHM
- Human‑readable algorithm identifier.
- PERCEPTUAL_VERSION
- Current perceptual algorithm version for this crate.
Functions§
- perceptualize_tokens
- Compute a perceptual fingerprint (shingles → winnow → MinHash).