Crate perceptual

UCFP Perceptual Fingerprinting

This crate handles the “what does it look like” part of fingerprinting. Given canonical tokens, it produces a compact signature that preserves similarity, so near-duplicate inputs get similar fingerprints even when they are not identical.

§What you need to know

  • We only take canonical tokens. Don’t send us raw text or ingest metadata.
  • Pure function: the same input always produces the same output. No I/O, no network, no randomness.

§The pipeline (three stages)

  1. Shingling - Break the token stream into overlapping windows of k tokens and hash each window to a 64-bit value. This captures local structure.

  2. Winnowing - Keep the minimum hash from each sliding window over the shingle hashes, which reduces the amount of data passed on. This is only an optimization, not the LSH step itself.

  3. MinHash - The actual locality-sensitive hashing step. Produces a fixed-size signature; comparing two signatures estimates the Jaccard similarity of their inputs.
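
The three stages can be sketched with the standard library alone. This is an illustration of the general technique, not this crate's implementation: the function names (`shingle_hashes`, `winnow`, `minhash`, `mix`), the winnowing window parameter `w`, the signature length `m`, and the choice of hash functions are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stage 1 (hypothetical helper): hash every window of `k` consecutive tokens.
/// `DefaultHasher::new()` uses fixed keys, so results repeat across runs, but
/// its algorithm is not guaranteed stable across Rust releases.
fn shingle_hashes(tokens: &[&str], k: usize) -> Vec<u64> {
    if k == 0 || tokens.len() < k {
        return Vec::new();
    }
    tokens
        .windows(k)
        .map(|window| {
            let mut hasher = DefaultHasher::new();
            window.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

/// Stage 2 (hypothetical helper): keep the minimum hash of each window of `w`
/// shingle hashes, recording a given position only once.
fn winnow(hashes: &[u64], w: usize) -> Vec<u64> {
    if w == 0 || hashes.is_empty() {
        return Vec::new();
    }
    // Short inputs are treated as a single window.
    let w = w.min(hashes.len());
    let mut picked = Vec::new();
    let mut last_pos: Option<usize> = None;
    for (start, window) in hashes.windows(w).enumerate() {
        // The window is non-empty because w >= 1, so a minimum always exists.
        let (offset, &min) = window
            .iter()
            .enumerate()
            .min_by_key(|&(_, h)| *h)
            .unwrap();
        let pos = start + offset;
        if last_pos != Some(pos) {
            picked.push(min);
            last_pos = Some(pos);
        }
    }
    picked
}

/// Seeded 64-bit mixer (splitmix64 finalizer); an assumed stand-in for a
/// family of independent hash functions.
fn mix(value: u64, seed: u64) -> u64 {
    let mut z = value ^ seed.wrapping_add(1).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Stage 3 (hypothetical helper): for each of `m` seeded hash functions, keep
/// the smallest hashed value. Empty input yields `u64::MAX` in every slot.
fn minhash(values: &[u64], m: usize) -> Vec<u64> {
    (0..m as u64)
        .map(|seed| {
            values
                .iter()
                .map(|&v| mix(v, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}
```

Chaining the helpers as `minhash(&winnow(&shingle_hashes(&tokens, 3), 4), 16)` gives a 16-slot signature. The output length depends only on `m`, not on the number of tokens, which is what makes signatures from different inputs directly comparable.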

§Quick example

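use perceptual::{perceptualize_tokens, PerceptualConfig};

// Tokens must already be canonical; this crate does not accept raw text.
let tokens = vec!["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];

// Use 3-token shingles and keep the defaults for everything else.
let config = PerceptualConfig {
    k: 3,
    ..Default::default()
};

let fingerprint = perceptualize_tokens(&tokens, &config).unwrap();

// The MinHash signature is populated, and the metadata records the k used.
assert!(!fingerprint.minhash.is_empty());
assert_eq!(fingerprint.meta.k, 3);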
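
The example stops at producing one fingerprint. A common way to compare two MinHash signatures is to count the positions where they agree; the sketch below does that. It assumes a signature can be viewed as a slice of `u64` values of equal length, which this page does not confirm, and `estimate_jaccard` is a hypothetical helper, not part of this crate's API.

```rust
/// Hypothetical helper: estimate Jaccard similarity as the fraction of
/// positions at which two equal-length MinHash signatures agree.
/// Returns `None` if the signatures are empty or differ in length.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    Some(matches as f64 / a.len() as f64)
}
```

The estimate is only meaningful when both signatures were produced with the same configuration, since slot i of each signature must come from the same hash function.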

Re-exports§

pub use crate::config::PerceptualConfig;
pub use crate::config::PerceptualError;
pub use crate::fingerprint::PerceptualFingerprint;
pub use crate::fingerprint::PerceptualMeta;
pub use crate::fingerprint::WinnowedShingle;

Modules§

config
Configuration and error types for UCFP perceptual fingerprinting.
fingerprint
Fingerprint and metadata types for UCFP perceptual layer.

Constants§

PERCEPTUAL_ALGORITHM
Human‑readable algorithm identifier.
PERCEPTUAL_VERSION
Current perceptual algorithm version for this crate.

Functions§

perceptualize_tokens
Compute a perceptual fingerprint (shingles → winnow → MinHash).