Universal Content Fingerprinting (UCFP)

Deterministic, reproducible content fingerprints for text, audio, image, video, and documents

UCFP is a Rust framework that unifies exact hashing, perceptual similarity, and semantic embeddings into a single pipeline.

Cryptographic hashes change completely when content changes even slightly, and semantic search needs more than byte matching. UCFP provides both exact matches and meaning-based similarity in one deterministic pipeline.

Deduplication — Find exact and near-duplicate content
Plagiarism Detection — Identify paraphrased text
Content Provenance — Track content across systems
Similarity Search — Search by meaning, not just keywords
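
To make the contrast between exact and near-duplicate matching concrete, here is a minimal std-only sketch. It is not UCFP's API: `word_shingles` and `jaccard` are hypothetical helper names, and word bigrams stand in for whatever shingling the crate actually uses.

```rust
use std::collections::HashSet;

/// Split text into the set of overlapping word n-grams ("shingles").
/// Hypothetical helper for illustration; `n` must be at least 1.
fn word_shingles(text: &str, n: usize) -> HashSet<String> {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    if words.len() < n {
        // Too short for a full shingle: treat the whole text as one shingle.
        return std::iter::once(words.join(" ")).collect();
    }
    words.windows(n).map(|w| w.join(" ")).collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

fn main() {
    let original = "the quick brown fox jumps over the lazy dog";
    let edited = "the quick brown fox leaps over the lazy dog";

    // Byte-for-byte comparison (what an exact hash captures) sees two different documents.
    println!("exact match: {}", original == edited);

    // Shingle overlap still shows that most of the content is shared.
    let sim = jaccard(&word_shingles(original, 2), &word_shingles(edited, 2));
    println!("jaccard similarity: {sim:.2}");
}
```

Changing one word leaves 6 of the 10 distinct bigrams shared, so the similarity is 0.6 even though the exact comparison fails.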

Quickstart

Prerequisites: Rust 1.76+ (rustup toolchain install stable)

# Build & test
cargo test --workspace

# Run examples
cargo run --example full_pipeline          # complete pipeline
cargo run --example pipeline_metrics       # with observability
cargo run --package perceptual --example fingerprint_demo

Usage

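use ucfp::{
    CanonicalizeConfig, IngestConfig, IngestPayload, IngestSource,
    PerceptualConfig, PipelineStageConfig, RawIngestRecord, process_pipeline,
};

// `?` needs an enclosing function that returns `Result`; this assumes ucfp's
// error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let record = RawIngestRecord {
        id: "demo".into(),
        source: IngestSource::RawText,
        payload: Some(IngestPayload::Text("Hello world".into())),
        ..Default::default()
    };

    // Run ingest, canonicalization, and the perceptual stage; no semantic config is passed.
    let (doc, fingerprint, _) = process_pipeline(
        record,
        PipelineStageConfig::Perceptual,
        &IngestConfig::default(),
        &CanonicalizeConfig::default(),
        Some(&PerceptualConfig::default()),
        None,
    )?;

    println!("Canonical hash: {}", doc.canonical_hash);
    // The fingerprint is optional, so avoid panicking if the stage produced none.
    if let Some(fp) = fingerprint {
        println!("MinHash bands: {}", fp.minhash_bands.len());
    }
    Ok(())
}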
See examples/ for full pipeline demonstrations.

Full Pipeline Example

Complete workflow from ingest to matching:

use ucfp::{
    CanonicalizeConfig, IngestConfig, IngestMetadata, IngestPayload, IngestSource,
    PerceptualConfig, RawIngestRecord, SemanticConfig, PipelineStageConfig,
    process_pipeline,
};
use ucfp_index::{BackendConfig, IndexConfig, IndexRecord, UfpIndex};
use ucfp_matcher::{Matcher, MatchConfig, MatchRequest};

// 1. Configure all stages
let ingest_cfg = IngestConfig::default();
let canonical_cfg = CanonicalizeConfig::default();
let perceptual_cfg = PerceptualConfig::default();
let semantic_cfg = SemanticConfig::default();

// 2. Create index
let index_cfg = IndexConfig::new().with_backend(BackendConfig::InMemory);
let index = UfpIndex::new(index_cfg).unwrap();

// 3. Ingest a document
let record = RawIngestRecord {
    id: "doc-001".into(),
    source: IngestSource::RawText,
    metadata: IngestMetadata {
        tenant_id: Some("tenant-a".to_string()),
        doc_id: Some("my-doc".to_string()),
        ..Default::default()
    },
    payload: Some(IngestPayload::Text("Rust memory safety features".into())),
};

// 4. Process through pipeline (ingest -> canonical -> perceptual -> semantic)
let (doc, fingerprint, embedding) = process_pipeline(
    record,
    PipelineStageConfig::Perceptual,
    &ingest_cfg,
    &canonical_cfg,
    Some(&perceptual_cfg),
    Some(&semantic_cfg),
)?;

// 5. Store in index
let record = IndexRecord {
    doc_id: doc.doc_id.clone(),
    tenant_id: "tenant-a".to_string(),
    canonical_hash: doc.canonical_hash.clone(),
    // Assumed to be `Option`s already (the Usage example unwraps `fingerprint`),
    // so they are passed through without another `Some(...)`.
    perceptual_fingerprint: fingerprint,
    semantic_embedding: embedding,
    ..Default::default()
};
index.upsert(record)?;

// 6. Search with matcher
let matcher = Matcher::new(
    index,
    ingest_cfg,
    canonical_cfg,
    perceptual_cfg,
    semantic_cfg,
);

let req = MatchRequest {
    tenant_id: "tenant-a".to_string(),
    query_text: "Rust safety".to_string(),
    config: MatchConfig::default(),
    ..Default::default()
};

let hits = matcher.match_document(&req)?;
println!("Found {} matches", hits.len());

Architecture

| Stage | Responsibility | Key Types |
| --- | --- | --- |
| `ingest` | Validation, metadata normalization | `RawIngestRecord`, `CanonicalIngestRecord` |
| `canonical` | Unicode NFKC normalization, SHA-256 hashing | `CanonicalizedDocument` |
| `perceptual` | Rolling-hash shingles, winnowing, MinHash LSH | `PerceptualFingerprint` |
| `semantic` | Dense embeddings via ONNX | `SemanticEmbedding` |
| `index` | Storage with HNSW ANN search | `UfpIndex`, `QueryResult` |
| `match` | Query-time matching | `Matcher`, `MatchResult` |
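
The perceptual stage lists MinHash LSH. As a rough illustration of that technique, the std-only sketch below builds a MinHash signature and splits it into LSH bands. Everything here is an assumption for illustration: the function names, the FNV-1a and SplitMix64 hashes, and the 64-hash signature are not taken from UCFP's implementation.

```rust
/// FNV-1a, a simple deterministic 64-bit hash (a stand-in, not UCFP's hash).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// SplitMix64 finalizer, used to derive many hash functions from one base hash.
fn mix(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9e3779b97f4a7c15);
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// MinHash signature: for each seeded hash function, keep the minimum
/// hash value over all shingles.
fn minhash(shingles: &[&str], num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| mix(fnv1a(s.as_bytes()) ^ mix(seed)))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Split a signature into `bands` equal groups and hash each group to a bucket key.
fn lsh_bands(signature: &[u64], bands: usize) -> Vec<u64> {
    assert!(bands > 0 && signature.len() >= bands);
    let rows = signature.len() / bands;
    signature
        .chunks(rows)
        .take(bands)
        .map(|band| band.iter().fold(0u64, |acc, &v| mix(acc ^ v)))
        .collect()
}

/// Fraction of matching signature positions, an estimate of Jaccard similarity.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let n = a.len().min(b.len());
    if n == 0 {
        return 0.0;
    }
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / n as f64
}

fn main() {
    let a = ["rust memory", "memory safety", "safety features", "features explained"];
    let b = ["rust memory", "memory safety", "safety features", "features overview"];
    let (sig_a, sig_b) = (minhash(&a, 64), minhash(&b, 64));
    println!("estimated similarity: {:.2}", estimate_jaccard(&sig_a, &sig_b));

    // Documents that share at least one band key become candidate matches.
    let (bands_a, bands_b) = (lsh_bands(&sig_a, 16), lsh_bands(&sig_b, 16));
    let shared = bands_a.iter().zip(&bands_b).filter(|(x, y)| x == y).count();
    println!("shared LSH bands: {shared} of 16");
}
```

Banding trades precision for recall: more bands with fewer rows each make it likelier that two similar documents collide in at least one bucket.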

[Figure: UCFP architecture diagram]

Configuration

version: "1.0"

ingest:
  default_tenant_id: "acme-corp"
  max_payload_bytes: 10485760

canonical:
  normalize_unicode: true
  lowercase: true

perceptual:
  k: 9              # shingle size
  w: 4              # winnow window
  minhash_bands: 16

semantic:
  tier: "balanced"
  enable_chunking: true  # For documents > 512 tokens

index:
  backend: "redb"
  ann:
    enabled: true
    min_vectors_for_ann: 1000

Load in code:

use ucfp::config::UcfpConfig;
let config = UcfpConfig::from_file("config.yaml")?;
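
To show what the perceptual settings above control, here is a hedged std-only sketch of rolling-hash shingling followed by winnowing, using `k = 9` and `w = 4` from the sample configuration. The function names, the polynomial hash with base 257, and byte-level shingles are illustrative assumptions, not UCFP's actual implementation.

```rust
/// Polynomial rolling hash of every k-byte shingle, computed in O(n) total
/// using wrapping arithmetic modulo 2^64.
fn rolling_hashes(bytes: &[u8], k: usize) -> Vec<u64> {
    const BASE: u64 = 257;
    if k == 0 || bytes.len() < k {
        return Vec::new();
    }
    // BASE^(k-1), the weight of the byte that leaves the window.
    let top = (1..k).fold(1u64, |acc, _| acc.wrapping_mul(BASE));
    let mut h = bytes[..k]
        .iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(BASE).wrapping_add(b as u64));
    let mut out = vec![h];
    for i in k..bytes.len() {
        // Drop the outgoing byte, shift, then add the incoming byte.
        h = h.wrapping_sub((bytes[i - k] as u64).wrapping_mul(top));
        h = h.wrapping_mul(BASE).wrapping_add(bytes[i] as u64);
        out.push(h);
    }
    out
}

/// Winnowing: in every window of `w` consecutive shingle hashes, keep the
/// rightmost minimum. Returns (shingle position, hash) pairs without repeats.
fn winnow(text: &str, k: usize, w: usize) -> Vec<(usize, u64)> {
    let hashes = rolling_hashes(text.as_bytes(), k);
    if hashes.is_empty() || w == 0 {
        return Vec::new();
    }
    // With fewer than `w` hashes, use a single window over all of them.
    let window = w.min(hashes.len());
    let mut selected: Vec<(usize, u64)> = Vec::new();
    for (start, win) in hashes.windows(window).enumerate() {
        // Reversing first makes `min_by_key` return the rightmost minimum.
        let (offset, &min) = win
            .iter()
            .enumerate()
            .rev()
            .min_by_key(|&(_, h)| *h)
            .unwrap();
        let pos = start + offset;
        if selected.last().map(|&(p, _)| p) != Some(pos) {
            selected.push((pos, min));
        }
    }
    selected
}

fn main() {
    let text = "universal content fingerprinting with winnowing";
    let fingerprints = winnow(text, 9, 4);
    println!(
        "kept {} of {} shingle hashes",
        fingerprints.len(),
        rolling_hashes(text.as_bytes(), 9).len()
    );
}
```

A larger `w` keeps fewer hashes per document, while a larger `k` makes each shingle more specific and less tolerant of small edits.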

Performance

| Stage | Latency | Notes |
| --- | --- | --- |
| `ingest` | ~45 μs | Validation + metadata |
| `canonical` | ~180 μs | Unicode NFKC + SHA-256 |
| `perceptual` | ~180 μs | Parallel MinHash LSH |
| `semantic` | ~8.5 ms | ONNX embedding |
| `index` | ~50 μs | Lock-free DashMap |
| `match` | ~50–450 μs | ANN O(log n) at >1K vectors |

Optimizations: Lock-free concurrency, parallel MinHash, HNSW ANN search, HTTP/2 connection pooling, SIMD vector operations.

When exact and perceptual matching are sufficient, disable the semantic stage to cut per-document latency to ~100 μs.
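
The ANN note in the table and the `min_vectors_for_ann` setting suggest that small indexes are searched exhaustively. The sketch below shows one way such a switch could look, with an exact cosine-similarity scan as the small-index path. The names `SearchStrategy`, `choose_strategy`, and `linear_top_k` are hypothetical, the `>=` comparison is an assumption, and HNSW itself is not implemented here.

```rust
/// Which search path to use for a query (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum SearchStrategy {
    LinearScan,
    Ann,
}

/// Assumed rule: switch to ANN once the index holds at least
/// `min_vectors_for_ann` vectors.
fn choose_strategy(num_vectors: usize, min_vectors_for_ann: usize) -> SearchStrategy {
    if num_vectors >= min_vectors_for_ann {
        SearchStrategy::Ann
    } else {
        SearchStrategy::LinearScan
    }
}

/// Cosine similarity of two vectors; zero-length vectors score 0.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Exact O(n) search: score every vector, then return the `k` best
/// as (index, similarity) pairs in descending order of similarity.
fn linear_top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

fn main() {
    let vectors = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let query = [1.0, 0.0];

    println!("strategy: {:?}", choose_strategy(vectors.len(), 1000));
    for (idx, score) in linear_top_k(&query, &vectors, 2) {
        println!("vector {idx}: cosine {score:.3}");
    }
}
```

A linear scan is exact and cheap for small collections; an ANN index such as HNSW only pays off once the collection is large enough to amortize its build and memory cost.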

API

A REST API server is included. A quick example:

curl -X POST http://localhost:8080/api/v1/process \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "text": "Your document content...",
    "enable_semantic": true
  }'

See crates/server/API.md for full API reference.

Roadmap

| Modality | Status | Canonicalizer | Fingerprint | Embedding |
| --- | --- | --- | --- | --- |
| Text | Ready | NFKC + tokenization | MinHash | BGE / E5 |
| Image | Planned | DCT normalization | pHash | CLIP / SigLIP |
| Audio | Planned | Mel-spectrogram | Winnowing | SpeechCLIP / Whisper |
| Video | Planned | Keyframes | Scene hashes | VideoCLIP / XCLIP |
| Document | Planned | OCR + layout | Layout graph | LayoutLMv3 |

Development

./run-ci-local.sh  # Format, lint, test, build

See CONTRIBUTING.md for guidelines.

License

Apache-2.0

ucfp 0.1.0