Universal Content Fingerprinting (UCFP)
Deterministic, reproducible content fingerprints for text, audio, image, video, and documents
UCFP is a Rust framework that unifies exact hashing, perceptual similarity, and semantic embeddings into a single pipeline.
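Cryptographic hashes change completely when content changes even slightly, so on their own they only detect exact duplicates. Semantic search, in turn, needs an understanding of meaning that byte matching cannot provide. UCFP gives you both, exact matches and meaning-based similarity, in one deterministic pipeline.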
- Deduplication — Find exact and near-duplicate content
- Plagiarism Detection — Identify paraphrased text
- Content Provenance — Track content across systems
- Similarity Search — Search by meaning, not just keywords
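To make the gap between exact and near-duplicate matching concrete, here is a minimal std-only sketch. It is not UCFP's API: `exact_hash`, `word_shingles`, and `jaccard` are hypothetical helpers written for illustration, and `DefaultHasher` merely stands in for a cryptographic hash such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash the whole input. Any change to the text yields an unrelated value.
/// Assumption: DefaultHasher is a stand-in for SHA-256, used only to stay std-only.
fn exact_hash(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Split text into the set of overlapping lowercase word k-grams ("shingles").
fn word_shingles(text: &str, k: usize) -> HashSet<Vec<String>> {
    let k = k.max(1);
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    if words.len() < k {
        // Input shorter than one shingle: treat the whole input as a single shingle.
        return std::iter::once(words).collect();
    }
    words.windows(k).map(|w| w.to_vec()).collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range 0.0..=1.0.
fn jaccard(a: &HashSet<Vec<String>>, b: &HashSet<Vec<String>>) -> f64 {
    let union = a.union(b).count() as f64;
    if union == 0.0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union
}
```

For "the quick brown fox jumps over the lazy dog" and the same sentence ending in "cat", `exact_hash` returns two different values, while `jaccard` over 2-word shingles is 7/9: seven of the nine distinct shingles are shared.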
Quickstart
Prerequisites: Rust 1.76+ (rustup toolchain install stable)
# Build & test
# Run examples
Usage
use ;
let record = RawIngestRecord ;
let = process_pipeline?;
println!;
println!;
See examples/ for full pipeline demonstrations.
Full Pipeline Example
Complete workflow from ingest to matching:
use ;
use ;
use ;
// 1. Configure all stages
let ingest_cfg = default;
let canonical_cfg = default;
let perceptual_cfg = default;
let semantic_cfg = default;
// 2. Create index
let index_cfg = new.with_backend;
let index = new.unwrap;
// 3. Ingest a document
let record = RawIngestRecord ;
// 4. Process through pipeline (ingest -> canonical -> perceptual -> semantic)
let = process_pipeline?;
// 5. Store in index
let record = IndexRecord ;
index.upsert?;
// 6. Search with matcher
let matcher = new;
let req = MatchRequest ;
let hits = matcher.match_document?;
println!;
Architecture
| Stage | Responsibility | Key Types |
|---|---|---|
| ingest | Validation, metadata normalization | RawIngestRecord, CanonicalIngestRecord |
| canonical | Unicode NFKC normalization, SHA-256 hashing | CanonicalizedDocument |
| perceptual | Rolling-hash shingles, winnowing, MinHash LSH | PerceptualFingerprint |
| semantic | Dense embeddings via ONNX | SemanticEmbedding |
| index | Storage with HNSW ANN search | UfpIndex, QueryResult |
| match | Query-time matching | Matcher, MatchResult |
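The perceptual row above names MinHash LSH. As a rough illustration of the idea, and not of UCFP's actual implementation, the sketch below builds a MinHash signature over byte shingles and splits it into bands for candidate lookup. All function names are hypothetical, and seeding `DefaultHasher` is an assumption used to simulate a family of hash functions without external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a shingle together with a seed, simulating one member of a hash family.
fn seeded_hash(seed: u64, shingle: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    shingle.hash(&mut hasher);
    hasher.finish()
}

/// MinHash signature: for each seed, keep the minimum hash over all byte k-shingles.
fn minhash_signature(text: &str, k: usize, num_hashes: usize) -> Vec<u64> {
    let bytes = text.as_bytes();
    let k = k.max(1);
    let shingles: Vec<&[u8]> = if bytes.len() <= k {
        vec![bytes]
    } else {
        bytes.windows(k).collect()
    };
    (0..num_hashes as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| seeded_hash(seed, s))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// The fraction of agreeing signature positions estimates the Jaccard similarity.
fn estimated_similarity(a: &[u64], b: &[u64]) -> f64 {
    let n = a.len().min(b.len());
    if n == 0 {
        return 0.0;
    }
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / n as f64
}

/// LSH banding: hash each band of the signature into a single bucket key.
fn band_keys(signature: &[u64], bands: usize) -> Vec<u64> {
    let rows = (signature.len() / bands.max(1)).max(1);
    signature
        .chunks(rows)
        .map(|band| {
            let mut hasher = DefaultHasher::new();
            band.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

/// Two documents become candidates when at least one band key matches position-wise.
fn is_candidate(a: &[u64], b: &[u64], bands: usize) -> bool {
    let keys_a = band_keys(a, bands);
    let keys_b = band_keys(b, bands);
    keys_a.iter().zip(keys_b.iter()).any(|(x, y)| x == y)
}
```

More bands with fewer rows each make candidate matches more likely for moderately similar documents, at the cost of more false positives to verify afterwards.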

Configuration
version: "1.0"
ingest:
  default_tenant_id: "acme-corp"
  max_payload_bytes: 10485760
canonical:
  normalize_unicode: true
  lowercase: true
perceptual:
  k: 9 # shingle size
  w: 4 # winnow window
  minhash_bands: 16
semantic:
  tier: "balanced"
  enable_chunking: true # For documents > 512 tokens
index:
  backend: "redb"
  ann:
    enabled: true
    min_vectors_for_ann: 1000
Load in code:
use UcfpConfig;
let config = from_file?;
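To show what the perceptual settings `k` (shingle size) and `w` (winnow window) control, here is a small sketch of the textbook winnowing algorithm. It is an illustration under assumptions, not UCFP's code: `kgram_hashes` and `winnow` are hypothetical names, and each k-gram is hashed independently with `DefaultHasher` rather than with a rolling hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash every byte k-gram of the text (one hash per starting position).
/// Assumption: a plain per-window hash replaces the rolling hash for brevity.
fn kgram_hashes(text: &str, k: usize) -> Vec<u64> {
    let bytes = text.as_bytes();
    if k == 0 || bytes.len() < k {
        return Vec::new();
    }
    bytes
        .windows(k)
        .map(|gram| {
            let mut hasher = DefaultHasher::new();
            gram.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

/// Winnowing: slide a window of `w` hashes and keep the rightmost minimum of
/// each window, recording a (position, hash) pair only when the selection changes.
fn winnow(hashes: &[u64], w: usize) -> Vec<(usize, u64)> {
    let mut picked: Vec<(usize, u64)> = Vec::new();
    if w == 0 || hashes.len() < w {
        return picked;
    }
    for start in 0..=hashes.len() - w {
        let mut best = start;
        for i in start..start + w {
            // `<=` makes ties resolve to the rightmost position.
            if hashes[i] <= hashes[best] {
                best = i;
            }
        }
        if picked.last().map(|&(pos, _)| pos) != Some(best) {
            picked.push((best, hashes[best]));
        }
    }
    picked
}
```

With winnowing, any shared substring of at least `w + k - 1` bytes is guaranteed to contribute a common fingerprint. For the values shown above (`k: 9`, `w: 4`) that threshold is 12 bytes. Usage is `winnow(&kgram_hashes(text, 9), 4)`.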
Performance
| Stage | Latency | Notes |
|---|---|---|
| ingest | ~45 μs | Validation + metadata |
| canonical | ~180 μs | Unicode NFKC + SHA-256 |
| perceptual | ~180 μs | Parallel MinHash LSH |
| semantic | ~8.5 ms | ONNX embedding |
| index | ~50 μs | Lock-free DashMap |
| match | ~50-450 μs | ANN O(log n) at >1K vectors |
Optimizations: Lock-free concurrency, parallel MinHash, HNSW ANN search, HTTP/2 connection pooling, SIMD vector operations.
Disable the semantic stage to reach ~100 μs/doc when exact and perceptual matching is sufficient.
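For context on the match row, the sketch below shows the exact linear scan that an ANN index such as HNSW approximates: score every stored embedding by cosine similarity and keep the top `k`. This is a baseline illustration only. `cosine` and `top_k` are hypothetical helpers, and whether UCFP falls back to a scan like this below the ANN threshold is an assumption, not something the table states.

```rust
/// Cosine similarity of two vectors; returns 0.0 if either has zero length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Exact top-k search: O(n) comparisons per query, sorted by descending score.
fn top_k<'a>(query: &[f32], corpus: &'a [(String, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = corpus
        .iter()
        .map(|(id, vector)| (id.as_str(), cosine(query, vector)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A scan like this costs time linear in the number of stored vectors, which is why an index with roughly logarithmic query cost pays off once the collection grows.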
API
A REST API server is included. See crates/server/API.md for the full API reference.
Roadmap
| Modality | Status | Canonicalizer | Fingerprint | Embedding |
|---|---|---|---|---|
| Text | Ready | NFKC + tokenization | MinHash | BGE / E5 |
| Image | Planned | DCT normalization | pHash | CLIP / SigLIP |
| Audio | Planned | Mel-spectrogram | Winnowing | SpeechCLIP / Whisper |
| Video | Planned | Keyframes | Scene hashes | VideoCLIP / XCLIP |
| Document | Planned | OCR + layout | Layout graph | LayoutLMv3 |
Development
See CONTRIBUTING.md for guidelines.
License
Apache-2.0