<div align="center">
# Universal Content Fingerprinting (UCFP)
**Deterministic, reproducible content fingerprints for text, audio, image, video, and documents**
</div>
UCFP is a Rust framework that unifies **exact hashing**, **perceptual similarity**, and **semantic embeddings** into a single pipeline.
Traditional hashes fail when content changes slightly. Semantic search requires understanding beyond byte matching. UCFP gives you both—exact matches and meaning-based similarity—in one deterministic pipeline.
- **Deduplication** — Find exact and near-duplicate content
- **Plagiarism Detection** — Identify paraphrased text
- **Content Provenance** — Track content across systems
- **Similarity Search** — Search by meaning, not just keywords
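
The gap between exact and near-duplicate matching can be illustrated with a small, std-only sketch. It is not UCFP code: `exact_hash`, `word_shingles`, and `jaccard` are hypothetical helpers, and `DefaultHasher` stands in for a cryptographic hash such as SHA-256. A one-word edit changes the exact hash completely, while the overlap of word shingles stays high.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Exact fingerprint: any change to the input yields an unrelated value.
/// `DefaultHasher` is only a std-only stand-in for a cryptographic hash.
fn exact_hash(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Set of overlapping lowercase word n-grams ("shingles").
fn word_shingles(text: &str, n: usize) -> HashSet<Vec<String>> {
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    words.windows(n).map(|w| w.to_vec()).collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets score 0.0 here.
fn jaccard(a: &HashSet<Vec<String>>, b: &HashSet<Vec<String>>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

fn main() {
    let original = "the quick brown fox jumps over the lazy dog near the river bank";
    let edited = "the quick brown fox jumps over the lazy dog near the river shore";

    // The exact hashes differ even though only the last word changed.
    println!("exact match: {}", exact_hash(original) == exact_hash(edited));

    // 11 of 13 distinct word pairs are shared, so similarity is about 0.85.
    let sim = jaccard(&word_shingles(original, 2), &word_shingles(edited, 2));
    println!("shingle similarity: {:.2}", sim);
}
```
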
## Quickstart
**Prerequisites**: Rust 1.76+ (`rustup toolchain install stable`)
```bash
# Build & test
cargo test --all
# Run examples
cargo run --example full_pipeline # complete pipeline
cargo run --example pipeline_metrics # with observability
cargo run --package perceptual --example fingerprint_demo
```
## Usage
```rust
use ucfp::{
    CanonicalizeConfig, IngestConfig, IngestPayload, IngestSource,
    PerceptualConfig, RawIngestRecord, PipelineStageConfig, process_pipeline,
};

let record = RawIngestRecord {
    id: "demo".into(),
    source: IngestSource::RawText,
    payload: Some(IngestPayload::Text("Hello world".into())),
    ..Default::default()
};

let (doc, fingerprint, _) = process_pipeline(
    record,
    PipelineStageConfig::Perceptual,
    &IngestConfig::default(),
    &CanonicalizeConfig::default(),
    Some(&PerceptualConfig::default()),
    None,
)?;

println!("Canonical hash: {}", doc.canonical_hash);
println!("MinHash bands: {}", fingerprint.unwrap().minhash_bands.len());
```
See [`examples/`](examples/) for full pipeline demonstrations.
## Full Pipeline Example
Complete workflow from ingest to matching:
```rust
use ucfp::{
    CanonicalizeConfig, IngestConfig, IngestMetadata, IngestPayload, IngestSource,
    PerceptualConfig, RawIngestRecord, SemanticConfig, PipelineStageConfig,
    process_pipeline,
};
use ucfp_index::{BackendConfig, IndexConfig, IndexRecord, UfpIndex};
use ucfp_matcher::{Matcher, MatchConfig, MatchRequest};

// 1. Configure all stages
let ingest_cfg = IngestConfig::default();
let canonical_cfg = CanonicalizeConfig::default();
let perceptual_cfg = PerceptualConfig::default();
let semantic_cfg = SemanticConfig::default();

// 2. Create index
let index_cfg = IndexConfig::new().with_backend(BackendConfig::InMemory);
let index = UfpIndex::new(index_cfg).unwrap();

// 3. Ingest a document
let record = RawIngestRecord {
    id: "doc-001".into(),
    source: IngestSource::RawText,
    metadata: IngestMetadata {
        tenant_id: Some("tenant-a".to_string()),
        doc_id: Some("my-doc".to_string()),
        ..Default::default()
    },
    payload: Some(IngestPayload::Text("Rust memory safety features".into())),
};

// 4. Process through pipeline (ingest -> canonical -> perceptual -> semantic)
let (doc, fingerprint, embedding) = process_pipeline(
    record,
    PipelineStageConfig::Perceptual,
    &ingest_cfg,
    &canonical_cfg,
    Some(&perceptual_cfg),
    Some(&semantic_cfg),
)?;

// 5. Store in index (`fingerprint` and `embedding` are already `Option`s)
let record = IndexRecord {
    doc_id: doc.doc_id.clone(),
    tenant_id: "tenant-a".to_string(),
    canonical_hash: doc.canonical_hash.clone(),
    perceptual_fingerprint: fingerprint,
    semantic_embedding: embedding,
    ..Default::default()
};
index.upsert(record)?;

// 6. Search with matcher
let matcher = Matcher::new(
    index,
    ingest_cfg,
    canonical_cfg,
    perceptual_cfg,
    semantic_cfg,
);
let req = MatchRequest {
    tenant_id: "tenant-a".to_string(),
    query_text: "Rust safety".to_string(),
    config: MatchConfig::default(),
    ..Default::default()
};
let hits = matcher.match_document(&req)?;
println!("Found {} matches", hits.len());
```
## Architecture
| Stage | Responsibility | Key Types |
|-------|----------------|-----------|
| **ingest** | Validation, metadata normalization | `RawIngestRecord`, `CanonicalIngestRecord` |
| **canonical** | Unicode NFKC normalization, SHA-256 hashing | `CanonicalizedDocument` |
| **perceptual** | Rolling-hash shingles, winnowing, MinHash LSH | `PerceptualFingerprint` |
| **semantic** | Dense embeddings via ONNX | `SemanticEmbedding` |
| **index** | Storage with HNSW ANN search | `UfpIndex`, `QueryResult` |
| **match** | Query-time matching | `Matcher`, `MatchResult` |
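
To make the MinHash LSH step of the perceptual stage more concrete, here is a minimal std-only sketch of the general technique. It is not the crate's implementation: `char_shingles`, `minhash_signature`, `band_keys`, and `estimated_jaccard` are hypothetical names, seeding `DefaultHasher` approximates a family of independent hash functions, and the choice of 64 hashes is an assumption (only the 16 bands appear in the configuration).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash a value together with a seed; each seed acts as one hash function.
/// Note: `DefaultHasher` output is not guaranteed stable across Rust releases,
/// so a reproducible pipeline would use a fixed, specified hash instead.
fn seeded_hash<T: Hash>(value: &T, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

/// Overlapping character k-grams of the input.
fn char_shingles(text: &str, k: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(k).map(|w| w.iter().collect()).collect()
}

/// MinHash signature: per hash function, the minimum hash over all shingles.
fn minhash_signature(shingles: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| seeded_hash(s, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Split the signature into bands and hash each band to one bucket key.
/// Two documents become LSH candidates if any band key matches.
fn band_keys(signature: &[u64], bands: usize) -> Vec<u64> {
    let rows = (signature.len() / bands.max(1)).max(1);
    signature
        .chunks(rows)
        .take(bands)
        .enumerate()
        .map(|(i, chunk)| seeded_hash(&chunk, i as u64))
        .collect()
}

/// Fraction of matching signature positions, an estimate of Jaccard similarity.
fn estimated_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len().max(1) as f64
}

fn main() {
    let a = minhash_signature(&char_shingles("Rust memory safety features", 5), 64);
    let b = minhash_signature(&char_shingles("Rust memory safety feature set", 5), 64);

    let (ka, kb) = (band_keys(&a, 16), band_keys(&b, 16));
    let shared = ka.iter().zip(&kb).filter(|(x, y)| x == y).count();

    println!("estimated Jaccard: {:.2}", estimated_jaccard(&a, &b));
    println!("shared bands: {} of {}", shared, ka.len());
}
```

More bands with fewer rows each make candidate matches more likely at lower similarity, at the cost of more false positives.
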

## Configuration
```yaml
version: "1.0"
ingest:
  default_tenant_id: "acme-corp"
  max_payload_bytes: 10485760
canonical:
  normalize_unicode: true
  lowercase: true
perceptual:
  k: 9                    # shingle size
  w: 4                    # winnow window
  minhash_bands: 16
semantic:
  tier: "balanced"
  enable_chunking: true   # For documents > 512 tokens
index:
  backend: "redb"
  ann:
    enabled: true
    min_vectors_for_ann: 1000
```
Load in code:
```rust
use ucfp::config::UcfpConfig;
let config = UcfpConfig::from_file("config.yaml")?;
```
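
The `k` (shingle size) and `w` (winnow window) settings control how rolling-hash shingles are thinned by winnowing. The sketch below shows the general technique with std only; `kgram_hashes` and `winnow` are hypothetical helpers, the polynomial hash and byte-level shingles are assumptions, and UCFP may shingle tokens rather than bytes. In the standard formulation of winnowing, any shared run of at least `w + k - 1` units (12 with the values above) is guaranteed to contribute a common fingerprint.

```rust
/// Polynomial rolling hash of every k-byte window, computed incrementally.
fn kgram_hashes(bytes: &[u8], k: usize) -> Vec<u64> {
    const BASE: u64 = 257;
    if k == 0 || bytes.len() < k {
        return Vec::new();
    }
    // BASE^(k-1): weight of the byte that leaves the window.
    let high = (0..k - 1).fold(1u64, |acc, _| acc.wrapping_mul(BASE));
    let mut hash = bytes[..k]
        .iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(BASE).wrapping_add(b as u64));
    let mut out = vec![hash];
    for i in k..bytes.len() {
        // Drop the outgoing byte, shift, and add the incoming byte.
        hash = hash.wrapping_sub((bytes[i - k] as u64).wrapping_mul(high));
        hash = hash.wrapping_mul(BASE).wrapping_add(bytes[i] as u64);
        out.push(hash);
    }
    out
}

/// Winnowing: in every window of `w` consecutive hashes keep the minimum
/// (rightmost on ties), recording each selected (position, hash) only once.
fn winnow(hashes: &[u64], w: usize) -> Vec<(usize, u64)> {
    let mut selected: Vec<(usize, u64)> = Vec::new();
    if w == 0 || hashes.len() < w {
        return selected;
    }
    for start in 0..=hashes.len() - w {
        let mut best = start;
        for i in start..start + w {
            // `<=` moves to the right on ties, giving the rightmost minimum.
            if hashes[i] <= hashes[best] {
                best = i;
            }
        }
        if selected.last().map(|&(pos, _)| pos) != Some(best) {
            selected.push((best, hashes[best]));
        }
    }
    selected
}

fn main() {
    let text = "Rust memory safety features prevent data races at compile time";
    let hashes = kgram_hashes(text.as_bytes(), 9); // k = 9
    let fingerprints = winnow(&hashes, 4); // w = 4
    println!("{} k-grams reduced to {} fingerprints", hashes.len(), fingerprints.len());
}
```
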
## Performance
| Stage | Latency | Notes |
|-------|---------|-------|
| `ingest` | ~45 μs | Validation + metadata |
| `canonical` | ~180 μs | Unicode NFKC + SHA-256 |
| `perceptual` | ~180 μs | Parallel MinHash LSH |
| `semantic` | ~8.5 ms | ONNX embedding |
| `index` | ~50 μs | Lock-free DashMap |
| `match` | ~50-450 μs | ANN O(log n) at >1K vectors |

**Optimizations**: Lock-free concurrency, parallel MinHash, HNSW ANN search, HTTP/2 connection pooling, SIMD vector operations.
Disable the semantic stage when exact and perceptual matching are sufficient; per-document latency then drops to roughly 100 μs.
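
The `match` row and the `min_vectors_for_ann` setting suggest that small indexes are scanned exactly while larger ones switch to approximate search. A minimal sketch of that idea follows, assuming cosine similarity and a `>=` comparison against the threshold. `SearchStrategy`, `choose_strategy`, and `brute_force_top_k` are hypothetical names, not the crate's API, and the HNSW side is not implemented here.

```rust
/// Cosine similarity of two vectors; zero vectors score 0.0.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Exact linear scan: scores every stored vector, O(n) per query.
fn brute_force_top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

#[derive(Debug, PartialEq)]
enum SearchStrategy {
    LinearScan,
    Ann,
}

/// Assumed rule: use ANN once the index holds at least the configured minimum.
fn choose_strategy(num_vectors: usize, min_vectors_for_ann: usize) -> SearchStrategy {
    if num_vectors >= min_vectors_for_ann {
        SearchStrategy::Ann
    } else {
        SearchStrategy::LinearScan
    }
}

fn main() {
    let vectors = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let query = [1.0, 0.0];

    println!("strategy: {:?}", choose_strategy(vectors.len(), 1000));
    for (idx, score) in brute_force_top_k(&query, &vectors, 2) {
        println!("vector {} scored {:.3}", idx, score);
    }
}
```
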
## API
A REST API server is included. For example, to process a document:
```bash
curl -X POST http://localhost:8080/api/v1/process \
-H "Content-Type: application/json" \
-H "X-API-Key: your-api-key" \
-d '{
"text": "Your document content...",
"enable_semantic": true
}'
```
See [`crates/server/API.md`](crates/server/API.md) for full API reference.
## Roadmap
| Modality | Status | Canonicalization | Perceptual | Semantic |
|----------|--------|------------------|------------|----------|
| **Text** | Ready | NFKC + tokenization | MinHash | BGE / E5 |
| **Image** | Planned | DCT normalization | pHash | CLIP / SigLIP |
| **Audio** | Planned | Mel-spectrogram | Winnowing | SpeechCLIP / Whisper |
| **Video** | Planned | Keyframes | Scene hashes | VideoCLIP / XCLIP |
| **Document** | Planned | OCR + layout | Layout graph | LayoutLMv3 |
## Development
```bash
./run-ci-local.sh # Format, lint, test, build
```
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## License
Apache-2.0