UCFP Index
This crate provides a backend-agnostic index for storing and searching Universal Content Fingerprinting (UCFP) records. It is designed to handle canonical hashes, perceptual fingerprints, and semantic embeddings, offering a unified interface for persistence and retrieval.
Core Features
- Pluggable Backends: Supports multiple storage backends through a common [IndexBackend] trait. Out of the box, it provides:
  - Redb (default): Pure Rust ACID-compliant embedded database. No C++ dependencies.
  - In-memory: HashMap-based backend for fast, ephemeral storage (ideal for testing).
- Flexible Configuration: All behaviors, including the choice of backend, compression, and quantization strategies, are configured at runtime via the [IndexConfig] struct.
- Efficient Storage:
  - Quantization: Provides utilities to quantize `f32` embeddings into `i8` vectors to reduce storage space and improve query performance. Use the `quantize` or `quantize_with_strategy` methods before creating `IndexRecord` instances.
  - Compression: Compresses serialized records (using Zstd by default) before writing to the backend.
- Similarity Search: Provides search capabilities for both semantic and perceptual fingerprints:
  - Semantic Search: Computes cosine similarity on quantized embeddings.
  - Perceptual Search: Computes Jaccard similarity on MinHash signatures.
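
The quantization and similarity steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function names `quantize_sketch`, `cosine_i8`, and `jaccard_minhash` are hypothetical, the fixed `scale` parameter is an assumption (the crate's `quantize` and `quantize_with_strategy` may scale differently), and the MinHash comparison uses the common position-wise estimate, which may differ from how the crate computes Jaccard similarity.

```rust
/// Quantize f32 values to i8 by multiplying with `scale`, rounding, and clamping.
/// Assumption: inputs are roughly in [-1, 1], so a scale of 127.0 uses the full range.
fn quantize_sketch(embedding: &[f32], scale: f32) -> Vec<i8> {
    embedding
        .iter()
        .map(|&x| (x * scale).round().clamp(-128.0, 127.0) as i8)
        .collect()
}

/// Cosine similarity between two quantized vectors.
/// Returns 0.0 for empty, mismatched, or all-zero inputs.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0i64, 0i64, 0i64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as i64, y as i64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0 || norm_b == 0 {
        return 0.0;
    }
    (dot as f64 / ((norm_a as f64).sqrt() * (norm_b as f64).sqrt())) as f32
}

/// Estimate Jaccard similarity from two MinHash signatures of equal length:
/// the fraction of signature positions that hold the same value.
fn jaccard_minhash(a: &[u64], b: &[u64]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f32 / a.len() as f32
}
```

Accumulating in `i64` keeps the dot product and norms exact for any realistic embedding length, so the only rounding happens in the final division.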
Backend Selection Guide
| Backend | Use Case | Dependencies | Compile Time |
|---|---|---|---|
| Redb (default) | Production, single-node | None (pure Rust) | Fast |
| InMemory | Testing, development | None | Fastest |
Why Redb?
Redb is the default backend because:
- No C++ dependencies: Compiles with just the Rust toolchain (no clang/LLVM required)
- ACID transactions: Crash-safe by default with MVCC
- Pure Rust: Better integration with Rust ecosystem, easier cross-compilation
- Fast compilation: No C++ compilation overhead
- Future-proof: Easy migration path to PostgreSQL when horizontal scaling is needed
Configuration Examples
use …;
// Redb (default, recommended)
let config = IndexConfig::new()
    .with_backend(…);
// In-memory (testing)
let config = IndexConfig::new()
    .with_backend(…);
Key Concepts
The central struct is [UfpIndex], which provides a high-level API for
interacting with the index. It handles the details of serialization,
compression, and quantization, allowing callers to work with the simple
[IndexRecord] struct.
The [IndexBackend] trait abstracts the underlying storage mechanism, making
it easy to swap out backends or implement custom ones.
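
To illustrate how a storage trait lets the index stay independent of its backend, here is a minimal sketch. Everything in it is hypothetical: `KvBackend`, `MemoryBackend`, and `SketchIndex` are illustrative names, and the method signatures are assumptions rather than the crate's actual [IndexBackend] or [UfpIndex] API.

```rust
use std::collections::HashMap;

/// Hypothetical storage abstraction; the real `IndexBackend` trait may differ.
trait KvBackend {
    /// Insert or overwrite the value stored under `key`.
    fn put(&mut self, key: &str, value: Vec<u8>);
    /// Fetch a copy of the value stored under `key`, if any.
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    /// Remove `key`; returns true if an entry existed.
    fn delete(&mut self, key: &str) -> bool;
}

/// HashMap-backed implementation, analogous in spirit to an in-memory backend.
#[derive(Default)]
struct MemoryBackend {
    entries: HashMap<String, Vec<u8>>,
}

impl KvBackend for MemoryBackend {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }

    fn delete(&mut self, key: &str) -> bool {
        self.entries.remove(key).is_some()
    }
}

/// Index wrapper that only talks to the trait, so any backend can be swapped in.
struct SketchIndex<B: KvBackend> {
    backend: B,
}

impl<B: KvBackend> SketchIndex<B> {
    fn new(backend: B) -> Self {
        Self { backend }
    }

    /// Store the payload under its hash, replacing any previous entry.
    fn upsert(&mut self, hash: &str, payload: &[u8]) {
        self.backend.put(hash, payload.to_vec());
    }

    fn fetch(&self, hash: &str) -> Option<Vec<u8>> {
        self.backend.get(hash)
    }

    fn remove(&mut self, hash: &str) -> bool {
        self.backend.delete(hash)
    }
}
```

A persistent backend would implement the same three methods against a database file, and `SketchIndex` would need no changes.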
Example Usage
use …;
use serde_json::json;
// Configure with Redb (default, persistent storage)
let config = IndexConfig::new().with_backend(…);
let index = UfpIndex::new(config).unwrap();
// Create and insert a record
let record = IndexRecord { … };
index.upsert(…).unwrap();
// Search for similar records
let query_record = IndexRecord { … };
let results = index.search(…).unwrap();
// assert_eq!(results.len(), 1);
// assert_eq!(results[0].canonical_hash, "doc-1");