§UCFP Index
This crate provides a backend-agnostic index for storing and searching Universal Content Fingerprinting (UCFP) records. It is designed to handle canonical hashes, perceptual fingerprints, and semantic embeddings, offering a unified interface for persistence and retrieval.
§Core Features
- Pluggable Backends: Supports multiple storage backends through a common IndexBackend trait. Out of the box, it provides:
  - Redb (default): Pure Rust ACID-compliant embedded database. No C++ dependencies.
  - In-memory: HashMap-based backend for fast, ephemeral storage (ideal for testing).
- Flexible Configuration: All behaviors, including the choice of backend, compression, and quantization strategies, are configured at runtime via the IndexConfig struct.
- Efficient Storage:
  - Quantization: Provides utilities to quantize f32 embeddings into i8 vectors to reduce storage space and improve query performance. Use the quantize or quantize_with_strategy methods before creating IndexRecord instances.
  - Compression: Compresses serialized records (using Zstd by default) before writing to the backend.
- Similarity Search: Provides search capabilities for both semantic and perceptual fingerprints:
  - Semantic Search: Computes cosine similarity on quantized embeddings.
  - Perceptual Search: Computes Jaccard similarity on MinHash signatures.
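To make the quantization and similarity bullets concrete, here is a minimal, self-contained sketch of the three operations they name. It does not use this crate's API: the function names `quantize_i8`, `cosine_i8`, and `minhash_jaccard` are hypothetical, the symmetric max-abs scaling is only one possible quantization strategy, and the `u64` signature element type is an assumption.

```rust
/// Hypothetical helper: symmetric linear quantization of f32 values into i8.
/// The largest magnitude in the input is mapped to 127.
fn quantize_i8(embedding: &[f32]) -> Vec<i8> {
    let max_abs = embedding.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        return vec![0; embedding.len()];
    }
    let scale = 127.0 / max_abs;
    embedding
        .iter()
        .map(|v| (v * scale).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

/// Cosine similarity computed directly on quantized vectors.
/// Returns 0.0 for mismatched lengths or zero-norm input.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0i64, 0i64, 0i64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as i64, y as i64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0 || norm_b == 0 {
        return 0.0;
    }
    (dot as f64 / ((norm_a as f64).sqrt() * (norm_b as f64).sqrt())) as f32
}

/// Jaccard estimate for two MinHash signatures of equal length:
/// the fraction of positions at which the signatures agree.
fn minhash_jaccard(a: &[u64], b: &[u64]) -> f32 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f32 / a.len() as f32
}

fn main() {
    // Two embeddings pointing in the same direction quantize identically.
    let a = quantize_i8(&[0.5, -1.0, 0.25]);
    let b = quantize_i8(&[1.0, -2.0, 0.5]);
    println!("quantized a = {:?}", a);
    println!("cosine = {:.3}", cosine_i8(&a, &b));
    // Signatures that agree in two of three positions.
    println!("jaccard = {:.3}", minhash_jaccard(&[1, 2, 3], &[1, 2, 4]));
}
```

Because cosine similarity is scale-invariant, a per-vector scale factor does not need to be stored for the semantic comparison in this sketch; a strategy that needs to reconstruct the original magnitudes would have to keep it.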
§Backend Selection Guide
| Backend | Use Case | Dependencies | Compile Time |
|---|---|---|---|
| Redb (default) | Production, single-node | None (pure Rust) | Fast |
| InMemory | Testing, development | None | Fastest |
§Why Redb?
Redb is the default backend because:
- No C++ dependencies: Compiles with just the Rust toolchain (no clang/LLVM required)
- ACID transactions: Crash-safe by default with MVCC
- Pure Rust: Better integration with Rust ecosystem, easier cross-compilation
- Fast compilation: No C++ compilation overhead
- Future-proof: Easy migration path to PostgreSQL when horizontal scaling is needed
§Configuration Examples
use index::{UfpIndex, IndexConfig, BackendConfig};

// Redb (default, recommended)
let config = IndexConfig::new()
    .with_backend(BackendConfig::redb("/data/ucfp.redb"));

// In-memory (testing)
let config = IndexConfig::new()
    .with_backend(BackendConfig::in_memory());

§Key Concepts
The central struct is UfpIndex, which provides a high-level API for
interacting with the index. It handles the details of serialization,
compression, and quantization, allowing callers to work with the simple
IndexRecord struct.
The IndexBackend trait abstracts the underlying storage mechanism, making
it easy to swap out backends or implement custom ones.
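As an illustration of how a storage trait decouples callers from the storage engine, the sketch below defines a small key-value trait and an in-memory implementation built on an `RwLock` around a `HashMap`. The names `KvBackend`, `MemoryBackend`, and `Store`, as well as the method signatures and the `String` error type, are hypothetical stand-ins; the actual `IndexBackend` trait in this crate may look different.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a storage trait; the crate's actual
/// `IndexBackend` methods and signatures may differ.
trait KvBackend: Send + Sync {
    fn put(&self, key: &str, value: Vec<u8>) -> Result<(), String>;
    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String>;
    fn delete(&self, key: &str) -> Result<(), String>;
}

/// In-memory implementation: a `HashMap` guarded by an `RwLock`.
#[derive(Default)]
struct MemoryBackend {
    map: RwLock<HashMap<String, Vec<u8>>>,
}

impl KvBackend for MemoryBackend {
    fn put(&self, key: &str, value: Vec<u8>) -> Result<(), String> {
        // A poisoned lock is reported as an error instead of panicking.
        let mut map = self.map.write().map_err(|e| e.to_string())?;
        map.insert(key.to_string(), value);
        Ok(())
    }

    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String> {
        let map = self.map.read().map_err(|e| e.to_string())?;
        Ok(map.get(key).cloned())
    }

    fn delete(&self, key: &str) -> Result<(), String> {
        let mut map = self.map.write().map_err(|e| e.to_string())?;
        map.remove(key);
        Ok(())
    }
}

/// Callers hold a trait object, so the storage engine can be swapped
/// without changing the code that reads and writes records.
struct Store {
    backend: Box<dyn KvBackend>,
}

fn main() {
    let store = Store { backend: Box::new(MemoryBackend::default()) };
    store.backend.put("doc-1", b"payload".to_vec()).unwrap();
    assert_eq!(store.backend.get("doc-1").unwrap(), Some(b"payload".to_vec()));
    store.backend.delete("doc-1").unwrap();
    assert_eq!(store.backend.get("doc-1").unwrap(), None);
}
```

A persistent backend would implement the same trait over its own transactions, and `Store` would not need to change.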
§Example Usage
use index::{UfpIndex, IndexConfig, BackendConfig, IndexRecord, QueryMode, INDEX_SCHEMA_VERSION};
use serde_json::json;

// Configure with Redb (default, persistent storage)
let config = IndexConfig::new().with_backend(BackendConfig::redb("/tmp/ucfp.redb"));
let index = UfpIndex::new(config).unwrap();

// Create and insert a record
let record = IndexRecord {
    schema_version: INDEX_SCHEMA_VERSION,
    canonical_hash: "doc-1".to_string(),
    perceptual: Some(vec![1, 2, 3]),
    embedding: Some(vec![10, 20, 30]),
    metadata: json!({ "title": "My Document" }),
};
index.upsert(&record).unwrap();

// Search for similar records
let query_record = IndexRecord {
    schema_version: INDEX_SCHEMA_VERSION,
    canonical_hash: "query-1".to_string(),
    perceptual: Some(vec![1, 2, 4]),
    embedding: Some(vec![11, 22, 33]),
    metadata: json!({}),
};
let results = index.search(&query_record, QueryMode::Perceptual, 10).unwrap();
// assert_eq!(results.len(), 1);
// assert_eq!(results[0].canonical_hash, "doc-1");

Modules§
- ann: Approximate Nearest Neighbor (ANN) search using the HNSW algorithm.
Structs§
- CompressionConfig: Compression behavior configuration.
- InMemoryBackend: An in-memory backend using a RwLock around a HashMap.
- IndexConfig: Config for initializing the index.
- IndexRecord: Unified index record for any modality.
- QueryResult: Result entry for a similarity query.
- RedbBackend: Redb backend implementation for persistent key-value storage.
- UfpIndex: Index structure with lock-free concurrent access via DashMap and ANN support.
Enums§
- BackendConfig: Configuration for selecting and building a backend.
- CompressionCodec: Compression codec options for index storage.
- IndexError: Custom error type.
- QuantizationConfig: Quantization strategies for embeddings.
- QueryMode: Defines the search mode.
Constants§
- INDEX_SCHEMA_VERSION: Bump this value whenever the on-disk IndexRecord layout changes.
Traits§
- IndexBackend: Trait for a key-value storage backend for the index. This allows for different storage implementations (e.g., in-memory, Redb).
Type Aliases§
- QuantizedVec: Quantized embedding type (compact float representation).