content-index 0.1.0

UCFP backend-agnostic vector and fingerprint index

UCFP Index

This crate provides a backend-agnostic index for storing and searching Universal Content Fingerprinting (UCFP) records. It is designed to handle canonical hashes, perceptual fingerprints, and semantic embeddings, offering a unified interface for persistence and retrieval.

Core Features

  • Pluggable Backends: Supports multiple storage backends through a common [IndexBackend] trait. Out of the box, it provides:
    • Redb (default): Pure Rust ACID-compliant embedded database. No C++ dependencies.
    • In-memory: HashMap-based backend for fast, ephemeral storage (ideal for testing).
  • Flexible Configuration: All behaviors, including the choice of backend, compression, and quantization strategies, are configured at runtime via the [IndexConfig] struct.
  • Efficient Storage:
    • Quantization: Provides utilities to quantize f32 embeddings into i8 vectors to reduce storage space and improve query performance. Use the quantize or quantize_with_strategy methods before creating IndexRecord instances.
    • Compression: Compresses serialized records (using Zstd by default) before writing to the backend.
  • Similarity Search: Provides search capabilities for both semantic and perceptual fingerprints:
    • Semantic Search: Computes cosine similarity on quantized embeddings.
    • Perceptual Search: Computes Jaccard similarity on MinHash signatures.
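
The quantization and semantic search bullets above can be illustrated with a small, standalone sketch. It assumes symmetric linear quantization (the largest absolute value maps to 127) and computes cosine similarity directly on the resulting i8 vectors. The names quantize_sketch and cosine_i8 are hypothetical and are not part of this crate's API; the crate's own quantize and quantize_with_strategy methods may use a different scheme.

```rust
/// Hypothetical helper (not the crate's `quantize`): symmetric linear
/// quantization of f32 values into i8. Returns the quantized vector and
/// the scale factor that was applied.
fn quantize_sketch(values: &[f32]) -> (Vec<i8>, f32) {
    // Largest absolute value in the input decides the scale.
    let max_abs = values.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        // All-zero (or empty) input: nothing to scale.
        return (vec![0; values.len()], 1.0);
    }
    let scale = 127.0 / max_abs;
    let quantized: Vec<i8> = values
        .iter()
        .map(|v| (v * scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Hypothetical helper: cosine similarity on i8 vectors.
/// Accumulates in i64 so the products cannot overflow.
/// Returns 0.0 for mismatched lengths, empty input, or zero vectors.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0i64, 0i64, 0i64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as i64, y as i64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0 || norm_b == 0 {
        return 0.0;
    }
    (dot as f64 / ((norm_a as f64).sqrt() * (norm_b as f64).sqrt())) as f32
}
```

Because cosine similarity is invariant to a positive scale factor, two vectors quantized with different scales can still be compared without rescaling; only the rounding step introduces error.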

Backend Selection Guide

Backend          Use Case                  Dependencies        Compile Time
Redb (default)   Production, single-node   None (pure Rust)    Fast
InMemory         Testing, development      None                Fastest

Why Redb?

Redb is the default backend because:

  • No C++ dependencies: Compiles with just Rust toolchain (no clang/LLVM required)
  • ACID transactions: Crash-safe by default with MVCC
  • Pure Rust: Better integration with Rust ecosystem, easier cross-compilation
  • Fast compilation: No C++ compilation overhead
  • Future-proof: Easy migration path to PostgreSQL when horizontal scaling is needed

Configuration Examples

use index::{UfpIndex, IndexConfig, BackendConfig};

// Redb (default, recommended)
let config = IndexConfig::new()
    .with_backend(BackendConfig::redb("/data/ucfp.redb"));

// In-memory (testing)
let config = IndexConfig::new()
    .with_backend(BackendConfig::in_memory());

Key Concepts

The central struct is [UfpIndex], which provides a high-level API for interacting with the index. It handles the details of serialization, compression, and quantization, allowing callers to work with the simple [IndexRecord] struct.

The [IndexBackend] trait abstracts the underlying storage mechanism, making it easy to swap out backends or implement custom ones.
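
To make the idea of a swappable backend concrete, here is a minimal sketch of a key-value storage trait with a HashMap-based implementation. The trait name KvBackendSketch, its methods, and the String error type are all assumptions made for illustration; the actual [IndexBackend] trait may expose different methods and errors.

```rust
use std::collections::HashMap;

/// Hypothetical storage trait standing in for the crate's `IndexBackend`.
/// Method names and the error type are assumptions, not the real API.
trait KvBackendSketch {
    fn put(&mut self, key: &str, value: Vec<u8>) -> Result<(), String>;
    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String>;
    fn delete(&mut self, key: &str) -> Result<(), String>;
}

/// Ephemeral backend backed by a HashMap, similar in spirit to an
/// in-memory backend used for tests.
#[derive(Default)]
struct MemoryBackendSketch {
    entries: HashMap<String, Vec<u8>>,
}

impl KvBackendSketch for MemoryBackendSketch {
    fn put(&mut self, key: &str, value: Vec<u8>) -> Result<(), String> {
        self.entries.insert(key.to_string(), value);
        Ok(())
    }

    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String> {
        Ok(self.entries.get(key).cloned())
    }

    fn delete(&mut self, key: &str) -> Result<(), String> {
        self.entries.remove(key);
        Ok(())
    }
}

/// Code written against the trait object works with any backend that
/// implements it, which is what makes backends swappable.
fn store_and_load(
    backend: &mut dyn KvBackendSketch,
    key: &str,
    bytes: &[u8],
) -> Result<Option<Vec<u8>>, String> {
    backend.put(key, bytes.to_vec())?;
    backend.get(key)
}
```

In this sketch the values are opaque byte vectors, so serialization and compression can happen in the layer above the backend and the backend itself only has to move bytes.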

Example Usage

use index::{UfpIndex, IndexConfig, BackendConfig, IndexRecord, QueryMode, INDEX_SCHEMA_VERSION};
use serde_json::json;

// Configure with Redb (default, persistent storage)
let config = IndexConfig::new().with_backend(BackendConfig::redb("/tmp/ucfp.redb"));
let index = UfpIndex::new(config).unwrap();

// Create and insert a record
let record = IndexRecord {
    schema_version: INDEX_SCHEMA_VERSION,
    canonical_hash: "doc-1".to_string(),
    perceptual: Some(vec![1, 2, 3]),
    embedding: Some(vec![10, 20, 30]),
    metadata: json!({ "title": "My Document" }),
};
index.upsert(&record).unwrap();

// Search for similar records
let query_record = IndexRecord {
    schema_version: INDEX_SCHEMA_VERSION,
    canonical_hash: "query-1".to_string(),
    perceptual: Some(vec![1, 2, 4]),
    embedding: Some(vec![11, 22, 33]),
    metadata: json!({}),
};

let results = index.search(&query_record, QueryMode::Perceptual, 10).unwrap();
// assert_eq!(results.len(), 1);
// assert_eq!(results[0].canonical_hash, "doc-1");
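
For intuition about what a perceptual query like the one above might compute, the following standalone sketch scores every stored MinHash signature against the query and keeps the best top_k matches. It assumes signatures are equal-length u64 vectors and uses the standard MinHash estimate of Jaccard similarity (the fraction of positions where two signatures agree). StoredSketch, minhash_similarity, and top_k_perceptual are hypothetical names; the crate's internal search may be implemented differently.

```rust
/// Hypothetical stand-in for a stored record: an ID plus a MinHash signature.
struct StoredSketch {
    id: String,
    signature: Vec<u64>,
}

/// MinHash estimate of Jaccard similarity: the fraction of positions
/// at which the two signatures hold the same value.
/// Returns 0.0 for mismatched lengths or empty signatures.
fn minhash_similarity(a: &[u64], b: &[u64]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f32 / a.len() as f32
}

/// Linear scan: score every record, sort by score descending,
/// and keep at most `top_k` results.
fn top_k_perceptual<'a>(
    records: &'a [StoredSketch],
    query: &[u64],
    top_k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = records
        .iter()
        .map(|r| (r.id.as_str(), minhash_similarity(&r.signature, query)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k);
    scored
}
```

With the signatures from the example, [1, 2, 3] and [1, 2, 4] agree in two of three positions, so this estimator gives a similarity of 2/3.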