
Crate content_index


§UCFP Index

This crate provides a backend-agnostic index for storing and searching Universal Content Fingerprinting (UCFP) records. It is designed to handle canonical hashes, perceptual fingerprints, and semantic embeddings, offering a unified interface for persistence and retrieval.

§Core Features

  • Pluggable Backends: Supports multiple storage backends through a common IndexBackend trait. Out of the box, it provides:
    • Redb (default): Pure Rust ACID-compliant embedded database. No C++ dependencies.
    • In-memory: HashMap-based backend for fast, ephemeral storage (ideal for testing).
  • Flexible Configuration: All behaviors, including the choice of backend, compression, and quantization strategies, are configured at runtime via the IndexConfig struct.
  • Efficient Storage:
    • Quantization: Provides utilities to quantize f32 embeddings into i8 vectors to reduce storage space and improve query performance. Use the quantize or quantize_with_strategy methods before creating IndexRecord instances.
    • Compression: Compresses serialized records (using Zstd by default) before writing to the backend.
  • Similarity Search: Provides search capabilities for both semantic and perceptual fingerprints:
    • Semantic Search: Computes cosine similarity on quantized embeddings.
    • Perceptual Search: Computes Jaccard similarity on MinHash signatures.
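The quantization and semantic search features above can be illustrated with a minimal, standalone sketch. It does not use this crate's API: `quantize_linear` and `cosine_i8` are hypothetical helpers, and the symmetric linear scaling shown here is only one possible strategy, assumed for illustration. The crate's own `quantize` and `quantize_with_strategy` methods may behave differently.

```rust
/// Hypothetical helper (not part of this crate): symmetric linear
/// quantization of f32 values into i8. Each value is multiplied by `scale`,
/// rounded, and clamped to the range [-127, 127].
fn quantize_linear(embedding: &[f32], scale: f32) -> Vec<i8> {
    embedding
        .iter()
        .map(|&x| (x * scale).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

/// Hypothetical helper: cosine similarity computed directly on quantized
/// vectors. Returns 0.0 for empty, mismatched, or all-zero inputs.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    // Accumulate in i64 so products of i8 values cannot overflow.
    let (mut dot, mut norm_a, mut norm_b) = (0i64, 0i64, 0i64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (i64::from(x), i64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0 || norm_b == 0 {
        return 0.0;
    }
    (dot as f64 / ((norm_a as f64).sqrt() * (norm_b as f64).sqrt())) as f32
}
```

Because cosine similarity is invariant to uniform scaling, a single scale factor shared by all vectors leaves the ranking largely intact, apart from rounding and clamping error.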

§Backend Selection Guide

Backend        | Use Case                | Dependencies     | Compile Time
Redb (default) | Production, single-node | None (pure Rust) | Fast
InMemory       | Testing, development    | None             | Fastest

§Why Redb?

Redb is the default backend because:

  • No C++ dependencies: Compiles with just the Rust toolchain (no clang/LLVM required)
  • ACID transactions: Crash-safe by default with MVCC
  • Pure Rust: Better integration with Rust ecosystem, easier cross-compilation
  • Fast compilation: No C++ compilation overhead
  • Future-proof: Easy migration path to PostgreSQL when horizontal scaling is needed

§Configuration Examples

use content_index::{BackendConfig, IndexConfig};

// Redb (default, recommended)
let config = IndexConfig::new()
    .with_backend(BackendConfig::redb("/data/ucfp.redb"));

// In-memory (testing)
let config = IndexConfig::new()
    .with_backend(BackendConfig::in_memory());

§Key Concepts

The central struct is UfpIndex, which provides a high-level API for interacting with the index. It handles the details of serialization, compression, and quantization, allowing callers to work with the simple IndexRecord struct.

The IndexBackend trait abstracts the underlying storage mechanism, making it easy to swap out backends or implement custom ones.
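To make the backend abstraction concrete, here is a minimal sketch of the general pattern. The `KvBackend` trait, its method names, and the `String` error type are assumptions made for illustration. They are not the actual `IndexBackend` definition, which should be checked in the trait's own documentation. The `MemoryBackend` type mirrors the "RwLock around a HashMap" approach described for `InMemoryBackend`.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a key-value backend trait. The real
/// `IndexBackend` method names, signatures, and error type may differ.
trait KvBackend: Send + Sync {
    fn put(&self, key: &str, value: &[u8]) -> Result<(), String>;
    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String>;
    fn delete(&self, key: &str) -> Result<(), String>;
}

/// Illustrative in-memory backend: a `RwLock` around a `HashMap`.
#[derive(Default)]
struct MemoryBackend {
    map: RwLock<HashMap<String, Vec<u8>>>,
}

impl KvBackend for MemoryBackend {
    fn put(&self, key: &str, value: &[u8]) -> Result<(), String> {
        // A poisoned lock is surfaced as an error instead of panicking.
        let mut map = self.map.write().map_err(|e| e.to_string())?;
        map.insert(key.to_owned(), value.to_vec());
        Ok(())
    }

    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String> {
        let map = self.map.read().map_err(|e| e.to_string())?;
        Ok(map.get(key).cloned())
    }

    fn delete(&self, key: &str) -> Result<(), String> {
        let mut map = self.map.write().map_err(|e| e.to_string())?;
        map.remove(key);
        Ok(())
    }
}

/// Code written against the trait object works with any backend.
fn store_and_load(
    backend: &dyn KvBackend,
    key: &str,
    value: &[u8],
) -> Result<Option<Vec<u8>>, String> {
    backend.put(key, value)?;
    backend.get(key)
}
```

Keeping the trait limited to opaque byte values is one plausible design: serialization and compression can then live in the higher-level index type, so a custom backend only has to move bytes.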

§Example Usage

use content_index::{UfpIndex, IndexConfig, BackendConfig, IndexRecord, QueryMode, INDEX_SCHEMA_VERSION};
use serde_json::json;

// Configure with Redb (default, persistent storage)
let config = IndexConfig::new().with_backend(BackendConfig::redb("/tmp/ucfp.redb"));
let index = UfpIndex::new(config).unwrap();

// Create and insert a record
let record = IndexRecord {
    schema_version: INDEX_SCHEMA_VERSION,
    canonical_hash: "doc-1".to_string(),
    perceptual: Some(vec![1, 2, 3]),
    embedding: Some(vec![10, 20, 30]),
    metadata: json!({ "title": "My Document" }),
};
index.upsert(&record).unwrap();

// Search for similar records
let query_record = IndexRecord {
    schema_version: INDEX_SCHEMA_VERSION,
    canonical_hash: "query-1".to_string(),
    perceptual: Some(vec![1, 2, 4]),
    embedding: Some(vec![11, 22, 33]),
    metadata: json!({}),
};

let results = index.search(&query_record, QueryMode::Perceptual, 10).unwrap();
// assert_eq!(results.len(), 1);
// assert_eq!(results[0].canonical_hash, "doc-1");
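The perceptual query above compares MinHash signatures by Jaccard similarity. As a rough illustration of what such a comparison can look like, the sketch below uses the standard MinHash estimator: the fraction of signature positions whose values agree. The function name `minhash_jaccard` and the `u64` element type are assumptions, and the crate may compute the score differently (for example, as a set-based intersection over union).

```rust
/// Hypothetical helper (not part of this crate): estimates Jaccard
/// similarity from two MinHash signatures of equal length as the fraction
/// of positions whose values agree. Returns 0.0 for empty or
/// mismatched-length signatures.
fn minhash_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Under this estimator, the signatures `[1, 2, 3]` and `[1, 2, 4]` from the example agree in two of three positions, giving a score of about 0.67.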

Modules§

ann
Approximate Nearest Neighbor (ANN) search using HNSW algorithm.

Structs§

CompressionConfig
Compression behavior configuration.
InMemoryBackend
An in-memory backend using a RwLock around a HashMap.
IndexConfig
Config for initializing the index.
IndexRecord
Unified index record for any modality.
QueryResult
Result entry for a similarity query.
RedbBackend
Redb backend implementation for persistent key-value storage.
UfpIndex
Index structure with lock-free concurrent access via DashMap and ANN support.

Enums§

BackendConfig
Configuration for selecting and building a backend.
CompressionCodec
Compression codec options for index storage.
IndexError
Custom error type
QuantizationConfig
Quantization strategies for embeddings.
QueryMode
Defines the search mode

Constants§

INDEX_SCHEMA_VERSION
Bump this value whenever the on-disk IndexRecord layout changes.

Traits§

IndexBackend
Trait for a key-value storage backend for the index. This allows for different storage implementations (e.g., in-memory, Redb).

Type Aliases§

QuantizedVec
Quantized embedding type (compact i8 representation of f32 embeddings).