UCFP Ingest Layer - Content Ingestion and Validation
This crate provides the entry point to the Universal Content Fingerprinting (UCFP) pipeline, transforming raw content and metadata into clean, deterministic records suitable for downstream processing.
Overview
The ingest crate is responsible for:
- Validation: Enforcing metadata policies, size limits, and business rules
- Normalization: Collapsing whitespace, stripping control characters, sanitizing inputs
- ID Generation: Deriving stable document IDs using UUIDv5 when not explicitly provided
- Multi-modal Support: Handling text, binary, and structured payloads uniformly
- Observability: Structured logging via tracing for production debugging
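To make the normalization responsibility concrete, here is a minimal standalone sketch of whitespace collapsing and control-character stripping. The function name normalize_text and its exact rules (trim both ends, collapse runs to one space, drop non-whitespace control characters) are assumptions for illustration, not the crate's actual API.

```rust
/// Hypothetical helper (not part of the crate): collapses runs of whitespace
/// into a single space, trims both ends, and drops control characters.
fn normalize_text(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut pending_space = false;
    for ch in input.chars() {
        if ch.is_whitespace() {
            // Emit the separator lazily so leading and trailing
            // whitespace never reaches the output.
            pending_space = !out.is_empty();
        } else if ch.is_control() {
            // Non-whitespace control characters (NUL, BEL, ...) are stripped.
            continue;
        } else {
            if pending_space {
                out.push(' ');
                pending_space = false;
            }
            out.push(ch);
        }
    }
    out
}
```

Because the function depends only on its input, calling it twice with the same text yields the same result, which is the property fingerprinting relies on.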
Pipeline Position
Raw Content ──▶ Ingest ──▶ Canonical ──▶ Perceptual/Semantic ──▶ Index ──▶ Match
                  ↑
            (this crate)
Quick Start
use ;
use chrono::Utc;
// Configure (use defaults for quick start)
let config = IngestConfig::default();
// Create a raw record
let record = RawIngestRecord ;
// Ingest and get canonical record
let canonical = ingest.unwrap;
assert_eq!;
// Whitespace normalized: "Quarterly report: revenue up 15% YoY."
Core Design Principles
- Fail Fast: Validation happens before any transformation
- Deterministic: Same input always produces same output (critical for fingerprinting)
- Observable: Every operation is logged with structured tracing
- Safe: Control characters stripped, sizes bounded, UTF-8 validated
Architecture
The ingest pipeline follows a strict data flow:
- Payload Requirements Check: Verify source mandates are met
- Raw Size Validation: Enforce max_payload_bytes limit
- Metadata Normalization: Apply defaults, validate policies, sanitize
- Payload Normalization: Decode UTF-8, collapse whitespace, preserve binary
- Normalized Size Validation: Enforce max_normalized_bytes limit
- Canonical Record Construction: Build deterministic output
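The ordering of these stages can be sketched for the text-only case as follows. Limits, SketchError, and ingest_text are hypothetical names chosen for this illustration; only the two limit names max_payload_bytes and max_normalized_bytes come from the list above, and metadata handling is left out entirely.

```rust
/// Hypothetical error type for this sketch; the crate's IngestError may differ.
#[derive(Debug, PartialEq)]
enum SketchError {
    MissingPayload,
    PayloadTooLarge { len: usize, max: usize },
    InvalidUtf8,
    NormalizedTooLarge { len: usize, max: usize },
}

/// Hypothetical subset of the configuration: just the two size limits.
struct Limits {
    max_payload_bytes: usize,
    max_normalized_bytes: usize,
}

fn ingest_text(raw: Option<&[u8]>, limits: &Limits) -> Result<String, SketchError> {
    // 1. Payload requirements check: this sketch always requires a payload.
    let bytes = raw.ok_or(SketchError::MissingPayload)?;

    // 2. Raw size validation, before any transformation (fail fast).
    if bytes.len() > limits.max_payload_bytes {
        return Err(SketchError::PayloadTooLarge {
            len: bytes.len(),
            max: limits.max_payload_bytes,
        });
    }

    // 3. Metadata normalization is omitted from this sketch.

    // 4. Payload normalization: decode UTF-8, then collapse whitespace.
    let text = std::str::from_utf8(bytes).map_err(|_| SketchError::InvalidUtf8)?;
    let normalized = text.split_whitespace().collect::<Vec<_>>().join(" ");

    // 5. Normalized size validation.
    if normalized.len() > limits.max_normalized_bytes {
        return Err(SketchError::NormalizedTooLarge {
            len: normalized.len(),
            max: limits.max_normalized_bytes,
        });
    }

    // 6. Canonical record construction: here, just the normalized text.
    Ok(normalized)
}
```

Checking the raw size first means an oversized payload is rejected without paying for UTF-8 decoding or allocation.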
Module Structure
- config: Configuration types (IngestConfig, MetadataPolicy)
- error: Error types (IngestError)
- types: Data model (RawIngestRecord, CanonicalIngestRecord, etc.)
- metadata: Metadata normalization and validation logic
- payload: Payload validation and transformation utilities
Error Handling
All errors are typed via IngestError for precise handling:
use ;
match ingest
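As a rough illustration of what typed errors buy the caller, the sketch below matches on each failure case separately. DemoIngestError and its variants are invented for this example and are not the crate's real IngestError definition; report is likewise a hypothetical helper.

```rust
/// Hypothetical stand-in for IngestError; the real variants may differ.
#[derive(Debug, PartialEq)]
enum DemoIngestError {
    PayloadTooLarge { len: usize, max: usize },
    InvalidMetadata(String),
    InvalidUtf8,
}

/// Turns an ingest outcome into a log-friendly message, handling each
/// error variant explicitly instead of inspecting an opaque string.
fn report(result: Result<String, DemoIngestError>) -> String {
    match result {
        Ok(text) => format!("ingested {} bytes", text.len()),
        Err(DemoIngestError::PayloadTooLarge { len, max }) => {
            format!("rejected: payload of {} bytes exceeds limit of {}", len, max)
        }
        Err(DemoIngestError::InvalidMetadata(reason)) => {
            format!("rejected: invalid metadata ({})", reason)
        }
        Err(DemoIngestError::InvalidUtf8) => {
            "rejected: payload is not valid UTF-8".to_string()
        }
    }
}
```

Because the match is exhaustive, adding a new variant to the enum makes the compiler flag every call site that has not yet decided how to handle it.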
Configuration
For production use, configure size limits and policies:
use ;
use uuid::Uuid;
let config = IngestConfig ;
// Validate at startup
config.validate.expect;
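A startup validation step of this kind could look roughly like the following. DemoConfig is a hypothetical stand-in for IngestConfig with only the two size limits, and both rules it enforces (no zero limits, normalized limit not above the raw limit) are assumptions about what a sensible check might cover, not the crate's documented behavior.

```rust
/// Hypothetical stand-in for IngestConfig; None means "no limit".
#[derive(Debug, Clone, Default, PartialEq)]
struct DemoConfig {
    max_payload_bytes: Option<usize>,
    max_normalized_bytes: Option<usize>,
}

impl DemoConfig {
    /// Rejects configurations that could never accept a payload or whose
    /// limits contradict each other (assumed rules, for illustration only).
    fn validate(&self) -> Result<(), String> {
        if self.max_payload_bytes == Some(0) {
            return Err("max_payload_bytes must be greater than zero".to_string());
        }
        if self.max_normalized_bytes == Some(0) {
            return Err("max_normalized_bytes must be greater than zero".to_string());
        }
        if let (Some(raw), Some(normalized)) =
            (self.max_payload_bytes, self.max_normalized_bytes)
        {
            if normalized > raw {
                return Err(format!(
                    "max_normalized_bytes ({}) exceeds max_payload_bytes ({})",
                    normalized, raw
                ));
            }
        }
        Ok(())
    }
}
```

Running such a check once at startup surfaces a misconfiguration immediately rather than on the first rejected document.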
Performance
- Base overhead: ~5-15μs for small payloads
- Text normalization: O(n) where n = text length
- Memory: Allocates new String during normalization
- Thread safety: ingest() is pure and safe for parallel processing
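The thread-safety point can be illustrated with a pure function fanned out over scoped threads from the standard library. Here normalize is a hypothetical stand-in for ingest(), and normalize_parallel is an assumed helper, not something the crate provides.

```rust
use std::thread;

/// Hypothetical stand-in for ingest(): pure, so it needs no locking.
fn normalize(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Splits the inputs into contiguous chunks, processes each chunk on its
/// own scoped thread, and returns the results in the original order.
fn normalize_parallel(inputs: &[&str], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so `chunks` never receives zero.
    let chunk_len = ((inputs.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker first, then join them in spawn order.
        let handles: Vec<_> = inputs
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|s| normalize(s)).collect::<Vec<String>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Since no state is shared between calls, the parallel result matches what a sequential loop over the same inputs would produce.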
Examples
See the examples/ directory for complete working examples:
- ingest_demo.rs: Basic text ingestion
- batch_ingest.rs: Processing multiple records
- size_limit_demo.rs: Size limit enforcement demonstration
See Also
- Crate documentation for comprehensive guides
- config module for configuration details
- types module for data structure definitions