UCFP Ingest Layer - Content Ingestion and Validation
This crate provides the entry point to the Universal Content Fingerprinting (UCFP) pipeline, transforming raw content and metadata into clean, deterministic records suitable for downstream processing.
§Overview
The ingest crate is responsible for:
- Validation: Enforcing metadata policies, size limits, and business rules
- Normalization: Collapsing whitespace, stripping control characters, sanitizing inputs
- ID Generation: Deriving stable document IDs using UUIDv5 when not explicitly provided
- Multi-modal Support: Handling text, binary, and structured payloads uniformly
- Observability: Structured logging via tracing for production debugging
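The normalization step above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function name normalize_text and its strip_control_chars parameter are assumptions chosen to mirror the config field shown later on this page.

```rust
/// Hypothetical helper (not part of the ingest crate): collapses every run of
/// whitespace into one space, trims both ends, and optionally drops control
/// characters that are not themselves whitespace.
fn normalize_text(input: &str, strip_control_chars: bool) -> String {
    let mut out = String::with_capacity(input.len());
    let mut pending_space = false;
    for c in input.chars() {
        if c.is_whitespace() {
            // Remember the gap; a space is emitted only if more text follows,
            // which trims leading and trailing whitespace for free.
            pending_space = !out.is_empty();
        } else if strip_control_chars && c.is_control() {
            // Non-whitespace control characters (NUL, BEL, ...) are dropped.
            continue;
        } else {
            if pending_space {
                out.push(' ');
                pending_space = false;
            }
            out.push(c);
        }
    }
    out
}
```

Checking whitespace before control characters matters here: tabs and newlines are both, and treating them as separators avoids gluing adjacent words together.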
§Pipeline Position
Raw Content ──▶ Ingest ──▶ Canonical ──▶ Perceptual/Semantic ──▶ Index ──▶ Match
                  ↑
             (this crate)
§Quick Start
use ingest::{
    ingest, IngestConfig, RawIngestRecord,
    IngestSource, IngestMetadata, IngestPayload
};
use chrono::Utc;

// Configure (use defaults for quick start)
let config = IngestConfig::default();

// Create a raw record
let record = RawIngestRecord {
    id: "doc-001".to_string(),
    source: IngestSource::RawText,
    metadata: IngestMetadata {
        tenant_id: Some("acme-corp".to_string()),
        doc_id: Some("report-q4-2024".to_string()),
        received_at: Some(Utc::now()),
        original_source: None,
        attributes: None,
    },
    payload: Some(IngestPayload::Text(
        " Quarterly report: revenue up 15% YoY. ".to_string()
    )),
};

// Ingest and get canonical record
let canonical = ingest(record, &config).unwrap();
assert_eq!(canonical.tenant_id, "acme-corp");
// Whitespace normalized: "Quarterly report: revenue up 15% YoY."
§Core Design Principles
- Fail Fast: Validation happens before any transformation
- Deterministic: Same input always produces same output (critical for fingerprinting)
- Observable: Every operation is logged with structured tracing
- Safe: Control characters stripped, sizes bounded, UTF-8 validated
§Architecture
The ingest pipeline follows a strict data flow:
- Payload Requirements Check: Verify source mandates are met
- Raw Size Validation: Enforce max_payload_bytes limit
- Metadata Normalization: Apply defaults, validate policies, sanitize
- Payload Normalization: Decode UTF-8, collapse whitespace, preserve binary
- Normalized Size Validation: Enforce max_normalized_bytes limit
- Canonical Record Construction: Build deterministic output
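The ordering of these steps can be sketched as a single function that returns on the first failure. Everything here is a simplified stand-in: SketchError, SketchConfig, and ingest_sketch are hypothetical names, only text payloads are handled, and metadata normalization is left out.

```rust
/// Hypothetical error type for this sketch; the crate's real type is IngestError.
#[derive(Debug, PartialEq)]
enum SketchError {
    MissingPayload,
    PayloadTooLarge(usize),
    InvalidUtf8,
    NormalizedTooLarge(usize),
}

/// Hypothetical config carrying only the two size limits named above.
struct SketchConfig {
    max_payload_bytes: Option<usize>,
    max_normalized_bytes: Option<usize>,
}

/// Runs the checks in the documented order and stops at the first error.
fn ingest_sketch(raw: Option<&[u8]>, cfg: &SketchConfig) -> Result<String, SketchError> {
    // 1. Payload requirements: this sketch assumes the source always needs one.
    let bytes = raw.ok_or(SketchError::MissingPayload)?;

    // 2. Raw size validation happens before any decoding work.
    if let Some(max) = cfg.max_payload_bytes {
        if bytes.len() > max {
            return Err(SketchError::PayloadTooLarge(bytes.len()));
        }
    }

    // 3. Metadata normalization is omitted in this sketch.

    // 4. Payload normalization: decode UTF-8, then collapse whitespace.
    let text = std::str::from_utf8(bytes).map_err(|_| SketchError::InvalidUtf8)?;
    let normalized = text.split_whitespace().collect::<Vec<_>>().join(" ");

    // 5. Normalized size validation on the transformed text.
    if let Some(max) = cfg.max_normalized_bytes {
        if normalized.len() > max {
            return Err(SketchError::NormalizedTooLarge(normalized.len()));
        }
    }

    // 6. The real pipeline would build a canonical record here.
    Ok(normalized)
}
```

Checking the raw size before decoding keeps an oversized input from costing a UTF-8 pass, which is one reading of the fail-fast principle above.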
§Module Structure
- config: Configuration types (IngestConfig, MetadataPolicy)
- error: Error types (IngestError)
- types: Data model (RawIngestRecord, CanonicalIngestRecord, etc.)
- metadata: Metadata normalization and validation logic
- payload: Payload validation and transformation utilities
§Error Handling
All errors are typed via IngestError for precise handling:
use ingest::{ingest, IngestError};

// `record`, `config`, and `process` are assumed to be defined by the caller.
match ingest(record, &config) {
    Ok(canonical) => process(canonical),
    Err(IngestError::PayloadTooLarge(msg)) => {
        eprintln!("Content too large: {}", msg);
    }
    Err(IngestError::InvalidUtf8(msg)) => {
        eprintln!("Invalid encoding: {}", msg);
    }
    Err(e) => {
        eprintln!("Ingest failed: {}", e);
    }
}
§Configuration
For production use, configure size limits and policies:
use ingest::{IngestConfig, MetadataPolicy, RequiredField};
use uuid::Uuid;

let config = IngestConfig {
    version: 1,
    default_tenant_id: "default".to_string(),
    doc_id_namespace: Uuid::new_v5(&Uuid::NAMESPACE_DNS, b"myapp.example.com"),
    strip_control_chars: true,
    metadata_policy: MetadataPolicy {
        required_fields: vec![
            RequiredField::TenantId,
            RequiredField::DocId,
        ],
        max_attribute_bytes: Some(1024 * 1024), // 1 MB
        reject_future_timestamps: true,
    },
    max_payload_bytes: Some(50 * 1024 * 1024),    // 50 MB raw
    max_normalized_bytes: Some(10 * 1024 * 1024), // 10 MB normalized
};

// Validate at startup
config.validate().expect("Invalid configuration");
§Performance
- Base overhead: ~5-15μs for small payloads
- Text normalization: O(n) where n = text length
- Memory: Allocates new String during normalization
- Thread safety: ingest() is pure and safe for parallel processing
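Because a pure function shares no mutable state, a batch can be split across threads without locking. The sketch below uses scoped threads from the standard library. The names normalize and normalize_parallel are hypothetical stand-ins for calling ingest() on each record, and the chunking strategy is an assumption, not something the crate prescribes.

```rust
use std::thread;

/// Stand-in for a pure per-record operation such as ingest().
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Splits `inputs` into roughly equal chunks, processes each chunk on its own
/// scoped thread, and returns the results in the original input order.
fn normalize_parallel(inputs: &[&str], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_size = ((inputs.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn every worker first so the chunks run concurrently.
        let handles: Vec<_> = inputs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|t| normalize(t)).collect::<Vec<_>>())
            })
            .collect();

        // Joining in spawn order keeps the output aligned with the input.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Since each worker only reads its slice and returns owned strings, the output is the same as a sequential loop would give, which is what the determinism principle requires.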
§Examples
See the examples/ directory for complete working examples:
- ingest_demo.rs: Basic text ingestion
- batch_ingest.rs: Processing multiple records
- size_limit_demo.rs: Size limit enforcement demonstration
§See Also
- Crate documentation for comprehensive guides
- config module for configuration details
- types module for data structure definitions
Structs§
- CanonicalIngestRecord: Normalized record produced by ingest.
- IngestConfig: Runtime configuration for ingest behavior.
- IngestMetadata: Metadata associated with an ingest request.
- MetadataPolicy: Controls which metadata fields must be present and how optional blobs are constrained.
- RawIngestRecord: The inbound record for ingest.
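As a rough illustration of how a policy like MetadataPolicy might be enforced against request metadata, the sketch below checks required fields and rejects future timestamps. All types here (Required, Policy, Meta, PolicyError) and the check_policy function are hypothetical stand-ins. It uses std::time::SystemTime rather than chrono to stay dependency-free, and it assumes a blank value counts as missing.

```rust
use std::time::SystemTime;

/// Hypothetical stand-in for RequiredField.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Required {
    TenantId,
    DocId,
}

/// Hypothetical stand-in for MetadataPolicy (attribute size limit omitted).
struct Policy {
    required_fields: Vec<Required>,
    reject_future_timestamps: bool,
}

/// Hypothetical stand-in for IngestMetadata.
struct Meta {
    tenant_id: Option<String>,
    doc_id: Option<String>,
    received_at: Option<SystemTime>,
}

#[derive(Debug, PartialEq)]
enum PolicyError {
    Missing(Required),
    FutureTimestamp,
}

/// Validates `meta` against `policy`. `now` is passed in so the check stays
/// deterministic and testable.
fn check_policy(meta: &Meta, policy: &Policy, now: SystemTime) -> Result<(), PolicyError> {
    for field in &policy.required_fields {
        let value = match field {
            Required::TenantId => meta.tenant_id.as_deref(),
            Required::DocId => meta.doc_id.as_deref(),
        };
        // Assumption: an absent or whitespace-only value counts as missing.
        if value.map_or(true, |v| v.trim().is_empty()) {
            return Err(PolicyError::Missing(*field));
        }
    }
    if policy.reject_future_timestamps {
        if let Some(ts) = meta.received_at {
            if ts > now {
                return Err(PolicyError::FutureTimestamp);
            }
        }
    }
    Ok(())
}
```

Passing the clock in as an argument, instead of reading it inside the function, keeps the check reproducible for the same inputs.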
Enums§
- CanonicalPayload: Normalized payload ready for downstream stages.
- ConfigError: Errors that can occur when validating an IngestConfig.
- IngestError: Errors that can occur during ingest normalization and validation.
- IngestPayload: Raw payload content provided during ingest.
- IngestSource: Source kinds accepted at ingest time.
- RequiredField: Metadata identifiers that can be enforced via MetadataPolicy.
Functions§
- ingest: Ingests a raw record and produces a canonical, normalized record.
- normalize_payload: Normalizes text by collapsing repeated whitespace and trimming edges.
- normalize_payload_option: Normalizes the payload based on its type.
- payload_kind: Returns a string representation of the payload kind for logging.
- payload_length: Returns the length of the payload for logging.
- validate_payload_requirements: Checks if the source requires a payload.