content-ingest 0.1.0

Content ingestion, validation, and normalization pipeline for text and binary data
Documentation

UCFP Ingest Layer - Content Ingestion and Validation

This crate provides the entry point to the Universal Content Fingerprinting (UCFP) pipeline, transforming raw content and metadata into clean, deterministic records suitable for downstream processing.

Overview

The ingest crate is responsible for:

  • Validation: Enforcing metadata policies, size limits, and business rules
  • Normalization: Collapsing whitespace, stripping control characters, sanitizing inputs
  • ID Generation: Deriving stable document IDs using UUIDv5 when not explicitly provided
  • Multi-modal Support: Handling text, binary, and structured payloads uniformly
  • Observability: Structured logging via tracing for production debugging
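
The normalization bullet above can be illustrated with a small standard-library sketch. It is not the crate's implementation: normalize_text is a hypothetical helper, and treating tabs and newlines as whitespace (rather than as control characters to strip) is an assumption made for this example.

```rust
/// Hypothetical helper, not part of the crate's documented API.
/// In a single pass it trims leading and trailing whitespace, collapses every
/// run of whitespace into one space, and (optionally) drops control characters.
pub fn normalize_text(input: &str, strip_control_chars: bool) -> String {
    let mut out = String::with_capacity(input.len());
    let mut pending_space = false;

    for ch in input.chars() {
        // Whitespace is checked first, so '\t' and '\n' count as whitespace
        // here even though they are also control characters.
        if ch.is_whitespace() {
            // Remember the gap; it is only emitted if more text follows,
            // which is what trims leading and trailing whitespace.
            pending_space = !out.is_empty();
            continue;
        }
        if strip_control_chars && ch.is_control() {
            continue;
        }
        if pending_space {
            out.push(' ');
            pending_space = false;
        }
        out.push(ch);
    }
    out
}
```

Because the function depends only on its arguments, calling it twice with the same input yields the same string, which is the property the fingerprinting stages rely on.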

Pipeline Position

Raw Content ──▶ Ingest ──▶ Canonical ──▶ Perceptual/Semantic ──▶ Index ──▶ Match
                   ↑
                (this crate)

Quick Start

use ingest::{
    ingest, IngestConfig, RawIngestRecord,
    IngestSource, IngestMetadata, IngestPayload
};
use chrono::Utc;

// Configure (use defaults for quick start)
let config = IngestConfig::default();

// Create a raw record
let record = RawIngestRecord {
    id: "doc-001".to_string(),
    source: IngestSource::RawText,
    metadata: IngestMetadata {
        tenant_id: Some("acme-corp".to_string()),
        doc_id: Some("report-q4-2024".to_string()),
        received_at: Some(Utc::now()),
        original_source: None,
        attributes: None,
    },
    payload: Some(IngestPayload::Text(
        "  Quarterly report: revenue up 15% YoY.   ".to_string()
    )),
};

// Ingest and get canonical record
let canonical = ingest(record, &config).unwrap();

assert_eq!(canonical.tenant_id, "acme-corp");
// Whitespace normalized: "Quarterly report: revenue up 15% YoY."

Core Design Principles

  1. Fail Fast: Validation happens before any transformation
  2. Deterministic: Same input always produces same output (critical for fingerprinting)
  3. Observable: Every operation is logged with structured tracing
  4. Safe: Control characters stripped, sizes bounded, UTF-8 validated

Architecture

The ingest pipeline follows a strict data flow:

  1. Payload Requirements Check: Verify source mandates are met
  2. Raw Size Validation: Enforce max_payload_bytes limit
  3. Metadata Normalization: Apply defaults, validate policies, sanitize
  4. Payload Normalization: Decode UTF-8, collapse whitespace, preserve binary
  5. Normalized Size Validation: Enforce max_normalized_bytes limit
  6. Canonical Record Construction: Build deterministic output
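
The six steps above can be sketched as a single function. This is a simplified model under stated assumptions, not the crate's code: SketchError, SketchConfig, SketchCanonical, and ingest_sketch are hypothetical stand-ins, the payload is always required, and metadata is reduced to a tenant ID.

```rust
#[derive(Debug, PartialEq)]
pub enum SketchError {
    MissingPayload,
    PayloadTooLarge(String),
    InvalidUtf8(String),
}

/// Hypothetical, reduced configuration: only the fields this sketch needs.
pub struct SketchConfig {
    pub default_tenant_id: String,
    pub max_payload_bytes: Option<usize>,
    pub max_normalized_bytes: Option<usize>,
}

#[derive(Debug, PartialEq)]
pub struct SketchCanonical {
    pub tenant_id: String,
    pub text: String,
}

pub fn ingest_sketch(
    tenant_id: Option<&str>,
    payload: Option<&[u8]>,
    cfg: &SketchConfig,
) -> Result<SketchCanonical, SketchError> {
    // 1. Payload requirements check (assumption: this source needs a payload).
    let raw = payload.ok_or(SketchError::MissingPayload)?;

    // 2. Raw size validation, before any transformation work is done.
    if let Some(max) = cfg.max_payload_bytes {
        if raw.len() > max {
            return Err(SketchError::PayloadTooLarge(format!(
                "raw payload is {} bytes, limit is {}",
                raw.len(),
                max
            )));
        }
    }

    // 3. Metadata normalization: trim, then fall back to the default tenant.
    let tenant_id = match tenant_id.map(str::trim) {
        Some(t) if !t.is_empty() => t.to_string(),
        _ => cfg.default_tenant_id.clone(),
    };

    // 4. Payload normalization: decode UTF-8, then collapse whitespace.
    let decoded = std::str::from_utf8(raw)
        .map_err(|e| SketchError::InvalidUtf8(e.to_string()))?;
    let text = decoded.split_whitespace().collect::<Vec<_>>().join(" ");

    // 5. Normalized size validation.
    if let Some(max) = cfg.max_normalized_bytes {
        if text.len() > max {
            return Err(SketchError::PayloadTooLarge(format!(
                "normalized payload is {} bytes, limit is {}",
                text.len(),
                max
            )));
        }
    }

    // 6. Canonical record construction.
    Ok(SketchCanonical { tenant_id, text })
}
```

Putting both cheap checks (steps 1 and 2) ahead of decoding is what makes the flow fail fast: an oversized or missing payload is rejected before any allocation for normalized text happens.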

Module Structure

  • config: Configuration types (IngestConfig, MetadataPolicy)
  • error: Error types (IngestError)
  • types: Data model (RawIngestRecord, CanonicalIngestRecord, etc.)
  • metadata: Metadata normalization and validation logic
  • payload: Payload validation and transformation utilities

Error Handling

All errors are typed via IngestError for precise handling:

use ingest::{ingest, IngestError};

// `record` and `config` are built as in Quick Start; `process` is your own handler.
match ingest(record, &config) {
    Ok(canonical) => process(canonical),
    Err(IngestError::PayloadTooLarge(msg)) => {
        eprintln!("Content too large: {}", msg);
    }
    Err(IngestError::InvalidUtf8(msg)) => {
        eprintln!("Invalid encoding: {}", msg);
    }
    Err(e) => {
        eprintln!("Ingest failed: {}", e);
    }
}
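
For callers that prefer to propagate errors rather than match on them, a typed error that implements std::error::Error works with the ? operator. The sketch below uses a stand-in enum, SketchIngestError, because the full definition of IngestError is not shown here; only the two variant names come from the match above, and the decode and handle functions are hypothetical.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in for the crate's error type. The String payloads are inferred
/// from the match arms above; any other variants are unknown and omitted.
#[derive(Debug)]
pub enum SketchIngestError {
    PayloadTooLarge(String),
    InvalidUtf8(String),
}

impl fmt::Display for SketchIngestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::PayloadTooLarge(msg) => write!(f, "payload too large: {msg}"),
            Self::InvalidUtf8(msg) => write!(f, "invalid UTF-8: {msg}"),
        }
    }
}

impl Error for SketchIngestError {}

/// Hypothetical decoding step that reports failures through the typed error.
pub fn decode(raw: &[u8], max_bytes: usize) -> Result<String, SketchIngestError> {
    if raw.len() > max_bytes {
        return Err(SketchIngestError::PayloadTooLarge(format!(
            "{} bytes exceeds limit of {}",
            raw.len(),
            max_bytes
        )));
    }
    String::from_utf8(raw.to_vec())
        .map_err(|e| SketchIngestError::InvalidUtf8(e.to_string()))
}

/// Hypothetical caller: `?` converts the typed error into Box<dyn Error>.
pub fn handle(raw: &[u8]) -> Result<usize, Box<dyn Error>> {
    let text = decode(raw, 1024)?;
    Ok(text.len())
}
```

Matching stays available at the boundary where a specific variant matters, while intermediate layers can simply forward the error.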

Configuration

For production use, configure size limits and policies:

use ingest::{IngestConfig, MetadataPolicy, RequiredField};
use uuid::Uuid; // Uuid::new_v5 requires the uuid crate's "v5" feature

let config = IngestConfig {
    version: 1,
    default_tenant_id: "default".to_string(),
    doc_id_namespace: Uuid::new_v5(&Uuid::NAMESPACE_DNS, b"myapp.example.com"),
    strip_control_chars: true,
    metadata_policy: MetadataPolicy {
        required_fields: vec![
            RequiredField::TenantId,
            RequiredField::DocId,
        ],
        max_attribute_bytes: Some(1024 * 1024), // 1 MiB
        reject_future_timestamps: true,
    },
    max_payload_bytes: Some(50 * 1024 * 1024),      // 50 MiB raw
    max_normalized_bytes: Some(10 * 1024 * 1024),   // 10 MiB normalized
};

// Validate at startup
config.validate().expect("Invalid configuration");
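
What validate() actually checks is not listed here. As an illustration only, a startup check over the size limits might look like the sketch below. SketchLimits and every rule in it are assumptions; in particular, requiring the normalized limit to be no larger than the raw limit is one plausible rule, not a documented one.

```rust
/// Hypothetical, reduced configuration used only for this sketch.
#[derive(Debug, Clone)]
pub struct SketchLimits {
    pub default_tenant_id: String,
    pub max_payload_bytes: Option<usize>,
    pub max_normalized_bytes: Option<usize>,
}

impl SketchLimits {
    /// Assumed rules; the real IngestConfig::validate may enforce others.
    pub fn validate(&self) -> Result<(), String> {
        if self.default_tenant_id.trim().is_empty() {
            return Err("default_tenant_id must not be empty".to_string());
        }
        if self.max_payload_bytes == Some(0) {
            return Err("max_payload_bytes must be greater than zero".to_string());
        }
        // With both limits set, a normalized limit above the raw limit
        // could never be reached by whitespace-collapsing normalization.
        if let (Some(raw), Some(normalized)) =
            (self.max_payload_bytes, self.max_normalized_bytes)
        {
            if normalized > raw {
                return Err(format!(
                    "max_normalized_bytes ({normalized}) exceeds max_payload_bytes ({raw})"
                ));
            }
        }
        Ok(())
    }
}
```

Running such a check once at startup turns a misconfiguration into an immediate failure instead of a stream of per-record rejections.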

Performance

  • Base overhead: ~5-15μs for small payloads
  • Text normalization: O(n) where n = text length
  • Memory: Allocates a new String during normalization
  • Thread safety: ingest() takes its configuration by shared reference and, apart from emitting tracing events, has no side effects, so it is safe to call in parallel
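
The thread-safety point can be sketched with scoped threads from the standard library. The normalize function below is a stand-in for a per-record call such as ingest(), and normalize_batch is a hypothetical helper; neither is part of the crate.

```rust
use std::thread;

/// Stand-in for a side-effect-free per-record function such as ingest().
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Splits the batch into at most `workers` contiguous chunks, processes each
/// chunk on its own scoped thread, and returns results in input order.
pub fn normalize_batch(inputs: &[String], workers: usize) -> Vec<String> {
    if inputs.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every input lands in some chunk.
    let chunk_size = (inputs.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = inputs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|s| normalize(s)).collect::<Vec<String>>()
                })
            })
            .collect();

        // Join in spawn order, which preserves the original record order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because each record is processed independently and no state is shared, the worker count changes only throughput, not the output.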

Examples

See the examples/ directory for complete working examples:

  • ingest_demo.rs: Basic text ingestion
  • batch_ingest.rs: Processing multiple records
  • size_limit_demo.rs: Size limit enforcement demonstration

See Also

  • Crate documentation for comprehensive guides
  • config module for configuration details
  • types module for data structure definitions