Crate ingest

UCFP Ingest Layer - Content Ingestion and Validation

This crate provides the entry point to the Universal Content Fingerprinting (UCFP) pipeline, transforming raw content and metadata into clean, deterministic records suitable for downstream processing.

§Overview

The ingest crate is responsible for:

  • Validation: Enforcing metadata policies, size limits, and business rules
  • Normalization: Collapsing whitespace, stripping control characters, sanitizing inputs
  • ID Generation: Deriving stable document IDs using UUIDv5 when not explicitly provided
  • Multi-modal Support: Handling text, binary, and structured payloads uniformly
  • Observability: Structured logging via tracing for production debugging
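
As a rough illustration of the normalization bullet above, the following standard-library sketch collapses whitespace runs, trims both ends, and optionally drops control characters. The function name `normalize_text` and its exact rules are assumptions made for this example; the crate's own `normalize_payload` may behave differently in edge cases.

```rust
/// Hypothetical helper (not part of the crate's API): collapses each run of
/// whitespace into a single space and trims leading and trailing whitespace.
/// When `strip_control_chars` is true, control characters that are not
/// whitespace (for example NUL or BEL) are dropped.
fn normalize_text(input: &str, strip_control_chars: bool) -> String {
    let mut out = String::with_capacity(input.len());
    let mut pending_space = false;
    for ch in input.chars() {
        if ch.is_whitespace() {
            // Remember the gap; a space is emitted only if more text follows
            // and some text has already been written.
            pending_space = !out.is_empty();
            continue;
        }
        if strip_control_chars && ch.is_control() {
            continue;
        }
        if pending_space {
            out.push(' ');
            pending_space = false;
        }
        out.push(ch);
    }
    out
}
```

Tabs and newlines count as whitespace here, so they collapse into spaces rather than being stripped as control characters.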

§Pipeline Position

Raw Content ──▶ Ingest ──▶ Canonical ──▶ Perceptual/Semantic ──▶ Index ──▶ Match
                   ↑
                (this crate)

§Quick Start

use ingest::{
    ingest, IngestConfig, RawIngestRecord,
    IngestSource, IngestMetadata, IngestPayload
};
use chrono::Utc;

// Configure (use defaults for quick start)
let config = IngestConfig::default();

// Create a raw record
let record = RawIngestRecord {
    id: "doc-001".to_string(),
    source: IngestSource::RawText,
    metadata: IngestMetadata {
        tenant_id: Some("acme-corp".to_string()),
        doc_id: Some("report-q4-2024".to_string()),
        received_at: Some(Utc::now()),
        original_source: None,
        attributes: None,
    },
    payload: Some(IngestPayload::Text(
        "  Quarterly report: revenue up 15% YoY.   ".to_string()
    )),
};

// Ingest and get canonical record
let canonical = ingest(record, &config).unwrap();

assert_eq!(canonical.tenant_id, "acme-corp");
// Whitespace normalized: "Quarterly report: revenue up 15% YoY."

§Core Design Principles

  1. Fail Fast: Validation happens before any transformation
  2. Deterministic: The same input always produces the same output (critical for fingerprinting)
  3. Observable: Every operation is logged with structured tracing
  4. Safe: Control characters stripped, sizes bounded, UTF-8 validated

§Architecture

The ingest pipeline follows a strict data flow:

  1. Payload Requirements Check: Verify source mandates are met
  2. Raw Size Validation: Enforce max_payload_bytes limit
  3. Metadata Normalization: Apply defaults, validate policies, sanitize
  4. Payload Normalization: Decode UTF-8, collapse whitespace, preserve binary
  5. Normalized Size Validation: Enforce max_normalized_bytes limit
  6. Canonical Record Construction: Build deterministic output
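
The ordering of the six stages above can be sketched with the standard library alone. Every name below (`SketchError`, `SketchConfig`, `SketchCanonical`, `ingest_sketch`) is hypothetical and much simpler than the crate's real types; the sketch also assumes every source requires a payload, which the real stage 1 decides per source.

```rust
#[derive(Debug, PartialEq)]
enum SketchError {
    MissingPayload,
    PayloadTooLarge(String),
    InvalidUtf8(String),
}

// Hypothetical, reduced configuration: two size limits and a tenant default.
struct SketchConfig {
    max_payload_bytes: Option<usize>,
    max_normalized_bytes: Option<usize>,
    default_tenant_id: String,
}

#[derive(Debug, PartialEq)]
struct SketchCanonical {
    tenant_id: String,
    text: String,
}

fn ingest_sketch(
    tenant_id: Option<&str>,
    payload: Option<&[u8]>,
    cfg: &SketchConfig,
) -> Result<SketchCanonical, SketchError> {
    // 1. Payload requirements check (assumption: a payload is always required).
    let raw = payload.ok_or(SketchError::MissingPayload)?;

    // 2. Raw size validation, before any decoding work is done.
    if let Some(max) = cfg.max_payload_bytes {
        if raw.len() > max {
            return Err(SketchError::PayloadTooLarge(format!(
                "raw payload is {} bytes, limit is {}", raw.len(), max
            )));
        }
    }

    // 3. Metadata normalization: trim, then fall back to the default tenant.
    let tenant_id = tenant_id
        .map(str::trim)
        .filter(|t| !t.is_empty())
        .unwrap_or(cfg.default_tenant_id.as_str())
        .to_string();

    // 4. Payload normalization: decode UTF-8, collapse whitespace.
    let text = std::str::from_utf8(raw)
        .map_err(|e| SketchError::InvalidUtf8(e.to_string()))?;
    let normalized = text.split_whitespace().collect::<Vec<_>>().join(" ");

    // 5. Normalized size validation.
    if let Some(max) = cfg.max_normalized_bytes {
        if normalized.len() > max {
            return Err(SketchError::PayloadTooLarge(format!(
                "normalized payload is {} bytes, limit is {}", normalized.len(), max
            )));
        }
    }

    // 6. Canonical record construction.
    Ok(SketchCanonical { tenant_id, text: normalized })
}
```

Because each check returns early, an oversized or missing payload is rejected before any decoding or allocation for normalization takes place, which is the fail-fast behavior the design principles describe.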

§Module Structure

  • config: Configuration types (IngestConfig, MetadataPolicy)
  • error: Error types (IngestError)
  • types: Data model (RawIngestRecord, CanonicalIngestRecord, etc.)
  • metadata: Metadata normalization and validation logic
  • payload: Payload validation and transformation utilities

§Error Handling

All errors are typed via IngestError for precise handling:

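use ingest::{ingest, IngestError};

// `record` and `config` are built as in Quick Start; `process` stands in for the caller's own handler.
match ingest(record, &config) {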
    Ok(canonical) => process(canonical),
    Err(IngestError::PayloadTooLarge(msg)) => {
        eprintln!("Content too large: {}", msg);
    }
    Err(IngestError::InvalidUtf8(msg)) => {
        eprintln!("Invalid encoding: {}", msg);
    }
    Err(e) => {
        eprintln!("Ingest failed: {}", e);
    }
}

§Configuration

For production use, configure size limits and policies:

use ingest::{IngestConfig, MetadataPolicy, RequiredField};
use uuid::Uuid;

let config = IngestConfig {
    version: 1,
    default_tenant_id: "default".to_string(),
    doc_id_namespace: Uuid::new_v5(&Uuid::NAMESPACE_DNS, b"myapp.example.com"),
    strip_control_chars: true,
    metadata_policy: MetadataPolicy {
        required_fields: vec![
            RequiredField::TenantId,
            RequiredField::DocId,
        ],
        max_attribute_bytes: Some(1024 * 1024), // 1 MB
        reject_future_timestamps: true,
    },
    max_payload_bytes: Some(50 * 1024 * 1024),      // 50 MB raw
    max_normalized_bytes: Some(10 * 1024 * 1024),   // 10 MB normalized
};

// Validate at startup
config.validate().expect("Invalid configuration");
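
This page does not list what `validate()` checks. As a sketch only, a startup check over the two size limits might look like the following. The `Limits` and `LimitError` types and both rules (no zero limits, normalized limit not above the raw limit) are assumptions for illustration, not the crate's documented behavior.

```rust
// Hypothetical error type for this sketch; the crate's own ConfigError may differ.
#[derive(Debug, PartialEq)]
enum LimitError {
    ZeroLimit(&'static str),
    NormalizedExceedsRaw { normalized: usize, raw: usize },
}

// Hypothetical subset of the configuration: only the two size limits.
struct Limits {
    max_payload_bytes: Option<usize>,
    max_normalized_bytes: Option<usize>,
}

impl Limits {
    /// Assumed rules: a limit of zero would reject every payload, and a
    /// normalized limit above the raw limit could never be reached if
    /// normalization only removes bytes.
    fn validate(&self) -> Result<(), LimitError> {
        if self.max_payload_bytes == Some(0) {
            return Err(LimitError::ZeroLimit("max_payload_bytes"));
        }
        if self.max_normalized_bytes == Some(0) {
            return Err(LimitError::ZeroLimit("max_normalized_bytes"));
        }
        if let (Some(raw), Some(normalized)) =
            (self.max_payload_bytes, self.max_normalized_bytes)
        {
            if normalized > raw {
                return Err(LimitError::NormalizedExceedsRaw { normalized, raw });
            }
        }
        Ok(())
    }
}
```

Running a check like this once at startup turns a misconfigured limit into an immediate failure instead of a surprise on the first large document.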

§Performance

  • Base overhead: ~5-15μs for small payloads
  • Text normalization: O(n) where n = text length
  • Memory: Allocates new String during normalization
  • Thread safety: ingest() is pure and safe for parallel processing
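
To illustrate the thread-safety point, the sketch below fans a batch out over scoped threads from the standard library. The `normalize` function is a stand-in for any pure per-record step, not the crate's `ingest`, and `normalize_parallel` is a hypothetical helper; a real deployment might prefer a thread pool.

```rust
use std::thread;

/// Stand-in for a pure per-record step (not the crate's `ingest`):
/// collapses whitespace and trims the ends.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Splits `inputs` into roughly equal chunks, processes each chunk on its own
/// scoped thread, and returns the results in the original input order.
fn normalize_parallel(inputs: &[String], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_size = ((inputs.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn all workers first so the chunks run concurrently.
        let handles: Vec<_> = inputs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|s| normalize(s)).collect::<Vec<String>>()
                })
            })
            .collect();

        // Join in spawn order, which preserves the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Since the per-record function takes no shared mutable state, the workers need no locks, and the output is the same regardless of how many threads are used.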

§Examples

See the examples/ directory for complete working examples:

  • ingest_demo.rs: Basic text ingestion
  • batch_ingest.rs: Processing multiple records
  • size_limit_demo.rs: Size limit enforcement demonstration

§See Also

  • Crate documentation for comprehensive guides
  • config module for configuration details
  • types module for data structure definitions

Structs§

CanonicalIngestRecord
Normalized record produced by ingest.
IngestConfig
Runtime configuration for ingest behavior.
IngestMetadata
Metadata associated with an ingest request.
MetadataPolicy
Controls which metadata fields must be present and how optional blobs are constrained.
RawIngestRecord
The inbound record for ingest.

Enums§

CanonicalPayload
Normalized payload ready for downstream stages.
ConfigError
Errors that can occur when validating an IngestConfig.
IngestError
Errors that can occur during ingest normalization and validation.
IngestPayload
Raw payload content provided during ingest.
IngestSource
Source kinds accepted at ingest time.
RequiredField
Metadata identifiers that can be enforced via MetadataPolicy.

Functions§

ingest
Ingests a raw record and produces a canonical, normalized record.
normalize_payload
Normalizes text by collapsing repeated whitespace and trimming edges.
normalize_payload_option
Normalizes the payload based on its type.
payload_kind
Returns a string representation of the payload kind for logging.
payload_length
Returns the length of the payload for logging.
validate_payload_requirements
Checks if the source requires a payload.