Crate orbok_extract

orbok-extract

Text extraction (RFC-005): pluggable extractors turn boundary-validated source files into normalized, location-tagged segments. Extraction output is derived data: cacheable, rebuildable, and never authoritative.
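The pluggable-extractor idea can be illustrated with a minimal sketch. Every name below (SketchExtractor, SketchSegment, SketchLocation, SketchRegistry, PlainText) is hypothetical and chosen so it cannot be mistaken for this crate's API; the real DocumentExtractor, ExtractedSegment, and ExtractorRegistry signatures are not shown on this page and may differ. The sketch also assumes a line-number location and whitespace collapsing as the normalization step, which is a stand-in rather than norm-v1.

```rust
use std::collections::HashMap;

/// Hypothetical location tag: a 1-based line number in the source text.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchLocation {
    pub line: usize,
}

/// Hypothetical segment: normalized text plus the place it came from.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchSegment {
    pub text: String,
    pub location: SketchLocation,
}

/// Hypothetical extractor interface (an assumption, not the crate's trait).
pub trait SketchExtractor {
    fn extract(&self, source: &str) -> Vec<SketchSegment>;
}

/// Plain-text extractor: one segment per non-empty line,
/// with runs of whitespace collapsed to a single space.
pub struct PlainText;

impl SketchExtractor for PlainText {
    fn extract(&self, source: &str) -> Vec<SketchSegment> {
        source
            .lines()
            .enumerate()
            .filter_map(|(i, line)| {
                let text = line.split_whitespace().collect::<Vec<_>>().join(" ");
                if text.is_empty() {
                    None
                } else {
                    Some(SketchSegment {
                        text,
                        location: SketchLocation { line: i + 1 },
                    })
                }
            })
            .collect()
    }
}

/// Hypothetical registry that picks an extractor by file extension.
#[derive(Default)]
pub struct SketchRegistry {
    extractors: HashMap<String, Box<dyn SketchExtractor>>,
}

impl SketchRegistry {
    /// Register an extractor for a case-insensitive extension.
    pub fn register(&mut self, ext: &str, extractor: Box<dyn SketchExtractor>) {
        self.extractors.insert(ext.to_ascii_lowercase(), extractor);
    }

    /// Returns `None` when no extractor is registered for `ext`.
    pub fn extract(&self, ext: &str, source: &str) -> Option<Vec<SketchSegment>> {
        self.extractors
            .get(&ext.to_ascii_lowercase())
            .map(|e| e.extract(source))
    }
}
```

Because the segments are a pure function of the source text, a caller can discard and recompute them at any time, which is the sense in which extraction output is derived data.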

RFC-044 hardening adds resource limits (ExtractLimits), structured warnings (ExtractWarning), panic isolation (extract_safely), and explicit location semantics (LocationKind). It also removes the orbok-db production dependency: the chunker now produces ExtractedChunk, and the pipeline layer maps it to ChunkSpec.
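As a rough illustration of how panic isolation, a resource limit, and a structured warning can fit together, here is a sketch built on the standard library's std::panic::catch_unwind. It is not the implementation of extract_safely, whose signature is not shown on this page. SketchLimits, SketchWarning, SketchError, run_isolated, and the max_output_bytes field are all hypothetical names introduced for the example.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical limit set; the real `ExtractLimits` fields are not shown here.
#[derive(Debug, Clone, Copy)]
pub struct SketchLimits {
    pub max_output_bytes: usize,
}

/// Hypothetical structured warning: data the caller can match on,
/// rather than a log line.
#[derive(Debug, Clone, PartialEq)]
pub enum SketchWarning {
    OutputTruncated { limit: usize },
}

/// Hypothetical error returned when the extractor panics.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    ExtractorPanicked(String),
}

/// Run `extractor`, converting a panic into an error and enforcing the
/// output-size limit. This only helps when the binary is built with
/// `panic = "unwind"`; with `panic = "abort"` the process still exits.
pub fn run_isolated<F>(
    limits: SketchLimits,
    extractor: F,
) -> Result<(String, Vec<SketchWarning>), SketchError>
where
    F: FnOnce() -> String,
{
    let mut text = catch_unwind(AssertUnwindSafe(extractor)).map_err(|payload| {
        // Panic payloads are usually `&'static str` or `String`.
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        SketchError::ExtractorPanicked(msg)
    })?;

    let mut warnings = Vec::new();
    if text.len() > limits.max_output_bytes {
        // Back up to a UTF-8 character boundary so `truncate` cannot panic.
        let mut cut = limits.max_output_bytes;
        while !text.is_char_boundary(cut) {
            cut -= 1;
        }
        text.truncate(cut);
        warnings.push(SketchWarning::OutputTruncated {
            limit: limits.max_output_bytes,
        });
    }
    Ok((text, warnings))
}
```

Returning warnings alongside the output, instead of failing, lets a pipeline keep partial results from an oversized document while still recording that a limit was hit. Whether the real crate truncates or rejects over-limit output is not stated on this page.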

Re-exports

pub use chunker::chunk;
pub use plugin::PluginManifest;
pub use plugin::PluginRegistry;
pub use registry::ExtractorRegistry;
pub use types::DocumentExtractor;
pub use types::ExtractContext;
pub use types::ExtractLimits;
pub use types::ExtractOutput;
pub use types::ExtractWarning;
pub use types::ExtractedChunk;
pub use types::ExtractedSegment;
pub use types::LocationKind;
pub use types::LocationQuality;
pub use types::SegmentKind;

Modules

chunker: Adaptive chunker (RFC-006 §7–§9; RFC-044 §14 boundary cleanup).
docx: DOCX text extractor (Microsoft Word 2007+; RFC-044 §16.4 resource limits).
html: HTML text extractor (RFC-005 §5; RFC-044 §16.3 resource limits).
normalize: Text normalization, version norm-v1 (RFC-005 §9).
pdf: PDF text extraction via lopdf (RFC-022 §6; RFC-044 §16.5 hardening).
plugin: Plugin extractor interface (RFC-028 §7).
registry: Extractor registry (RFC-005 §6; RFC-044 §11 panic isolation).
types: Extraction types (RFC-005 §6–§8; RFC-044 hardening).