Crate orbok_extract

orbok-extract

Text extraction (RFC-005): pluggable extractors turn boundary-validated source files into normalized, location-tagged segments. Extraction output is derived data: cacheable, rebuildable, never authoritative.
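As a rough illustration of the pluggable-extractor idea, the sketch below registers one extractor per file extension and returns line-tagged segments. Every name in it (`Segment`, `Extractor`, `PlainText`, `Registry`) is hypothetical and chosen for the example; the crate's real `DocumentExtractor`, `ExtractedSegment`, and `ExtractorRegistry` signatures are not shown on this page and may differ.

```rust
use std::collections::HashMap;

/// Hypothetical segment: normalized text tagged with a 1-based source line.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub line: usize,
    pub text: String,
}

/// Hypothetical extractor interface (an assumption, not the crate's trait).
pub trait Extractor {
    fn extract(&self, raw: &str) -> Vec<Segment>;
}

/// Plain-text extractor: one segment per non-empty line,
/// with runs of whitespace collapsed to single spaces.
pub struct PlainText;

impl Extractor for PlainText {
    fn extract(&self, raw: &str) -> Vec<Segment> {
        raw.lines()
            .enumerate()
            .filter_map(|(i, line)| {
                let text = line.split_whitespace().collect::<Vec<_>>().join(" ");
                if text.is_empty() {
                    None
                } else {
                    Some(Segment { line: i + 1, text })
                }
            })
            .collect()
    }
}

/// Hypothetical registry keyed by lower-cased file extension.
#[derive(Default)]
pub struct Registry {
    by_ext: HashMap<String, Box<dyn Extractor>>,
}

impl Registry {
    pub fn register(&mut self, ext: &str, extractor: Box<dyn Extractor>) {
        self.by_ext.insert(ext.to_ascii_lowercase(), extractor);
    }

    /// Returns `None` when no extractor is registered for `ext`.
    pub fn extract(&self, ext: &str, raw: &str) -> Option<Vec<Segment>> {
        self.by_ext
            .get(&ext.to_ascii_lowercase())
            .map(|e| e.extract(raw))
    }
}
```

Because the segments are computed purely from the input, a caller could cache or discard them freely and rebuild them from the source file later, which is the sense in which the output is derived rather than authoritative.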

RFC-044 hardening adds: resource limits (ExtractLimits), structured warnings (ExtractWarning), panic isolation (extract_safely), explicit location semantics (LocationKind), and removal of the orbok-db production dependency (the chunker now produces ExtractedChunk, which the pipeline layer maps to ChunkSpec).
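One way to picture how a resource limit and panic isolation can combine is the minimal sketch below, built on the standard library's `std::panic::catch_unwind`. It is not the crate's `extract_safely`: the `Limits` and `Warning` types, their fields, and the `run_isolated` function are all assumptions made for illustration.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical limit set; the real `ExtractLimits` fields are not shown here.
pub struct Limits {
    pub max_input_bytes: usize,
}

/// Hypothetical structured warning, standing in for `ExtractWarning`.
#[derive(Debug, PartialEq)]
pub enum Warning {
    InputTooLarge { bytes: usize, limit: usize },
    ExtractorPanicked(String),
}

/// Runs `extractor` on `input`, rejecting oversized input up front and
/// converting a panic inside the extractor into a `Warning`.
/// Note: `catch_unwind` only catches unwinding panics; it has no effect
/// when the binary is built with `panic = "abort"`.
pub fn run_isolated<F>(input: &[u8], limits: &Limits, extractor: F) -> Result<String, Warning>
where
    F: FnOnce(&[u8]) -> String,
{
    if input.len() > limits.max_input_bytes {
        return Err(Warning::InputTooLarge {
            bytes: input.len(),
            limit: limits.max_input_bytes,
        });
    }

    panic::catch_unwind(AssertUnwindSafe(|| extractor(input))).map_err(|payload| {
        // Panic payloads are usually `&'static str` or `String`.
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        Warning::ExtractorPanicked(msg)
    })
}
```

Returning the failure as a value lets the caller record it alongside other warnings and continue with the next file, instead of letting one malformed document take down the whole run.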

Re-exports

pub use chunker::chunk;
pub use plugin::PluginManifest;
pub use plugin::PluginRegistry;
pub use registry::ExtractorRegistry;
pub use types::DocumentExtractor;
pub use types::ExtractContext;
pub use types::ExtractLimits;
pub use types::ExtractOutput;
pub use types::ExtractWarning;
pub use types::ExtractedChunk;
pub use types::ExtractedSegment;
pub use types::LocationKind;
pub use types::LocationQuality;
pub use types::SegmentKind;

Modules

chunker
Adaptive chunker (RFC-006 §7–§9; RFC-044 §14 boundary cleanup).
docx
DOCX text extractor (Microsoft Word 2007+; RFC-044 §16.4 resource limits).
html
HTML text extractor (RFC-005 §5; RFC-044 §16.3 resource limits).
normalize
Text normalization, version norm-v1 (RFC-005 §9).
pdf
PDF text extraction via lopdf (RFC-022 §6; RFC-044 §16.5 hardening).
plugin
Plugin extractor interface (RFC-028 §7).
registry
Extractor registry (RFC-005 §6; RFC-044 §11 panic isolation).
types
Extraction types (RFC-005 §6–§8; RFC-044 hardening).