§orbok-extract
Text extraction (RFC-005): pluggable extractors turn boundary-validated source files into normalized, location-tagged segments. Extraction output is derived data: cacheable, rebuildable, never authoritative.
RFC-044 hardening adds: resource limits (ExtractLimits), structured
warnings (ExtractWarning), panic isolation (extract_safely),
explicit location semantics (LocationKind), and removal of the
orbok-db production dependency (chunker now produces
ExtractedChunk; the pipeline layer maps to ChunkSpec).
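The hardening ideas above (a resource limit, structured warnings, and panic isolation) can be illustrated with a minimal, std-only sketch. Every name in it (`Limits`, `Warning`, `Output`, `run_isolated`) is hypothetical and chosen for illustration; this page does not show the fields or signatures of `ExtractLimits`, `ExtractWarning`, or `extract_safely`, so the sketch should not be read as the crate's API.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for a resource limit (not the real `ExtractLimits`).
#[derive(Debug, Clone, Copy)]
pub struct Limits {
    pub max_output_bytes: usize,
}

/// Hypothetical stand-in for a structured warning (not the real `ExtractWarning`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Warning {
    OutputTruncated { limit: usize },
    ExtractorPanicked { message: String },
}

/// Hypothetical extraction result: text plus any warnings raised on the way.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Output {
    pub text: String,
    pub warnings: Vec<Warning>,
}

/// Runs `extract` behind `catch_unwind` and then enforces the output limit.
/// Note: `catch_unwind` only catches unwinding panics; it has no effect
/// when the binary is built with `panic = "abort"`.
pub fn run_isolated<F>(input: &[u8], limits: Limits, extract: F) -> Output
where
    F: FnOnce(&[u8]) -> String,
{
    match panic::catch_unwind(AssertUnwindSafe(|| extract(input))) {
        Ok(mut text) => {
            let mut warnings = Vec::new();
            if text.len() > limits.max_output_bytes {
                // Back up to a char boundary so truncation never splits a code point.
                let mut cut = limits.max_output_bytes;
                while !text.is_char_boundary(cut) {
                    cut -= 1;
                }
                text.truncate(cut);
                warnings.push(Warning::OutputTruncated {
                    limit: limits.max_output_bytes,
                });
            }
            Output { text, warnings }
        }
        Err(payload) => {
            // Panic payloads are usually `&'static str` or `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Output {
                text: String::new(),
                warnings: vec![Warning::ExtractorPanicked { message }],
            }
        }
    }
}
```

In this sketch a misbehaving extractor degrades to an empty result carrying a warning instead of taking down the caller, which fits the idea that extraction output is derived and rebuildable rather than authoritative.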
Re-exports§
pub use chunker::chunk;
pub use plugin::PluginManifest;
pub use plugin::PluginRegistry;
pub use registry::ExtractorRegistry;
pub use types::DocumentExtractor;
pub use types::ExtractContext;
pub use types::ExtractLimits;
pub use types::ExtractOutput;
pub use types::ExtractWarning;
pub use types::ExtractedChunk;
pub use types::ExtractedSegment;
pub use types::LocationKind;
pub use types::LocationQuality;
pub use types::SegmentKind;
Modules§
- chunker
- Adaptive chunker (RFC-006 §7–§9; RFC-044 §14 boundary cleanup).
- docx
- DOCX text extractor (Microsoft Word 2007+; RFC-044 §16.4 resource limits).
- html
- HTML text extractor (RFC-005 §5; RFC-044 §16.3 resource limits).
- normalize
- Text normalization, version norm-v1 (RFC-005 §9).
- PDF text extraction via lopdf (RFC-022 §6; RFC-044 §16.5 hardening).
- plugin
- Plugin extractor interface (RFC-028 §7).
- registry
- Extractor registry (RFC-005 §6; RFC-044 §11 panic isolation).
- types
- Extraction types (RFC-005 §6–§8; RFC-044 hardening).