§orbok-extract
Text extraction (RFC-005): pluggable extractors turn boundary-validated source files into normalized, line-located segments. Extraction output is derived data: cacheable, rebuildable, never authoritative.
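The flow described above (select an extractor by file type, produce normalized segments that carry line locations, and return a typed result for unsupported types) can be sketched roughly as follows. This is a minimal illustration, not the crate's actual API: the type names are borrowed from the re-exports, but every field, method, variant, the `ExtractError` type, and `PlainTextExtractor` are assumptions made for the example.

```rust
// Hypothetical sketch: names echo the crate's re-exports, but all fields,
// methods, and variants below are assumptions, not the real definitions.

/// How precisely a segment's location is known (assumed variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LocationQuality {
    Exact,
    Approximate,
}

/// One normalized piece of text with 1-based source line numbers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExtractedSegment {
    pub text: String,
    pub start_line: usize,
    pub end_line: usize,
    pub location: LocationQuality,
}

/// Typed "unsupported" result instead of a panic or an empty output.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    Unsupported { extension: String },
}

/// A pluggable extractor (assumed shape).
pub trait DocumentExtractor {
    /// File extensions this extractor handles, without the leading dot.
    fn extensions(&self) -> &[&'static str];
    /// Turn already-validated source text into segments.
    fn extract(&self, source: &str) -> Vec<ExtractedSegment>;
}

/// Toy extractor: one segment per non-blank line, whitespace collapsed.
pub struct PlainTextExtractor;

impl DocumentExtractor for PlainTextExtractor {
    fn extensions(&self) -> &[&'static str] {
        &["txt"]
    }

    fn extract(&self, source: &str) -> Vec<ExtractedSegment> {
        source
            .lines()
            .enumerate()
            .filter(|(_, line)| !line.trim().is_empty())
            .map(|(index, line)| ExtractedSegment {
                // Stand-in for real normalization: collapse whitespace runs.
                text: line.split_whitespace().collect::<Vec<_>>().join(" "),
                start_line: index + 1,
                end_line: index + 1,
                location: LocationQuality::Exact,
            })
            .collect()
    }
}

/// Selects an extractor by file extension (case-insensitive).
#[derive(Default)]
pub struct ExtractorRegistry {
    extractors: Vec<Box<dyn DocumentExtractor>>,
}

impl ExtractorRegistry {
    pub fn register(&mut self, extractor: Box<dyn DocumentExtractor>) {
        self.extractors.push(extractor);
    }

    pub fn extract(
        &self,
        extension: &str,
        source: &str,
    ) -> Result<Vec<ExtractedSegment>, ExtractError> {
        self.extractors
            .iter()
            .find(|e| e.extensions().iter().any(|x| x.eq_ignore_ascii_case(extension)))
            .map(|e| e.extract(source))
            .ok_or_else(|| ExtractError::Unsupported {
                extension: extension.to_string(),
            })
    }
}
```

Because the segments are computed purely from the source text, a caller could discard them and re-run the extractor at any time, which is what makes the output safe to treat as a cache rather than a source of truth.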
Re-exports§
pub use chunker::chunk;
pub use plugin::PluginManifest;
pub use plugin::PluginRegistry;
pub use registry::ExtractorRegistry;
pub use types::DocumentExtractor;
pub use types::ExtractOutput;
pub use types::ExtractedSegment;
pub use types::LocationQuality;
pub use types::SegmentKind;
Modules§
- chunker
- Adaptive chunker (RFC-006 §7–§9).
- docx
- DOCX text extractor (Microsoft Word 2007+).
- html
- HTML text extractor.
- normalize
- Text normalization, version norm-v1 (RFC-005 §9).
- PDF text extraction via lopdf (RFC-022 §6).
- plugin
- Plugin extractor interface (RFC-028 §7).
- registry
- Extractor registry (RFC-005 §6: selection by file type, typed unsupported results).
- types
- Extraction types (RFC-005 §6–§8).