Crate orbok_extract

orbok-extract

Text extraction (RFC-005): pluggable extractors turn boundary-validated source files into normalized, line-located segments. Extraction output is derived data: cacheable, rebuildable, and never authoritative.
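To make "normalized, line-located segments" concrete, here is a minimal sketch of the idea. The names `Segment`, `Extractor`, and `PlainTextExtractor` are hypothetical and are not the crate's `ExtractedSegment` or `DocumentExtractor` definitions, whose signatures are not shown on this page. The whitespace collapsing is likewise only a stand-in for the real norm-v1 rules.

```rust
/// Hypothetical segment type: normalized text plus the 1-based,
/// inclusive range of source lines it was extracted from.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub start_line: usize,
    pub end_line: usize,
}

/// Hypothetical extractor interface (an assumption for illustration,
/// not the crate's `DocumentExtractor` trait).
pub trait Extractor {
    fn extract(&self, source: &str) -> Vec<Segment>;
}

/// Treats each run of non-blank lines as one segment.
pub struct PlainTextExtractor;

impl Extractor for PlainTextExtractor {
    fn extract(&self, source: &str) -> Vec<Segment> {
        let mut segments = Vec::new();
        let mut current: Vec<String> = Vec::new();
        let mut start = 0usize;
        let mut last = 0usize;

        for (idx, line) in source.lines().enumerate() {
            let line_no = idx + 1;
            // Stand-in normalization: collapse runs of whitespace.
            let normalized = line.split_whitespace().collect::<Vec<_>>().join(" ");

            if normalized.is_empty() {
                // A blank line closes the segment in progress, if any.
                if !current.is_empty() {
                    segments.push(Segment {
                        text: current.join(" "),
                        start_line: start,
                        end_line: last,
                    });
                    current.clear();
                }
            } else {
                if current.is_empty() {
                    start = line_no;
                }
                last = line_no;
                current.push(normalized);
            }
        }

        // Flush the trailing segment when the input does not end on a blank line.
        if !current.is_empty() {
            segments.push(Segment {
                text: current.join(" "),
                start_line: start,
                end_line: last,
            });
        }
        segments
    }
}
```

Because the output depends only on the source text, it can be discarded and recomputed at any time, which is what makes it safe to cache as derived data.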

Re-exports

pub use registry::ExtractorRegistry;
pub use chunker::chunk;
pub use plugin::PluginManifest;
pub use plugin::PluginRegistry;
pub use types::DocumentExtractor;
pub use types::ExtractOutput;
pub use types::ExtractedSegment;
pub use types::LocationQuality;
pub use types::SegmentKind;

Modules

chunker
Adaptive chunker (RFC-006 §7–§9).
docx
DOCX text extractor (Microsoft Word 2007+).
html
HTML text extractor.
normalize
Text normalization, version norm-v1 (RFC-005 §9).
pdf
PDF text extraction via lopdf (RFC-022 §6).
plugin
Plugin extractor interface (RFC-028 §7).
registry
Extractor registry (RFC-005 §6: selection by file type, typed unsupported results).
types
Extraction types (RFC-005 §6–§8).
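The `registry` module's summary mentions selection by file type with typed unsupported results. A minimal sketch of that pattern follows, assuming selection keys on the file extension. `Registry`, `SelectError`, and their methods are hypothetical names for illustration; the actual `ExtractorRegistry` API is not shown on this page and may differ.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical typed error: callers can tell "no extension" apart
/// from "an extension nobody handles" without parsing a message.
#[derive(Debug, PartialEq)]
pub enum SelectError {
    NoExtension,
    Unsupported { extension: String },
}

/// Hypothetical registry mapping lowercase extensions to extractor names.
/// A real registry would presumably store extractor objects instead.
#[derive(Default)]
pub struct Registry {
    by_extension: HashMap<String, &'static str>,
}

impl Registry {
    /// Registers an extractor name under a case-insensitive extension.
    pub fn register(&mut self, extension: &str, extractor_name: &'static str) {
        self.by_extension
            .insert(extension.to_ascii_lowercase(), extractor_name);
    }

    /// Picks an extractor by file extension, or returns a typed error.
    pub fn select(&self, file_name: &str) -> Result<&'static str, SelectError> {
        let ext = Path::new(file_name)
            .extension()
            .and_then(|e| e.to_str())
            .ok_or(SelectError::NoExtension)?
            .to_ascii_lowercase();

        match self.by_extension.get(&ext) {
            Some(name) => Ok(*name),
            None => Err(SelectError::Unsupported { extension: ext }),
        }
    }
}
```

Returning an enum rather than a bare `None` or a string lets the caller decide whether an unsupported file should be skipped, logged, or handed to a plugin.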