Crate orbok_extract

§orbok-extract

Text extraction (RFC-005): pluggable extractors turn boundary-validated source files into normalized, line-located segments. Extraction output is derived data: cacheable, rebuildable, and never authoritative.
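To make "normalized, line-located segments" concrete, here is a minimal, self-contained sketch of the idea. It does not use this crate: `Segment`, `Extractor`, `PlainText`, and the whitespace-collapsing `normalize` helper are hypothetical names invented for illustration, and the crate's real `DocumentExtractor`, `ExtractedSegment`, and norm-v1 rules may differ.

```rust
// Hypothetical illustration only; these are NOT the orbok_extract types.

/// A piece of extracted text plus the source lines it came from.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub start_line: usize, // 1-based, inclusive
    pub end_line: usize,   // 1-based, inclusive
}

/// Assumed shape of a pluggable extractor: source text in, segments out.
pub trait Extractor {
    fn extract(&self, source: &str) -> Vec<Segment>;
}

/// Toy extractor that treats blank-line-separated paragraphs as segments.
pub struct PlainText;

/// Stand-in normalization: trim and collapse runs of whitespace.
fn normalize(line: &str) -> String {
    line.split_whitespace().collect::<Vec<_>>().join(" ")
}

impl Extractor for PlainText {
    fn extract(&self, source: &str) -> Vec<Segment> {
        let mut out = Vec::new();
        let mut current: Option<Segment> = None;

        for (idx, raw) in source.lines().enumerate() {
            let line_no = idx + 1;
            let text = normalize(raw);

            if text.is_empty() {
                // A blank line closes the segment in progress, if any.
                if let Some(seg) = current.take() {
                    out.push(seg);
                }
                continue;
            }

            if let Some(seg) = current.as_mut() {
                // Extend the open segment and move its end line forward.
                seg.text.push(' ');
                seg.text.push_str(&text);
                seg.end_line = line_no;
            } else {
                current = Some(Segment {
                    text,
                    start_line: line_no,
                    end_line: line_no,
                });
            }
        }

        // Flush the final segment when the input does not end in a blank line.
        if let Some(seg) = current {
            out.push(seg);
        }
        out
    }
}
```

Because the segments are computed purely from the source text, they can be discarded and rebuilt at any time, which matches the "derived data" framing above.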

Re-exports§

pub use chunker::chunk;
pub use plugin::PluginManifest;
pub use plugin::PluginRegistry;
pub use registry::ExtractorRegistry;
pub use types::DocumentExtractor;
pub use types::ExtractOutput;
pub use types::ExtractedSegment;
pub use types::LocationQuality;
pub use types::SegmentKind;

Modules§

chunker: Adaptive chunker (RFC-006 §7–§9).
docx: DOCX text extractor (Microsoft Word 2007+).
html: HTML text extractor.
normalize: Text normalization, version norm-v1 (RFC-005 §9).
pdf: PDF text extraction via lopdf (RFC-022 §6).
plugin: Plugin extractor interface (RFC-028 §7).
registry: Extractor registry (RFC-005 §6: selection by file type, typed unsupported results).
types: Extraction types (RFC-005 §6–§8).
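The registry module is described as selecting an extractor by file type and returning typed unsupported results. The sketch below shows one way such a lookup could work, using only the standard library. `Registry`, `Selection`, and the extractor names are hypothetical and are not taken from `ExtractorRegistry`; the real selection logic may use something other than file extensions.

```rust
use std::collections::HashMap;
use std::path::Path;

// Hypothetical illustration only; NOT the orbok_extract registry API.

/// Outcome of a lookup: either a named extractor or a typed "unsupported"
/// value that callers can match on instead of parsing an error string.
#[derive(Debug, PartialEq)]
pub enum Selection {
    Supported(&'static str),
    Unsupported { extension: String },
}

/// Maps lowercase file extensions to extractor names.
pub struct Registry {
    by_ext: HashMap<String, &'static str>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { by_ext: HashMap::new() }
    }

    /// Register an extractor name for an extension (stored lowercase).
    pub fn register(&mut self, ext: &str, extractor: &'static str) {
        self.by_ext.insert(ext.to_ascii_lowercase(), extractor);
    }

    /// Pick an extractor from the path's extension, case-insensitively.
    /// A path with no extension yields `Unsupported` with an empty string.
    pub fn select(&self, path: &str) -> Selection {
        let ext = Path::new(path)
            .extension()
            .and_then(|e| e.to_str())
            .unwrap_or("")
            .to_ascii_lowercase();

        match self.by_ext.get(ext.as_str()) {
            Some(name) => Selection::Supported(*name),
            None => Selection::Unsupported { extension: ext },
        }
    }
}
```

Returning an enum variant for the unsupported case, rather than `None` or a panic, lets a caller record why a file was skipped and keep processing the rest of the batch.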
