Trait DocumentParser 

pub trait DocumentParser: Send + Sync {
    // Required methods
    fn name(&self) -> &str;
    fn supported_extensions(&self) -> &[&str];
    fn parse(&self, path: &Path) -> Result<String>;

    // Provided methods
    fn parse_document(&self, path: &Path) -> Result<ParsedDocument> { ... }
    fn can_parse(&self, path: &Path) -> bool { ... }
    fn max_file_size(&self) -> u64 { ... }
}

Extension point for custom file format parsing.

Implement this trait to add support for formats that cannot be read as plain text (PDF, Excel, Word, images with OCR, etc.) without modifying any core tool logic.

Required Methods

fn name(&self) -> &str

Unique parser identifier (used for logging and debugging).

fn supported_extensions(&self) -> &[&str]

File extensions this parser handles (case-insensitive, no leading dot).

Example: &["pdf", "PDF"]

fn parse(&self, path: &Path) -> Result<String>

Extract plain-text content from path.

Return Err if the file cannot be read or parsed; the registry will log a warning and skip the file rather than propagating the error.
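As a rough illustration of the three required methods and the skip-on-error behaviour described above, here is a minimal sketch. It assumes a local stand-in for the trait (required methods only) and uses `std::io::Result`, because the crate's own `Result` alias is not shown on this page. `LogParser` and `parse_all` are hypothetical names, not part of the crate.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Local stand-in for the crate's trait (required methods only).
// Assumption: the real `Result` alias is not shown here, so `io::Result` is used.
pub trait DocumentParser: Send + Sync {
    fn name(&self) -> &str;
    fn supported_extensions(&self) -> &[&str];
    fn parse(&self, path: &Path) -> io::Result<String>;
}

/// Hypothetical parser for `.log` files that drops blank lines.
pub struct LogParser;

impl DocumentParser for LogParser {
    fn name(&self) -> &str {
        "log-parser"
    }

    fn supported_extensions(&self) -> &[&str] {
        // No leading dot, as the trait documentation requires.
        &["log"]
    }

    fn parse(&self, path: &Path) -> io::Result<String> {
        // Any I/O failure is returned as `Err`; the caller decides what to do.
        let raw = fs::read_to_string(path)?;
        let kept: Vec<&str> = raw.lines().filter(|l| !l.trim().is_empty()).collect();
        Ok(kept.join("\n"))
    }
}

/// Hypothetical registry-style loop: on `Err`, print a warning and skip the
/// file instead of propagating the error.
pub fn parse_all(parser: &dyn DocumentParser, paths: &[PathBuf]) -> Vec<(PathBuf, String)> {
    let mut out = Vec::new();
    for path in paths {
        match parser.parse(path) {
            Ok(text) => out.push((path.clone(), text)),
            Err(err) => eprintln!(
                "warning: {} skipped {}: {}",
                parser.name(),
                path.display(),
                err
            ),
        }
    }
    out
}
```

Calling `parse_all(&LogParser, &paths)` with one readable and one missing file would yield a single entry; the missing file only produces a warning on stderr.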

Provided Methods

fn parse_document(&self, path: &Path) -> Result<ParsedDocument>

Extract a structured document from path.

The default implementation wraps DocumentParser::parse into a single raw-text block so existing parsers remain source-compatible.

fn can_parse(&self, path: &Path) -> bool

Override to decide whether this parser should attempt a file; this check runs before the extension lookup. The default compares the file's extension against supported_extensions().
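The default extension check and one possible override can be sketched as free functions. This is an illustration only, not the crate's actual implementation: `extension_matches` and `looks_like_pdf` are hypothetical helper names, and the case-insensitive comparison follows the description of supported_extensions().

```rust
use std::path::Path;

/// Sketch of the documented default: compare the path's extension,
/// ignoring ASCII case, against the parser's extension list.
pub fn extension_matches(path: &Path, supported: &[&str]) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| supported.iter().any(|s| s.eq_ignore_ascii_case(ext)))
        .unwrap_or(false) // no extension, or a non-UTF-8 one: no match
}

/// Hypothetical override logic: accept a file by its leading bytes rather
/// than its name. PDF files start with the marker `%PDF-`.
pub fn looks_like_pdf(header: &[u8]) -> bool {
    header.starts_with(b"%PDF-")
}
```

An overriding `can_parse` could read the first few bytes of the file and return `looks_like_pdf(&bytes)`, which lets a parser claim files that carry the wrong extension or none at all.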

fn max_file_size(&self) -> u64

Maximum file size (bytes) this parser accepts. Files larger than this limit are silently skipped. Default: 10 MiB.
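For reference, 10 MiB is 10 × 1024 × 1024 = 10,485,760 bytes. The sketch below shows how a caller might apply such a limit before invoking `parse`. The constant and function names are hypothetical, and the crate's own check may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// 10 MiB expressed in bytes, matching the documented default.
/// (Hypothetical constant name; the crate may define it differently.)
pub const DEFAULT_MAX_FILE_SIZE: u64 = 10 * 1024 * 1024;

/// Returns true when a file of `file_len` bytes is within the limit.
/// Only files strictly larger than the limit are rejected.
pub fn within_limit(file_len: u64, max_file_size: u64) -> bool {
    file_len <= max_file_size
}

/// Hypothetical size gate: look up the file's length on disk and compare it
/// against the parser's limit.
pub fn file_within_limit(path: &Path, max_file_size: u64) -> io::Result<bool> {
    let len = fs::metadata(path)?.len();
    Ok(within_limit(len, max_file_size))
}
```

A parser for a format with large files could override `max_file_size` to return a bigger value, for example `50 * 1024 * 1024` for 50 MiB.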

Implementors