Parse a document file by path: detect format, return normalized UTF-8 text.
Format detection is by file extension only (no magic-byte sniffing). The
allow-list in [ALLOWED] is the source of truth for which file types
Solo accepts via solo ingest; anything outside it errors with
ParseError::UnsupportedExtension.
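As a rough illustration of extension-only detection against an allow-list, here is a minimal sketch using only the standard library. The names `ALLOWED_SKETCH`, `SketchError`, and `detect_extension` are hypothetical stand-ins, the list contents are invented for the example, and the case-insensitive comparison is an assumption; the crate's actual `ALLOWED` and `ParseError` may differ.

```rust
use std::path::Path;

/// Hypothetical allow-list for illustration only; the real `ALLOWED`
/// contents are not shown on this page.
const ALLOWED_SKETCH: &[&str] = &["txt", "md", "rs", "pdf", "html"];

/// Hypothetical error type mirroring the idea of
/// `ParseError::UnsupportedExtension`.
#[derive(Debug, PartialEq)]
enum SketchError {
    UnsupportedExtension(String),
}

/// Decide the format from the file extension alone. The file is never
/// opened, so no magic-byte sniffing takes place.
fn detect_extension(path: &Path) -> Result<String, SketchError> {
    // Assumption: extensions are compared case-insensitively.
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();

    if ALLOWED_SKETCH.contains(&ext.as_str()) {
        Ok(ext)
    } else {
        // Files with no extension at all end up here with an empty string.
        Err(SketchError::UnsupportedExtension(ext))
    }
}
```

With this sketch, `detect_extension(Path::new("notes.MD"))` yields `Ok("md")`, while `image.png` and an extensionless `Makefile` are both rejected.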
§Backends
- Plaintext / markdown / source code → std::fs::read_to_string (must be valid UTF-8; latin-1 / shift-jis / etc. are rejected, matching the storage layer’s UTF-8-only invariant).
- PDF → pdf_extract::extract_text (pure-Rust, no C deps; quality is acceptable for text-bearing PDFs but degrades on scanned / image-only PDFs; see ADR-0003 / risk #1 in 0083).
- HTML → html2text::from_read with a deliberately huge wrap width (80 000 cols) so the chunker isn’t fed artificial line-breaks.
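The UTF-8-only behaviour of the plaintext backend follows from `std::fs::read_to_string` itself, which fails with `ErrorKind::InvalidData` when the file is not valid UTF-8. The sketch below shows one way to tell that case apart from other I/O failures; `read_utf8_only` is a hypothetical helper, not part of this module, and how the real code maps the failure into `ParseError` is not shown here.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::Path;

/// Hypothetical helper: returns `Ok(Some(text))` for valid UTF-8,
/// `Ok(None)` when the bytes are not UTF-8 (e.g. latin-1 or shift-jis),
/// and `Err` for any other I/O failure such as a missing file.
fn read_utf8_only(path: &Path) -> io::Result<Option<String>> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(Some(text)),
        // `read_to_string` reports invalid UTF-8 as `InvalidData`.
        Err(e) if e.kind() == ErrorKind::InvalidData => Ok(None),
        Err(e) => Err(e),
    }
}
```

A file containing the latin-1 bytes `caf\xe9` comes back as `Ok(None)`, whereas the same word encoded as UTF-8 is returned as text.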
Structs§
- ParsedDocument - What parse_file returns on success.
Enums§
- ParseError - Errors surfaced from parse_file.
Functions§
- parse_file - Parse a file at path. Returns the normalized text + mime_type + raw byte size of the source file (which is NOT the same as text.len() for PDF / HTML, because those backends transform the input).
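To make the distinction between raw byte size and `text.len()` concrete, here is a toy sketch. Everything in it is hypothetical: `ParsedSketch` and its field names only approximate what a `ParsedDocument` might carry, and `strip_tags` is a crude stand-in for a real HTML backend such as html2text, not a reimplementation of it.

```rust
/// Hypothetical result shape; the real `ParsedDocument` fields are not
/// listed on this page.
#[derive(Debug)]
struct ParsedSketch {
    text: String,
    mime_type: &'static str,
    source_bytes: usize,
}

/// Toy transform: drops everything between '<' and '>'.
/// A real HTML backend does far more than this.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag => out.push(ch),
            _ => {}
        }
    }
    out
}

/// Records the size of the raw input before transforming it, so the
/// caller can see both numbers. Returns `None` for non-UTF-8 input.
fn parse_html_sketch(raw: &[u8]) -> Option<ParsedSketch> {
    let html = std::str::from_utf8(raw).ok()?;
    Some(ParsedSketch {
        text: strip_tags(html),
        mime_type: "text/html",
        source_bytes: raw.len(),
    })
}
```

For the input `<p>hi</p>`, `source_bytes` is 9 while `text.len()` is 2, which is the kind of gap callers should expect whenever a backend transforms its input.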