
Module parse


Parse a document file by path: detect format, return normalized UTF-8 text.

Format detection is by file extension only (no magic-byte sniffing). The allow-list in [ALLOWED] is the source of truth for which file types Solo accepts via solo ingest; anything outside it errors with ParseError::UnsupportedExtension.
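As a rough illustration of extension-only detection, the sketch below looks a path's extension up in an allow-list and errors otherwise. The table contents, the detect_mime helper, the lowercase normalization, and the String payload on the error variant are all assumptions made for this example; only the names ALLOWED and ParseError::UnsupportedExtension come from the module itself.

```rust
use std::path::Path;

/// Hypothetical stand-in for the module's allow-list: (extension, mime type).
/// The real entries in ALLOWED may differ.
const ALLOWED: &[(&str, &str)] = &[
    ("txt", "text/plain"),
    ("md", "text/markdown"),
    ("pdf", "application/pdf"),
    ("html", "text/html"),
];

/// Simplified error type; the payload shape is an assumption.
#[derive(Debug, PartialEq)]
enum ParseError {
    UnsupportedExtension(String),
}

/// Hypothetical helper: decide the format from the extension alone.
/// The file contents are never opened, so there is no magic-byte sniffing.
fn detect_mime(path: &Path) -> Result<&'static str, ParseError> {
    // Lowercasing is an assumption; the real module may be case-sensitive.
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();

    ALLOWED
        .iter()
        .find(|(allowed, _)| *allowed == ext)
        .map(|(_, mime)| *mime)
        .ok_or(ParseError::UnsupportedExtension(ext))
}
```

Under these assumptions, a file named notes.pdf is treated as a PDF even if it actually contains plain text, because only the name is inspected.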

§Backends

  • Plaintext / markdown / source code → std::fs::read_to_string (must be valid UTF-8; Latin-1, Shift-JIS, and other encodings are rejected, matching the storage layer’s UTF-8-only invariant).
  • PDF → pdf_extract::extract_text (pure Rust, no C dependencies; quality is acceptable for text-bearing PDFs but degrades on scanned / image-only PDFs; see ADR-0003 / risk #1 in 0083).
  • HTML → html2text::from_read with a deliberately huge wrap width (80,000 columns) so the chunker isn’t fed artificial line breaks.
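The UTF-8-only behavior of the plaintext backend follows directly from std::fs::read_to_string, which fails with io::ErrorKind::InvalidData when the file is not valid UTF-8. The sketch below shows that behavior in isolation; read_plaintext and is_encoding_error are hypothetical helper names, and how the module maps this failure into ParseError is not shown here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical wrapper: read a file as text, accepting only valid UTF-8.
/// No transcoding is attempted, so Latin-1 or Shift-JIS input is an error.
fn read_plaintext(path: &Path) -> io::Result<String> {
    fs::read_to_string(path)
}

/// Hypothetical helper: true when the error means the bytes were not
/// valid UTF-8 (as opposed to a missing file, permissions, and so on).
fn is_encoding_error(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::InvalidData
}
```

A caller could use is_encoding_error to tell "wrong encoding" apart from other I/O failures before reporting the problem to the user.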

Structs§

ParsedDocument
What parse_file returns on success.

Enums§

ParseError
Errors surfaced from parse_file.

Functions§

parse_file
Parse a file at path. Returns the normalized text, the mime_type, and the raw byte size of the source file. The byte size is NOT the same as text.len() for PDF / HTML, because those backends transform the input.
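To make the distinction between source size and text length concrete, here is a minimal sketch of what the success value might look like. The field names, the from_parts constructor, and the text_differs_from_source method are assumptions for illustration and may not match the real ParsedDocument; the extracted text is written by hand rather than produced by html2text.

```rust
/// Hypothetical shape of the success value; field names are assumptions.
#[derive(Debug)]
struct ParsedDocument {
    text: String,      // normalized UTF-8 text handed to the chunker
    mime_type: String, // e.g. "text/html"
    byte_size: u64,    // size of the ORIGINAL file on disk, in bytes
}

impl ParsedDocument {
    /// Hypothetical constructor: pair extracted text with the raw source bytes.
    fn from_parts(text: &str, mime_type: &str, source: &[u8]) -> Self {
        ParsedDocument {
            text: text.to_string(),
            mime_type: mime_type.to_string(),
            byte_size: source.len() as u64,
        }
    }

    /// True when the text length differs from the on-disk size, which is
    /// expected whenever a backend transforms its input.
    fn text_differs_from_source(&self) -> bool {
        self.text.len() as u64 != self.byte_size
    }
}
```

For a plaintext file the two numbers coincide, since the text is the file's bytes. For HTML such as `<p>Hello</p>` (12 bytes), the extracted text is shorter than the source, so code that needs the on-disk size should read byte_size rather than text.len().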