§mailrs-attachment-extract
Extract text from email (or any) attachments — PDF via
pdf-extract (pure Rust embedded-text path) and images via
the tesseract CLI subprocess.
Designed for the case where you need plain text from arbitrary
attachments: indexing, search, embedding generation, spam scoring,
LLM context. It does not link against the libtesseract C library; the
crate shells out to the tesseract binary, so it works wherever
tesseract is installed and avoids the cost of maintaining C-library bindings.
§Two-stage PDF fallback
PDFs come in two flavours:
- Real PDFs with embedded text: extract via pdf-extract in ~1 ms, confidence 1.0, exact.
- Scanned PDFs with image pages only: embedded extraction
returns near-empty text; fall back to OCR on the raw PDF bytes
(tesseract can handle PDFs directly), confidence ~0.85.
extract_content does the dispatch automatically — if embedded
text is < 50 chars (heuristic), it tries OCR as fallback.
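The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: `Extracted`, `extract_with_fallback`, and the two closures are hypothetical stand-ins for the real result type and extractors, and the exact comparison (trimmed character count, propagating the OCR error) is an assumption.

```rust
/// Threshold below which embedded text counts as "near-empty".
/// 50 chars comes from the docs; how it is measured is an assumption.
const MIN_EMBEDDED_CHARS: usize = 50;

/// Hypothetical stand-in for the crate's result type.
#[derive(Debug, PartialEq)]
struct Extracted {
    text: String,
    confidence: f32,
}

/// Try the embedded-text path first; fall back to OCR when the text is too short.
/// `embedded` and `ocr` are hypothetical closures standing in for the real extractors.
fn extract_with_fallback<E, O>(data: &[u8], embedded: E, ocr: O) -> Result<Extracted, String>
where
    E: Fn(&[u8]) -> Result<String, String>,
    O: Fn(&[u8]) -> Result<String, String>,
{
    // A failed embedded extraction is treated like empty text (assumption).
    let text = embedded(data).unwrap_or_default();
    if text.trim().chars().count() >= MIN_EMBEDDED_CHARS {
        return Ok(Extracted { text, confidence: 1.0 });
    }
    // Too little embedded text: run OCR and report the lower confidence.
    let text = ocr(data)?;
    Ok(Extracted { text, confidence: 0.85 })
}
```

In this sketch an OCR failure surfaces as an `Err`, so a scanned PDF on a machine without tesseract would fail rather than return near-empty text; whether the crate behaves the same way is not stated here.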
§Quick start
use mailrs_attachment_extract::{extract_content, extraction_method, ExtractionMethod};
let pdf_bytes: &[u8] = b"%PDF-1.0\n..."; // your PDF bytes
// Single auto-dispatch entrypoint.
let result = extract_content(pdf_bytes, "application/pdf", "eng").unwrap();
println!("text: {}", result.text);
println!("confidence: {}", result.confidence);
// Or check the method first if you want to skip unsupported types early.
if extraction_method("application/pdf") != ExtractionMethod::Unsupported {
// …
}
§What’s in the box
| Function | Purpose |
|---|---|
| extract_content(data, content_type, ocr_langs) | One-call auto-dispatch (PDF or image) |
| extract_pdf_text(data) | PDF embedded-text only |
| ocr_image(data, langs) | OCR via tesseract CLI |
| extraction_method(content_type) | Inspect which backend applies |
| tesseract_available() | Spawn-test for tesseract binary |
| ExtractionResult | text + language + confidence + page count + metadata |
| MAX_EXTRACT_SIZE | Recommended 50 MiB input cap (caller enforces) |
§Supported types
| Content-Type | Method |
|---|---|
| application/pdf | PDF text + OCR fallback |
| image/png | OCR |
| image/jpeg | OCR |
| image/webp | OCR |
| image/tiff | OCR |
| image/bmp | OCR |
| image/gif | OCR |
| anything else | unsupported (returns empty result) |
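The table above amounts to a case-insensitive lookup on the Content-Type string. A minimal sketch of such a mapping follows; the `Backend` enum and `backend_for` function are hypothetical names for illustration (only the `Unsupported` variant name appears in the crate's docs), and trimming whitespace is an assumption.

```rust
/// Hypothetical mirror of the three outcomes in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    /// Embedded PDF text, with OCR as fallback.
    PdfText,
    /// OCR only.
    Ocr,
    /// No extractor applies.
    Unsupported,
}

/// Map a Content-Type string to a backend, ignoring ASCII case.
fn backend_for(content_type: &str) -> Backend {
    match content_type.trim().to_ascii_lowercase().as_str() {
        "application/pdf" => Backend::PdfText,
        "image/png" | "image/jpeg" | "image/webp" | "image/tiff" | "image/bmp"
        | "image/gif" => Backend::Ocr,
        // Anything not listed in the table falls through.
        _ => Backend::Unsupported,
    }
}
```

Note that this sketch does not strip parameters such as `; charset=...`, so a caller would need to pass the bare media type.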
§Runtime requirements
- tesseract CLI for image OCR. Install via brew install tesseract / apt install tesseract-ocr / etc. Language packs optional but recommended (tesseract-ocr-jpn, etc.).
- pdf-extract is a pure-Rust dep, no system requirement.
If tesseract isn’t installed, extract_content for image
content-types returns an Err. PDF extraction still works for
embedded-text PDFs.
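If you want to detect a missing tesseract before handing images to the extractor, a spawn-test along the lines of what tesseract_available is documented to do might look like the sketch below. `binary_available` and `tesseract_available_cached` are hypothetical helpers, not part of the crate; the caching wrapper is an addition, since the documented check does not cache.

```rust
use std::process::{Command, Stdio};
use std::sync::OnceLock;

/// Returns true if `program --version` can be spawned and exits successfully.
/// Output is discarded; a spawn failure (binary not on PATH) counts as false.
fn binary_available(program: &str) -> bool {
    Command::new(program)
        .arg("--version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Hypothetical cached wrapper: probes once per process, then reuses the answer.
fn tesseract_available_cached() -> bool {
    static AVAILABLE: OnceLock<bool> = OnceLock::new();
    *AVAILABLE.get_or_init(|| binary_available("tesseract"))
}
```

The trade-off of caching is staleness: if tesseract is installed or removed while the process is running, the cached answer will not change until restart.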
§License
Apache-2.0 OR MIT.
Structs§
- ExtractionResult - Result of an extraction attempt: text content plus provenance metadata (language hint, confidence, page count), suitable for indexing or embedding generation downstream.
Enums§
- ExtractionMethod - Which extraction backend applies to a given Content-Type.
Constants§
- MAX_EXTRACT_SIZE - Recommended upper bound on input size for extract_content: 50 MiB. Caller’s choice whether to enforce; we don’t enforce internally because the right limit varies by deployment (an archive-grade system may want 500 MiB, a mobile MTA may want 5 MiB).
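Because the cap is left to the caller, a guard before extraction could look like this minimal sketch. The local `MAX_EXTRACT_SIZE` constant mirrors the documented 50 MiB value (its `usize` type is an assumption), and `check_size` is a hypothetical helper, not a crate function.

```rust
/// Local stand-in mirroring the documented 50 MiB recommendation.
/// The real constant's type is an assumption.
const MAX_EXTRACT_SIZE: usize = 50 * 1024 * 1024;

/// Hypothetical guard: reject inputs larger than `limit` bytes.
fn check_size(data: &[u8], limit: usize) -> Result<(), String> {
    if data.len() > limit {
        Err(format!(
            "attachment is {} bytes, over the {} byte limit",
            data.len(),
            limit
        ))
    } else {
        Ok(())
    }
}
```

A deployment with different needs would pass its own limit instead of `MAX_EXTRACT_SIZE`, for example `check_size(bytes, 5 * 1024 * 1024)` on a constrained host.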
Functions§
- extract_content - Auto-dispatch: pick the right extractor for content_type and run.
- extract_pdf_text - Extract embedded text from a PDF (pure Rust via pdf-extract). Confidence is 1.0 because embedded text is exact, not OCR’d. page_count is approximated by counting form-feed (\u{000C}) page-break markers; off-by-one is possible for malformed PDFs.
- extraction_method - Choose an ExtractionMethod from a Content-Type string. Case-insensitive. Unknown types fall through to Unsupported.
- ocr_image - OCR an image via the tesseract CLI subprocess.
- tesseract_available - Check whether the tesseract CLI binary is on PATH. Spawns tesseract --version and checks for success; no caching. If you’ll call this on a hot path, cache the result yourself.
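The page_count approximation mentioned for extract_pdf_text can be illustrated with a short sketch. `approx_page_count` is a hypothetical function, and it assumes the form-feed marker separates pages (so N markers mean N + 1 pages); if the marker instead terminates every page, the count is one too high, which is the kind of off-by-one the description warns about.

```rust
/// Approximate a page count from extracted text by counting form feeds.
/// Assumption: U+000C sits between pages, so pages = markers + 1.
fn approx_page_count(text: &str) -> usize {
    if text.is_empty() {
        // No text at all: report zero pages rather than one.
        return 0;
    }
    text.chars().filter(|&c| c == '\u{000C}').count() + 1
}
```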