Crate mailrs_attachment_extract

§mailrs-attachment-extract

Extracts text from email (or any other) attachments: PDFs via pdf-extract (the pure-Rust embedded-text path) and images via a tesseract CLI subprocess.

Designed for cases where you need plain text from arbitrary attachments: indexing, search, embedding generation, spam scoring, LLM context. The crate does not link against the libtesseract C library; it shells out to the tesseract binary, so it works wherever tesseract is installed and avoids the cost of maintaining C-library bindings.

§Two-stage PDF fallback

PDFs come in two flavours:

  1. Real PDFs with embedded text — extract via pdf-extract in ~1 ms, confidence 1.0, exact.
  2. Scanned PDFs with image pages only — embedded extraction returns near-empty text; fall back to OCR on the raw PDF bytes (tesseract can handle PDFs directly), confidence ~0.85.

extract_content does the dispatch automatically: if the embedded text is shorter than 50 characters (a heuristic), it falls back to OCR.
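
The two-stage dispatch described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: `Extracted`, `MIN_EMBEDDED_CHARS`, and `extract_with_fallback` are hypothetical names, the two extractors are passed in as closures so the sketch has no dependency on pdf-extract or tesseract, and trimming whitespace before counting characters is an assumption.

```rust
/// Hypothetical stand-in for the crate's result type. Only the two fields
/// needed for the illustration are modelled; their types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Extracted {
    pub text: String,
    pub confidence: f32,
}

/// Threshold from the text above: fewer than 50 characters is "near-empty".
pub const MIN_EMBEDDED_CHARS: usize = 50;

/// Try the embedded-text path first; fall back to OCR when it yields too
/// little text. Both extractors are supplied by the caller.
pub fn extract_with_fallback<E, O>(data: &[u8], embedded: E, ocr: O) -> Result<Extracted, String>
where
    E: Fn(&[u8]) -> Result<String, String>,
    O: Fn(&[u8]) -> Result<String, String>,
{
    // Assumption: a failed embedded extraction is treated like an empty one.
    let text = embedded(data).unwrap_or_default();
    if text.trim().chars().count() >= MIN_EMBEDDED_CHARS {
        // Embedded text is exact, so it gets full confidence.
        return Ok(Extracted { text, confidence: 1.0 });
    }
    // Near-empty embedded text: assume a scanned PDF and run OCR instead.
    let text = ocr(data)?;
    Ok(Extracted { text, confidence: 0.85 })
}
```

Passing the extractors as closures keeps the threshold logic testable without a PDF fixture or a tesseract install; a real caller would plug in the PDF and OCR functions listed further down.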

§Quick start

use mailrs_attachment_extract::{extract_content, extraction_method, ExtractionMethod};

let pdf_bytes: &[u8] = b"%PDF-1.0\n..."; // your PDF bytes

// Single auto-dispatch entrypoint.
let result = extract_content(pdf_bytes, "application/pdf", "eng").unwrap();
println!("text: {}", result.text);
println!("confidence: {}", result.confidence);

// Or check the method first if you want to skip unsupported types early.
if extraction_method("application/pdf") != ExtractionMethod::Unsupported {
    // …
}

§What’s in the box

Function                                         Purpose
extract_content(data, content_type, ocr_langs)   One-call auto-dispatch (PDF or image)
extract_pdf_text(data)                           PDF embedded-text only
ocr_image(data, langs)                           OCR via tesseract CLI
extraction_method(content_type)                  Inspect which backend applies
tesseract_available()                            Spawn-test for tesseract binary
ExtractionResult                                 text + language + confidence + page count + metadata
MAX_EXTRACT_SIZE                                 Recommended 50 MiB input cap (caller enforces)

§Supported types

Content-Type       Method
application/pdf    PDF text + OCR fallback
image/png          OCR
image/jpeg         OCR
image/webp         OCR
image/tiff         OCR
image/bmp          OCR
image/gif          OCR
anything else      unsupported (returns empty result)
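
As an illustration of the table above, a Content-Type lookup could look like the sketch below. The `Method` enum and `method_for` function are hypothetical and only mirror the documented behaviour of extraction_method; the crate's real ExtractionMethod variants (other than Unsupported) are not shown in this page, and stripping parameters such as `; name=...` is an assumption.

```rust
/// Hypothetical mirror of the backends in the table; not the crate's enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Method {
    PdfText,
    ImageOcr,
    Unsupported,
}

/// Map a Content-Type string to a backend, ignoring ASCII case.
pub fn method_for(content_type: &str) -> Method {
    // Assumption: parameters after ';' are dropped before matching.
    let essence = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match essence.as_str() {
        "application/pdf" => Method::PdfText,
        "image/png" | "image/jpeg" | "image/webp" | "image/tiff" | "image/bmp" | "image/gif" => {
            Method::ImageOcr
        }
        // Anything not in the table falls through to Unsupported.
        _ => Method::Unsupported,
    }
}
```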

§Runtime requirements

  • tesseract CLI for image OCR and for the scanned-PDF OCR fallback. Install it with brew install tesseract, apt install tesseract-ocr, or your platform's equivalent. Language packs are optional but recommended (tesseract-ocr-jpn, etc.).
  • pdf-extract is a pure-Rust dependency with no system requirements.

If tesseract isn’t installed, extract_content for image content-types returns an Err. PDF extraction still works for embedded-text PDFs.

§License

Apache-2.0 OR MIT.

Structs§

ExtractionResult
Result of an extraction attempt — text content + provenance metadata (language hint, confidence, page count) suitable for indexing or embedding generation downstream.

Enums§

ExtractionMethod
Which extraction backend applies to a given Content-Type.

Constants§

MAX_EXTRACT_SIZE
Recommended upper bound on input size for extract_content: 50 MiB. Enforcement is the caller's choice; the crate does not enforce it internally because the right limit varies by deployment (an archive-grade system may want 500 MiB, a mobile MTA may want 5 MiB).
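
Because the cap is not enforced internally, a caller might guard the input before extraction along these lines. This is a sketch only: `SIZE_CAP` and `check_size` are hypothetical names, and the integer type of the crate's MAX_EXTRACT_SIZE is an assumption (usize is used here).

```rust
/// 50 MiB, mirroring the documented value of MAX_EXTRACT_SIZE.
/// Hypothetical constant; the crate's own constant may have a different type.
pub const SIZE_CAP: usize = 50 * 1024 * 1024;

/// Return the input unchanged if it fits under `cap`, otherwise an error
/// message the caller can log or surface.
pub fn check_size(data: &[u8], cap: usize) -> Result<&[u8], String> {
    if data.len() > cap {
        Err(format!(
            "attachment is {} bytes, limit is {} bytes",
            data.len(),
            cap
        ))
    } else {
        Ok(data)
    }
}
```

Taking the cap as a parameter lets each deployment choose its own limit, which matches the reasoning given for leaving enforcement to the caller.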

Functions§

extract_content
Auto-dispatch: pick the right extractor for content_type and run.
extract_pdf_text
Extract embedded text from a PDF (pure Rust via pdf-extract). Confidence is 1.0 because embedded text is exact, not OCR’d. page_count is approximated by counting form-feed (\u{000C}) page-break markers; off-by-one is possible for malformed PDFs.
extraction_method
Choose an ExtractionMethod from a Content-Type string. Case-insensitive. Unknown types fall through to Unsupported.
ocr_image
OCR an image via the tesseract CLI subprocess.
tesseract_available
Check whether the tesseract CLI binary is on PATH. Spawns tesseract --version and checks for success — no caching. If you’ll call this on a hot path, cache the result yourself.
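
Since tesseract_available does no caching, one way to cache the probe yourself is with `std::sync::OnceLock` (Rust 1.70 or later). The sketch below is an assumption-laden illustration: `probe_tesseract` re-creates the documented `tesseract --version` check rather than calling the crate, and `cached` and `tesseract_available_cached` are hypothetical helper names.

```rust
use std::process::{Command, Stdio};
use std::sync::OnceLock;

/// Stand-in for the crate's probe: run `tesseract --version` and report
/// whether it exited successfully. Spawn failures count as "not available".
fn probe_tesseract() -> bool {
    Command::new("tesseract")
        .arg("--version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Run `probe` at most once per cell and return the stored answer afterwards.
pub fn cached<F: FnOnce() -> bool>(cell: &OnceLock<bool>, probe: F) -> bool {
    *cell.get_or_init(probe)
}

/// Process-wide cached answer; only the first call spawns a subprocess.
pub fn tesseract_available_cached() -> bool {
    static AVAILABLE: OnceLock<bool> = OnceLock::new();
    cached(&AVAILABLE, probe_tesseract)
}
```

The cached value never refreshes, so a tesseract binary installed after the first call is not noticed until the process restarts; whether that trade-off is acceptable depends on the deployment.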