Crate omniparse

Expand description

§Omniparse — Rust content extraction toolkit

Apache-Tika-style detection and extraction for 25+ file formats. Pure Rust, no system libraries, optional async / parallel / OCR.

§Supported formats

Text: Plain text, JSON, CSV/TSV, XML, HTML (OpenGraph, Twitter, canonical URL, heading counts), CSS, RTF, Markdown
Documents: PDF (version, encryption, form-field / annotation / attachment counts), DOCX, DOC, XLSX, XLS, PPTX, PPT, ODT, ODS, ODP, EPUB
Images: JPEG (full EXIF), PNG (decompressed zTXt/iTXt), TIFF, SVG, WebP. Optional OCR routes image text to Content::Text.
Audio: MP3 (ID3v1/v2)
Archives: ZIP, TAR (with path-traversal detection)

See SUPPORTED_FORMATS.md for per-format metadata keys.

§Cargo features

Feature	Default	Purpose
`async`	off	Tokio-based async extraction
`parallel`	off	Rayon-based batch processing
`markdown`	on	Markdown parser
`svg`	on	SVG parser
`webp`	on	WebP parser
`epub`	on	EPUB parser
`mp3`	on	MP3 parser
`pdf`	on	PDF parser via `lopdf` + lenient fallback (`weezl` / `ascii85`)
`pdf-extract`	off	4th-tier PDF fallback via `pdf-extract` (linearized / Identity-H PDFs)
`ocr`	off	Classical OCR pipeline
`ocr-train`	off	TTF → prototype trainer
`ocr-parallel`	off	Parallel per-region recognition
`ocr-ml`	off	ML OCR backend (ocrs + rten)

§Acknowledgments

Omniparse stands on the shoulders of several pure-Rust libraries. The PDF tier specifically uses:

lopdf — strict-tier PDF parser (xref / trailer / object dictionary parse, embedded-image extraction for the OCR path). MIT licensed.
weezl — LZWDecode filter support in the raw_scan fallback. MIT/Apache-2.0.
ascii85 — ASCII85Decode filter support in the raw_scan fallback. MIT/Apache-2.0.
pdf-extract (optional, behind the pdf-extract feature) — 4th-tier text extraction for PDFs that lopdf can’t load. MIT licensed.

EPUB support uses:

rbook (behind the epub feature) — EPUB 2/3 OPF metadata + reading-order text extraction. Apache-2.0.

See Cargo.toml for the full dependency tree and per-crate version pins. A deny.toml policy (enforced in CI via cargo deny) keeps the dependency tree free of GPL/AGPL copyleft.

§PDF parsing tiers

Real-world PDFs are messy — truncated downloads, linearized exports, Identity-H + /ToUnicode CMaps, and appended HTTP-chunk garbage all defeat strict parsers. Omniparse’s PDF parser is a four-tier fallback chain so the caller almost always gets text:

strict — lopdf::Document::load_mem. Full metadata + per-page text. Most well-formed PDFs.
repaired_xref — truncate trailing bytes after the last %%EOF, retry strict load. Catches HTTP-chunk leftovers / double-%%EOF.
raw_scan — walk stream/endstream byte ranges, decode FlateDecode / LZWDecode / ASCII85Decode / uncompressed payloads, regex-extract Tj / TJ operators. Recovers text from PDFs lopdf can’t load. Output gated by a “looks-like-text” heuristic so glyph-index / encrypted bytes don’t reach the caller.
pdf_extract (only with --features pdf-extract) — re-parse via pdf-extract. Tolerates linearized PDFs + Identity-H + /ToUnicode CMaps (Lucidchart, Word print-to-PDF, browser print-to-PDF).

Every successful response carries a pdf_parse_strategy metadata field ("strict" / "repaired_xref" / "raw_scan" / "pdf_extract"). Tiers 2–4 also set pdf_parse_partial = true and pdf_parse_error = "<original lopdf error>". Tier 4 is the most important opt-in for shops processing Lucidchart or Word-print exports.

let result = omniparse::extract_from_path("document.pdf")?;
if let Some(strategy) = result.metadata.get("pdf_parse_strategy") {
    println!("PDF parsed via tier: {strategy:?}");
}

§Web service example

See examples/web_service_prod.rs for a Cloud Run-ready Axum service that wraps this library: Cloud Logging JSON output, Prometheus /metrics, /live + /ready probes, body-size + timeout + concurrency limits, panic catcher, graceful shutdown, and a --healthcheck mode for distroless containers. The published Docker image uses this binary as its ENTRYPOINT.

§Quickstart

use omniparse::extract_from_path;

let result = extract_from_path("document.pdf")?;
println!("MIME type: {}", result.mime_type);
println!("Content: {:?}", result.content);

§Extract from HTML

use omniparse::extract_from_path;

let result = extract_from_path("webpage.html")?;
if let Some(title) = result.metadata.get("title") {
    println!("Page title: {:?}", title);
}
// v0.3: OpenGraph, Twitter, canonical URL, heading counts also available.
if let Some(og_title) = result.metadata.get("og_title") {
    println!("og:title = {:?}", og_title);
}

§Extract from spreadsheets

use omniparse::extract_from_path;

// Works with XLSX, XLS, and ODS
let result = extract_from_path("data.xlsx")?;
if let Some(sheet_count) = result.metadata.get("sheet_count") {
    println!("Number of sheets: {:?}", sheet_count);
}

§Extract from bytes with MIME type hint

use omniparse::extract_from_bytes;

let data = std::fs::read("file.json")?;
let result = extract_from_bytes(&data, Some("application/json"))?;

§Check supported formats

use omniparse::{supported_mime_types, is_mime_supported};

let types = supported_mime_types();
println!("Supported types: {:?}", types);

if is_mime_supported("application/pdf") {
    println!("PDF is supported!");
}

§OCR

Off by default. One env var selects the backend at runtime:

OMNIPARSE_OCR=classical — pure-Rust classical pipeline (ocr feature)
OMNIPARSE_OCR=ml — ML backend via ocrs + rten (ocr-ml feature)
OMNIPARSE_OCR=off / unset — OCR disabled (image parsers extract EXIF only)

Image and PDF parsers automatically route through OCR when the gate is set and populate ocr_status / ocr_confidence / ocr_applied metadata.

// OMNIPARSE_OCR=classical (or =ml) activates OCR for image parsers.
let result = omniparse::extract_from_path("photo.jpg")?;
if let Some(status) = result.metadata.get("ocr_status") {
    println!("ocr_status = {status:?}");
}

Direct library use of the classical engine:

use omniparse::ocr::OcrEngine;
let engine = OcrEngine::new();
let image = image::open("page.png").unwrap();
let output = engine.recognize(image)?;
println!("{}", output.text);

ML backend (requires ocr-ml feature; pre-trained models are downloaded and SHA-256-verified on first use, or pre-fetched via the CLI omniparse models download):

let engine = omniparse::ocr::ml::MlOcrEngine::new()?;
let image = image::open("photo.jpg").unwrap();
let output = engine.recognize(image)?;
println!("{}", output.text);

See OCR_GUIDE.md for the model-cache CLI, training custom prototypes, tuning, debugging, and the full env-var reference.

Re-exports§

pub use core::Error;
pub use core::Result;
pub use core::result::Content;
pub use core::result::ExtractionResult;
pub use core::result::Metadata;
pub use core::result::MetadataValue;

Modules§

core: Core types and functionality for Omniparse
detection: File type detection functionality
parsers: Parser implementations for various file formats
utils: Utility functions and helpers

Functions§

extract_from_bytes: Extract text and metadata from a byte slice.
extract_from_path: Extract text and metadata from a file at the specified path.
is_mime_supported: Check if a specific MIME type is supported.
supported_mime_types: Get a list of all supported MIME types.

Crate omniparse

Crate omniparse Copy item path

§Omniparse — Rust content extraction toolkit

§Supported formats

§Cargo features

§Acknowledgments

§PDF parsing tiers

§Web service example

§Quickstart

§Extract from HTML

§Extract from spreadsheets

§Extract from bytes with MIME type hint

§Check supported formats

§OCR

Re-exports§

Modules§

Functions§

Crate omniparse