§Omniparse — Rust content extraction toolkit
Apache-Tika-style detection and extraction for 25+ file formats. Pure Rust, no system libraries, optional async / parallel / OCR.
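As a rough illustration of what signature-based detection involves, here is a minimal sketch that matches a few well-known magic numbers. The function `sniff_mime` is hypothetical and is not part of Omniparse's API; the crate's own detection logic is not shown on this page and may work differently.

```rust
/// Hypothetical helper (not Omniparse's API): guess a MIME type from the
/// leading bytes of a buffer using a handful of well-known signatures.
fn sniff_mime(data: &[u8]) -> Option<&'static str> {
    if data.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some("image/png")
    } else if data.starts_with(b"\xFF\xD8\xFF") {
        Some("image/jpeg")
    } else if data.starts_with(b"PK\x03\x04") {
        // DOCX, XLSX, ODT and EPUB are ZIP containers too, so a real detector
        // would need to inspect the archive contents to tell them apart.
        Some("application/zip")
    } else {
        None
    }
}

fn main() {
    let header = b"%PDF-1.7\n";
    println!("{:?}", sniff_mime(header));
}
```

A signature check like this only looks at the first few bytes, which is why container formats need a second pass over their contents.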
§Supported formats
- Text: Plain text, JSON, CSV/TSV, XML, HTML (OpenGraph, Twitter, canonical URL, heading counts), CSS, RTF, Markdown
- Documents: PDF (version, encryption, form-field / annotation / attachment counts), DOCX, DOC, XLSX, XLS, PPTX, PPT, ODT, ODS, ODP, EPUB
- Images: JPEG (full EXIF), PNG (decompressed zTXt/iTXt), TIFF, SVG, WebP. Optional OCR routes image text to Content::Text.
- Audio: MP3 (ID3v1/v2)
- Archives: ZIP, TAR (with path-traversal detection)
See SUPPORTED_FORMATS.md for per-format metadata keys.
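The page does not say how path-traversal detection for archives is implemented. One plausible approach, sketched below under that assumption, is to walk the components of each entry name and flag absolute paths or any `..` that climbs above the extraction root. The helper `is_path_traversal` is hypothetical and is not Omniparse's API.

```rust
use std::path::{Component, Path};

/// Hypothetical helper (not Omniparse's API): returns true if an archive
/// entry name is absolute or would climb above the extraction root.
fn is_path_traversal(entry_name: &str) -> bool {
    // Treat backslashes as separators as well, since some archives use them.
    let normalized = entry_name.replace('\\', "/");
    let mut depth: i32 = 0;
    for component in Path::new(&normalized).components() {
        match component {
            // Absolute paths (or Windows drive prefixes) ignore the root.
            Component::RootDir | Component::Prefix(_) => return true,
            Component::ParentDir => {
                depth -= 1;
                if depth < 0 {
                    return true;
                }
            }
            Component::Normal(_) => depth += 1,
            Component::CurDir => {}
        }
    }
    false
}

fn main() {
    for name in ["docs/readme.txt", "a/../b.txt", "../etc/passwd", "/etc/passwd"] {
        println!("{name}: traversal = {}", is_path_traversal(name));
    }
}
```

Tracking depth rather than rejecting every `..` lets harmless names such as `a/../b.txt` through while still catching entries that escape the root.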
§Cargo features
| Feature | Default | Purpose |
|---|---|---|
| async | off | Tokio-based async extraction |
| parallel | off | Rayon-based batch processing |
| markdown | on | Markdown parser |
| svg | on | SVG parser |
| webp | on | WebP parser |
| epub | on | EPUB parser |
| mp3 | on | MP3 parser |
| pdf | on | PDF parser via lopdf + lenient fallback (weezl / ascii85) |
| pdf-extract | off | 4th-tier PDF fallback via pdf-extract (linearized / Identity-H PDFs) |
| ocr | off | Classical OCR pipeline |
| ocr-train | off | TTF → prototype trainer |
| ocr-parallel | off | Parallel per-region recognition |
| ocr-ml | off | ML OCR backend (ocrs + rten) |
§Acknowledgments
Omniparse stands on the shoulders of several pure-Rust libraries. The PDF tier specifically uses:
- lopdf: strict-tier PDF parser (xref / trailer / object dictionary parse, embedded-image extraction for the OCR path). MIT licensed.
- weezl: LZWDecode filter support in the raw_scan fallback. MIT/Apache-2.0.
- ascii85: ASCII85Decode filter support in the raw_scan fallback. MIT/Apache-2.0.
- pdf-extract (optional, behind the pdf-extract feature): 4th-tier text extraction for PDFs that lopdf can't load. MIT licensed.
EPUB support uses:
- rbook (behind the epub feature): EPUB 2/3 OPF metadata + reading-order text extraction. Apache-2.0.
See Cargo.toml for the full dependency tree and per-crate version
pins. A deny.toml policy (enforced in CI via cargo deny) keeps the
dependency tree free of GPL/AGPL copyleft.
§PDF parsing tiers
Real-world PDFs are messy — truncated downloads, linearized exports, Identity-H + /ToUnicode CMaps, and appended HTTP-chunk garbage all defeat strict parsers. Omniparse’s PDF parser is a four-tier fallback chain so the caller almost always gets text:
1. strict: lopdf::Document::load_mem. Full metadata + per-page text. Handles most well-formed PDFs.
2. repaired_xref: truncate trailing bytes after the last %%EOF, then retry the strict load. Catches HTTP-chunk leftovers / double-%%EOF.
3. raw_scan: walk stream/endstream byte ranges, decode FlateDecode / LZWDecode / ASCII85Decode / uncompressed payloads, and regex-extract Tj/TJ operators. Recovers text from PDFs lopdf can't load. Output is gated by a "looks-like-text" heuristic so glyph-index / encrypted bytes don't reach the caller.
4. pdf_extract (only with --features pdf-extract): re-parse via pdf-extract. Tolerates linearized PDFs + Identity-H + /ToUnicode CMaps (Lucidchart, Word print-to-PDF, browser print-to-PDF).
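To make the repaired_xref step concrete, here is a minimal sketch of trimming everything after the last %%EOF marker before retrying a strict load. The function `truncate_after_last_eof` is a hypothetical name chosen for illustration; Omniparse's internal implementation is not shown on this page and may differ.

```rust
/// Hypothetical helper (not Omniparse's API): returns the input truncated
/// just after the last `%%EOF` marker, or `None` if no marker is present.
fn truncate_after_last_eof(data: &[u8]) -> Option<&[u8]> {
    const MARKER: &[u8] = b"%%EOF";
    // Search from the end so the final marker wins when there are several.
    data.windows(MARKER.len())
        .rposition(|window| window == MARKER)
        .map(|start| &data[..start + MARKER.len()])
}

fn main() {
    // A PDF body followed by leftover bytes from a chunked HTTP transfer.
    let pdf = b"%PDF-1.7\n...objects...\n%%EOF\r\n0\r\n\r\ngarbage";
    let repaired = truncate_after_last_eof(pdf).expect("no %%EOF marker found");
    assert!(repaired.ends_with(b"%%EOF"));
    println!("kept {} of {} bytes", repaired.len(), pdf.len());
}
```

The trimmed slice borrows from the original buffer, so it can be handed to a second strict parse without copying.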
Every successful response carries a pdf_parse_strategy metadata
field ("strict" / "repaired_xref" / "raw_scan" / "pdf_extract").
Tiers 2–4 also set pdf_parse_partial = true and
pdf_parse_error = "<original lopdf error>". Tier 4 is the most
important opt-in for shops processing Lucidchart or Word-print
exports.
let result = omniparse::extract_from_path("document.pdf")?;
if let Some(strategy) = result.metadata.get("pdf_parse_strategy") {
println!("PDF parsed via tier: {strategy:?}");
}
§Web service example
See examples/web_service_prod.rs for a Cloud Run-ready Axum service
that wraps this library: Cloud Logging JSON output, Prometheus
/metrics, /live + /ready probes, body-size + timeout +
concurrency limits, panic catcher, graceful shutdown, and a
--healthcheck mode for distroless containers. The published
Docker image uses this binary as its ENTRYPOINT.
§Quickstart
use omniparse::extract_from_path;
let result = extract_from_path("document.pdf")?;
println!("MIME type: {}", result.mime_type);
println!("Content: {:?}", result.content);
§Extract from HTML
use omniparse::extract_from_path;
let result = extract_from_path("webpage.html")?;
if let Some(title) = result.metadata.get("title") {
println!("Page title: {:?}", title);
}
// v0.3: OpenGraph, Twitter, canonical URL, heading counts also available.
if let Some(og_title) = result.metadata.get("og_title") {
println!("og:title = {:?}", og_title);
}
§Extract from spreadsheets
use omniparse::extract_from_path;
// Works with XLSX, XLS, and ODS
let result = extract_from_path("data.xlsx")?;
if let Some(sheet_count) = result.metadata.get("sheet_count") {
println!("Number of sheets: {:?}", sheet_count);
}
§Extract from bytes with MIME type hint
use omniparse::extract_from_bytes;
let data = std::fs::read("file.json")?;
let result = extract_from_bytes(&data, Some("application/json"))?;
§Check supported formats
use omniparse::{supported_mime_types, is_mime_supported};
let types = supported_mime_types();
println!("Supported types: {:?}", types);
if is_mime_supported("application/pdf") {
println!("PDF is supported!");
}
§OCR
Off by default. One env var selects the backend at runtime:
- OMNIPARSE_OCR=classical: pure-Rust classical pipeline (ocr feature)
- OMNIPARSE_OCR=ml: ML backend via ocrs + rten (ocr-ml feature)
- OMNIPARSE_OCR=off or unset: OCR disabled (image parsers extract EXIF only)
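The mapping from the environment variable to a backend can be sketched as follows. The `OcrBackend` enum and `parse_ocr_backend` function are hypothetical names used only for illustration, and treating unrecognized values as "off" is an assumption rather than documented behavior.

```rust
use std::env;

/// Hypothetical type (not Omniparse's API) modelling the three documented states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OcrBackend {
    Off,
    Classical,
    Ml,
}

/// Maps the value of OMNIPARSE_OCR to a backend.
fn parse_ocr_backend(value: Option<&str>) -> OcrBackend {
    match value.map(str::trim) {
        Some("classical") => OcrBackend::Classical,
        Some("ml") => OcrBackend::Ml,
        // "off" and unset are documented as disabled; mapping any other
        // value to Off as well is an assumption of this sketch.
        _ => OcrBackend::Off,
    }
}

fn main() {
    let raw = env::var("OMNIPARSE_OCR").ok();
    let backend = parse_ocr_backend(raw.as_deref());
    println!("OCR backend: {backend:?}");
}
```

Keeping the parsing separate from the environment lookup makes the mapping easy to unit test without mutating process state.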
Image and PDF parsers automatically route through OCR when the gate is
set and populate ocr_status / ocr_confidence / ocr_applied
metadata.
// OMNIPARSE_OCR=classical (or =ml) activates OCR for image parsers.
let result = omniparse::extract_from_path("photo.jpg")?;
if let Some(status) = result.metadata.get("ocr_status") {
println!("ocr_status = {status:?}");
}
Direct library use of the classical engine:
use omniparse::ocr::OcrEngine;
let engine = OcrEngine::new();
let image = image::open("page.png").unwrap();
let output = engine.recognize(image)?;
println!("{}", output.text);
ML backend (requires the ocr-ml feature; pre-trained models are downloaded and SHA-256-verified on first use, or pre-fetched via the CLI command omniparse models download):
let engine = omniparse::ocr::ml::MlOcrEngine::new()?;
let image = image::open("photo.jpg").unwrap();
let output = engine.recognize(image)?;
println!("{}", output.text);
See OCR_GUIDE.md for the model-cache CLI, training custom prototypes, tuning, debugging, and the full env-var reference.
Re-exports§
- pub use core::Error;
- pub use core::Result;
- pub use core::result::Content;
- pub use core::result::ExtractionResult;
- pub use core::result::Metadata;
- pub use core::result::MetadataValue;
Modules§
- core
- Core types and functionality for Omniparse
- detection
- File type detection functionality
- parsers
- Parser implementations for various file formats
- utils
- Utility functions and helpers
Functions§
- extract_from_bytes
- Extract text and metadata from a byte slice.
- extract_from_path
- Extract text and metadata from a file at the specified path.
- is_mime_supported
- Check if a specific MIME type is supported.
- supported_mime_types
- Get a list of all supported MIME types.