Crate omniparse

§Omniparse — Rust content extraction toolkit

Apache-Tika-style detection and extraction for 25+ file formats. Pure Rust, no system libraries, optional async / parallel / OCR.

§Supported formats

  • Text: Plain text, JSON, CSV/TSV, XML, HTML (OpenGraph, Twitter, canonical URL, heading counts), CSS, RTF, Markdown
  • Documents: PDF (version, encryption, form-field / annotation / attachment counts), DOCX, DOC, XLSX, XLS, PPTX, PPT, ODT, ODS, ODP, EPUB
  • Images: JPEG (full EXIF), PNG (decompressed zTXt/iTXt), TIFF, SVG, WebP. Optional OCR routes image text to Content::Text.
  • Audio: MP3 (ID3v1/v2)
  • Archives: ZIP, TAR (with path-traversal detection)

See SUPPORTED_FORMATS.md for per-format metadata keys.
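
The path-traversal detection mentioned for archives can be pictured with a small standard-library check. This is only a sketch of the general technique: the function name `is_unsafe_entry_path` is hypothetical, and the rules below (reject absolute paths, drive prefixes, and any `..` segment) are an assumption about what such a check typically covers, not a description of omniparse's own logic.

```rust
/// Hypothetical helper: flags archive entry names that could escape the
/// directory an archive is unpacked into. Deliberately conservative: any
/// `..` segment is rejected, even if the path would stay inside the root.
fn is_unsafe_entry_path(name: &str) -> bool {
    let bytes = name.as_bytes();

    // Absolute paths in Unix or Windows style.
    let absolute = name.starts_with('/') || name.starts_with('\\');

    // Windows drive prefix such as "C:".
    let drive_prefix =
        bytes.len() >= 2 && bytes[0].is_ascii_alphabetic() && bytes[1] == b':';

    // Any parent-directory segment, with either separator.
    let climbs = name
        .split(|c: char| c == '/' || c == '\\')
        .any(|segment| segment == "..");

    absolute || drive_prefix || climbs
}
```

Working on the raw entry name rather than `std::path::Path` keeps the result identical on every platform, which matters because TAR and ZIP archives created on Windows may be inspected on Linux and vice versa.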

§Cargo features

| Feature | Default | Purpose |
|---|---|---|
| async | off | Tokio-based async extraction |
| parallel | off | Rayon-based batch processing |
| markdown | on | Markdown parser |
| svg | on | SVG parser |
| webp | on | WebP parser |
| epub | on | EPUB parser |
| mp3 | on | MP3 parser |
| pdf | on | PDF parser via lopdf + lenient fallback (weezl / ascii85) |
| pdf-extract | off | 4th-tier PDF fallback via pdf-extract (linearized / Identity-H PDFs) |
| ocr | off | Classical OCR pipeline |
| ocr-train | off | TTF → prototype trainer |
| ocr-parallel | off | Parallel per-region recognition |
| ocr-ml | off | ML OCR backend (ocrs + rten) |
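
These are ordinary Cargo features, so they are selected in the dependent crate's Cargo.toml. The fragment below is a hedged sketch: the feature names come from the list above, while the wildcard version is a placeholder to replace with the release you actually depend on.

```toml
[dependencies]
# Keep the default parsers and opt in to async extraction, the 4th-tier
# PDF fallback, and classical OCR. "*" is a placeholder version.
omniparse = { version = "*", features = ["async", "pdf-extract", "ocr"] }

# Alternative: drop the default parsers and enable only PDF support.
# omniparse = { version = "*", default-features = false, features = ["pdf"] }
```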

§Acknowledgments

Omniparse stands on the shoulders of several pure-Rust libraries. The PDF tier specifically uses:

  • lopdf — strict-tier PDF parser (xref / trailer / object dictionary parse, embedded-image extraction for the OCR path). MIT licensed.
  • weezl — LZWDecode filter support in the raw_scan fallback. MIT/Apache-2.0.
  • ascii85 — ASCII85Decode filter support in the raw_scan fallback. MIT/Apache-2.0.
  • pdf-extract (optional, behind the pdf-extract feature) — 4th-tier text extraction for PDFs that lopdf can’t load. MIT licensed.

EPUB support uses:

  • rbook (behind the epub feature) — EPUB 2/3 OPF metadata + reading-order text extraction. Apache-2.0.

See Cargo.toml for the full dependency tree and per-crate version pins. A deny.toml policy (enforced in CI via cargo deny) keeps the dependency tree free of GPL/AGPL copyleft.

§PDF parsing tiers

Real-world PDFs are messy — truncated downloads, linearized exports, Identity-H + /ToUnicode CMaps, and appended HTTP-chunk garbage all defeat strict parsers. Omniparse’s PDF parser is a four-tier fallback chain so the caller almost always gets text:

  1. strict: lopdf::Document::load_mem. Yields full metadata and per-page text; handles most well-formed PDFs.
  2. repaired_xref: truncate trailing bytes after the last %%EOF, then retry the strict load. Catches HTTP-chunk leftovers and double-%%EOF files.
  3. raw_scan: walk stream/endstream byte ranges, decode FlateDecode / LZWDecode / ASCII85Decode / uncompressed payloads, and regex-extract Tj / TJ operators. Recovers text from PDFs lopdf can’t load. Output is gated by a “looks-like-text” heuristic so glyph-index or encrypted bytes don’t reach the caller.
  4. pdf_extract (only with --features pdf-extract): re-parse via pdf-extract. Tolerates linearized PDFs, Identity-H fonts, and /ToUnicode CMaps (Lucidchart, Word print-to-PDF, browser print-to-PDF).
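
Two of the byte-level steps named in tiers 2 and 3 can be sketched with the standard library alone. Both functions are illustrative assumptions rather than omniparse's code: `truncate_after_last_eof` shows the trailing-byte cut before a strict retry, and `looks_like_text` is one plausible form of a text gate, with an 85% threshold chosen purely for the example.

```rust
/// Tier 2 idea: keep everything up to and including the last `%%EOF`
/// marker and drop whatever follows. Returns `None` if no marker exists.
fn truncate_after_last_eof(data: &[u8]) -> Option<&[u8]> {
    const MARKER: &[u8] = b"%%EOF";
    // Scan from the end so the *last* marker wins.
    let start = data
        .windows(MARKER.len())
        .rposition(|window| window == MARKER)?;
    Some(&data[..start + MARKER.len()])
}

/// Tier 3 idea: accept a decoded candidate only if most of its characters
/// are letters, digits, whitespace, or ASCII punctuation.
/// The 85% cut-off is an assumed value for illustration.
fn looks_like_text(candidate: &str) -> bool {
    let total = candidate.chars().count();
    if total == 0 {
        return false;
    }
    let readable = candidate
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace() || c.is_ascii_punctuation())
        .count();
    // Integer arithmetic avoids floating-point rounding at the threshold.
    readable * 100 >= total * 85
}
```

A caller would hand the truncated slice back to the strict loader, and would discard any raw_scan candidate for which the gate returns false.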

Every successful response carries a pdf_parse_strategy metadata field ("strict" / "repaired_xref" / "raw_scan" / "pdf_extract"). Tiers 2–4 also set pdf_parse_partial = true and pdf_parse_error = "<original lopdf error>". Tier 4 is opt-in; enable it if you process Lucidchart or Word print-to-PDF exports.

let result = omniparse::extract_from_path("document.pdf")?;
if let Some(strategy) = result.metadata.get("pdf_parse_strategy") {
    println!("PDF parsed via tier: {strategy:?}");
}
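
The tier bookkeeping described above can be sketched generically. This is a minimal illustration, not omniparse's implementation: the `Tier` alias, `extract_with_fallback`, and the string-valued metadata map are hypothetical stand-ins, and only the three metadata key names are taken from the text above.

```rust
use std::collections::HashMap;

/// Hypothetical tier: a strategy name plus a parser that either yields
/// text or an error message.
type Tier = (&'static str, fn(&[u8]) -> Result<String, String>);

/// Tries each tier in order and records which one succeeded.
/// If an earlier tier failed first, the result is marked partial and the
/// first error is preserved, mirroring the metadata described above.
fn extract_with_fallback(
    data: &[u8],
    tiers: &[Tier],
) -> Result<(String, HashMap<String, String>), String> {
    let mut first_error: Option<String> = None;

    for &(name, parse) in tiers {
        match parse(data) {
            Ok(text) => {
                let mut meta = HashMap::new();
                meta.insert("pdf_parse_strategy".to_string(), name.to_string());
                if let Some(err) = &first_error {
                    meta.insert("pdf_parse_partial".to_string(), "true".to_string());
                    meta.insert("pdf_parse_error".to_string(), err.clone());
                }
                return Ok((text, meta));
            }
            Err(e) => {
                // Keep only the error from the first (strict) attempt.
                first_error.get_or_insert(e);
            }
        }
    }

    Err(first_error.unwrap_or_else(|| "no tiers configured".to_string()))
}
```

Keeping the first error rather than the last means the caller still sees why the strict parse failed, even when a later tier rescues the text.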

§Web service example

See examples/web_service_prod.rs for a Cloud Run-ready Axum service that wraps this library: Cloud Logging JSON output, Prometheus /metrics, /live + /ready probes, body-size + timeout + concurrency limits, panic catcher, graceful shutdown, and a --healthcheck mode for distroless containers. The published Docker image uses this binary as its ENTRYPOINT.

§Quickstart

use omniparse::extract_from_path;

let result = extract_from_path("document.pdf")?;
println!("MIME type: {}", result.mime_type);
println!("Content: {:?}", result.content);

§Extract from HTML

use omniparse::extract_from_path;

let result = extract_from_path("webpage.html")?;
if let Some(title) = result.metadata.get("title") {
    println!("Page title: {:?}", title);
}
// v0.3: OpenGraph, Twitter, canonical URL, heading counts also available.
if let Some(og_title) = result.metadata.get("og_title") {
    println!("og:title = {:?}", og_title);
}

§Extract from spreadsheets

use omniparse::extract_from_path;

// Works with XLSX, XLS, and ODS
let result = extract_from_path("data.xlsx")?;
if let Some(sheet_count) = result.metadata.get("sheet_count") {
    println!("Number of sheets: {:?}", sheet_count);
}

§Extract from bytes with MIME type hint

use omniparse::extract_from_bytes;

let data = std::fs::read("file.json")?;
let result = extract_from_bytes(&data, Some("application/json"))?;
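
A hint matters because content sniffing alone cannot always tell formats apart. The sketch below shows plain magic-byte detection under stated assumptions: `sniff_mime` is a hypothetical helper, not part of omniparse's detection module, and it covers only four well-known signatures.

```rust
/// Hypothetical magic-byte sniffer covering a handful of signatures.
/// Returns `None` when no signature matches (for example, JSON or CSV,
/// which have no fixed header bytes).
fn sniff_mime(data: &[u8]) -> Option<&'static str> {
    let signatures: [(&[u8], &'static str); 4] = [
        (b"%PDF-", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xFF\xD8\xFF", "image/jpeg"),
        // DOCX, XLSX, PPTX, ODT, and EPUB are also ZIP containers, so this
        // signature alone cannot identify them.
        (b"PK\x03\x04", "application/zip"),
    ];

    for (magic, mime) in signatures {
        if data.starts_with(magic) {
            return Some(mime);
        }
    }
    None
}
```

Text formats such as JSON fall through to `None` here, and ZIP-based document formats all share one signature, which is where a caller-supplied MIME type removes the ambiguity.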

§Check supported formats

use omniparse::{supported_mime_types, is_mime_supported};

let types = supported_mime_types();
println!("Supported types: {:?}", types);

if is_mime_supported("application/pdf") {
    println!("PDF is supported!");
}

§OCR

Off by default. One env var selects the backend at runtime:

  • OMNIPARSE_OCR=classical — pure-Rust classical pipeline (ocr feature)
  • OMNIPARSE_OCR=ml — ML backend via ocrs + rten (ocr-ml feature)
  • OMNIPARSE_OCR=off / unset — OCR disabled (image parsers extract EXIF only)

Image and PDF parsers automatically route through OCR when the gate is set and populate ocr_status / ocr_confidence / ocr_applied metadata.

// OMNIPARSE_OCR=classical (or =ml) activates OCR for image parsers.
let result = omniparse::extract_from_path("photo.jpg")?;
if let Some(status) = result.metadata.get("ocr_status") {
    println!("ocr_status = {status:?}");
}
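
The runtime gate described above amounts to mapping one environment variable to a backend choice. A minimal sketch, assuming exact lowercase matching and assuming that unrecognized values behave like "off" (the text above does not specify either); the `OcrBackend` enum and both function names are hypothetical.

```rust
/// Hypothetical representation of the three documented gate states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OcrBackend {
    Off,
    Classical,
    Ml,
}

/// Maps the raw variable value to a backend.
/// Assumption: anything other than "classical" or "ml" disables OCR.
fn parse_ocr_backend(value: Option<&str>) -> OcrBackend {
    match value {
        Some("classical") => OcrBackend::Classical,
        Some("ml") => OcrBackend::Ml,
        _ => OcrBackend::Off,
    }
}

/// Reads OMNIPARSE_OCR from the process environment.
fn ocr_backend_from_env() -> OcrBackend {
    parse_ocr_backend(std::env::var("OMNIPARSE_OCR").ok().as_deref())
}
```

Separating the pure parser from the environment lookup keeps the mapping easy to unit-test without mutating process state.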

Direct library use of the classical engine:

use omniparse::ocr::OcrEngine;
let engine = OcrEngine::new();
let image = image::open("page.png").unwrap();
let output = engine.recognize(image)?;
println!("{}", output.text);

ML backend (requires the ocr-ml feature; pre-trained models are downloaded and SHA-256-verified on first use, or can be pre-fetched with the CLI command omniparse models download):

let engine = omniparse::ocr::ml::MlOcrEngine::new()?;
let image = image::open("photo.jpg").unwrap();
let output = engine.recognize(image)?;
println!("{}", output.text);

See OCR_GUIDE.md for the model-cache CLI, training custom prototypes, tuning, debugging, and the full env-var reference.

§Re-exports

pub use core::Error;
pub use core::Result;
pub use core::result::Content;
pub use core::result::ExtractionResult;
pub use core::result::Metadata;
pub use core::result::MetadataValue;

§Modules

  • core: Core types and functionality for Omniparse
  • detection: File type detection functionality
  • parsers: Parser implementations for various file formats
  • utils: Utility functions and helpers

§Functions

  • extract_from_bytes: Extract text and metadata from a byte slice.
  • extract_from_path: Extract text and metadata from a file at the specified path.
  • is_mime_supported: Check if a specific MIME type is supported.
  • supported_mime_types: Get a list of all supported MIME types.