Omniparse
A Rust toolkit for detecting file types and extracting metadata, text, and content from a wide range of file formats (25+ at present). Omniparse provides both a command-line interface and a library API, serving as a Rust equivalent to Apache Tika.
Features
- Automatic Type Detection: Identifies file types using magic bytes, content analysis, and extension fallback
- Multiple Format Support: Extracts content from 25+ formats across text, document, image, audio, and archive categories
- Rich Metadata Extraction: Full EXIF for JPEG/TIFF, OpenGraph / Twitter / canonical for HTML, ID3 for MP3, OPF for EPUB, version/encryption/forms/annotations for PDF, and more
- OCR Subsystem: Optional classical and ML OCR pipelines for images and scanned PDFs. Pure Rust. Models download on first use for the ML backend (or pre-fetch via omniparse models download); classical backend has no external dependencies.
- Production Web Service: Ship-ready Axum example (examples/web_service_prod.rs) with Cloud Logging JSON, Prometheus /metrics, liveness/readiness probes, request limits, panic catcher, and graceful shutdown. It is baked into the published Docker image and one-command deployable to Google Cloud Run via deploy/cloud-run/deploy.sh.
- Robust PDF parsing: Four-tier fallback chain (strict via lopdf → trailing-junk repair → raw stream-byte scan with FlateDecode/LZWDecode/ASCII85Decode → optional pdf-extract for linearized / Identity-H + /ToUnicode CMap PDFs). Real-world inputs from Lucidchart, Word print-to-PDF, browser print-to-PDF, and truncated downloads all yield text instead of "Invalid file trailer". A pdf_parse_strategy metadata field surfaces which tier ran.
- Dual Interface: Use as a CLI tool or integrate as a library in your Rust applications
- Pure Rust Implementation: Minimal dependencies, no external system libraries required
- Async Support: Optional async API for non-blocking operations
- Parallel Processing: Batch process multiple files in parallel for better performance
- Streaming Support: Memory-efficient processing of large files
- Security Hardening: ZIP-bomb detection, XML entity limits, archive path-traversal detection, strict prototype validation
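The PDF fallback chain in the list above amounts to an ordered list of strategies, tried until one succeeds, with the winning tier recorded alongside the text. The following is a minimal sketch of that pattern using only the standard library. It is not Omniparse's implementation: `Parsed`, `strict`, `trailing_junk_repair`, and `parse_with_fallback` are hypothetical names, the tier bodies are toy stand-ins, and only two of the four tiers are modeled.

```rust
/// Result of a successful tier: the extracted text plus the tier label
/// (comparable in spirit to a pdf_parse_strategy metadata field).
#[derive(Debug, PartialEq)]
pub struct Parsed {
    pub text: String,
    pub strategy: &'static str,
}

/// A tier takes raw bytes and yields text or an error message.
type Tier = fn(&[u8]) -> Result<String, String>;

// Hypothetical stand-in for a strict parser: insists on a clean "%%EOF" ending.
fn strict(bytes: &[u8]) -> Result<String, String> {
    if bytes.ends_with(b"%%EOF") {
        Ok(String::from_utf8_lossy(bytes).into_owned())
    } else {
        Err("Invalid file trailer".to_string())
    }
}

// Hypothetical stand-in for trailing-junk repair: drop everything after the
// last "%%EOF" marker, then hand the trimmed bytes back to the strict tier.
fn trailing_junk_repair(bytes: &[u8]) -> Result<String, String> {
    let marker = b"%%EOF";
    let pos = bytes
        .windows(marker.len())
        .rposition(|w| w == marker)
        .ok_or_else(|| "no %%EOF marker found".to_string())?;
    strict(&bytes[..pos + marker.len()])
}

/// Try each tier in order. Return the first success with its tier label,
/// or every tier's error if all of them fail.
pub fn parse_with_fallback(bytes: &[u8]) -> Result<Parsed, Vec<String>> {
    let tiers = [
        ("strict", strict as Tier),
        ("trailing_junk_repair", trailing_junk_repair as Tier),
    ];
    let mut errors = Vec::new();
    for (name, tier) in tiers {
        match tier(bytes) {
            Ok(text) => return Ok(Parsed { text, strategy: name }),
            Err(e) => errors.push(format!("{}: {}", name, e)),
        }
    }
    Err(errors)
}
```

Keeping the tiers in a plain array makes the order explicit and lets an optional tier be appended behind a feature flag without touching the loop.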
Supported Formats
Text Formats
- Plain Text (TXT)
- JSON
- CSV/TSV
- XML
- HTML (OpenGraph, Twitter Card, canonical URL, viewport, heading counts)
- CSS
- RTF (Rich Text Format)
- Markdown (via pulldown-cmark; optional markdown feature, default on)
Document Formats
- Microsoft Word (DOCX, DOC)
- Microsoft Excel (XLSX, XLS)
- Microsoft PowerPoint (PPTX, PPT)
- OpenDocument Text (ODT)
- OpenDocument Spreadsheet (ODS)
- OpenDocument Presentation (ODP)
Document Formats (added)
- EPUB (OPF metadata, spine walk, chapter text; optional epub feature)
Image Formats
- JPEG (full EXIF via kamadak-exif, optional OCR)
- PNG (text chunks including decompressed zTXt/iTXt, optional OCR)
- TIFF (EXIF via shared helper, optional OCR)
- SVG (title, desc, viewBox, text nodes, element counts; optional svg feature)
- WebP (dimensions, EXIF, optional OCR; optional webp feature)
Audio Formats
- MP3 (ID3v1/v2 tags: title, artist, album, genre, year, track, duration; optional mp3 feature)
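For readers unfamiliar with ID3v1, the tag is a fixed 128-byte block at the end of the file: the marker "TAG", then title (30 bytes), artist (30), album (30), year (4), comment (30), and genre (1). The sketch below reads the first four fields using only the standard library. It is an illustration of the layout, not the code path Omniparse uses (the dependency table lists the id3 crate for that); `Id3v1`, `field`, and `parse_id3v1` are hypothetical names, and decoding with lossy UTF-8 is a simplification, since ID3v1 text is commonly ISO-8859-1.

```rust
/// Subset of the ID3v1 fields, decoded to owned strings.
#[derive(Debug, PartialEq)]
pub struct Id3v1 {
    pub title: String,
    pub artist: String,
    pub album: String,
    pub year: String,
}

// Fields are fixed-width and padded with NUL bytes or spaces:
// cut at the first NUL, then trim trailing whitespace.
fn field(bytes: &[u8]) -> String {
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    String::from_utf8_lossy(&bytes[..end]).trim_end().to_string()
}

/// Parse the trailing 128-byte ID3v1 block, if present.
pub fn parse_id3v1(file: &[u8]) -> Option<Id3v1> {
    if file.len() < 128 {
        return None;
    }
    let tag = &file[file.len() - 128..];
    if !tag.starts_with(b"TAG") {
        return None;
    }
    Some(Id3v1 {
        title: field(&tag[3..33]),
        artist: field(&tag[33..63]),
        album: field(&tag[63..93]),
        year: field(&tag[93..97]),
    })
}
```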
Archive Formats
- ZIP (with path-traversal detection via contains_unsafe_paths metadata)
- TAR (with path-traversal detection)
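Path-traversal detection for archives comes down to checking each entry name for components that could escape the extraction root. Here is a minimal sketch of one way to do that with the standard library. The functions are hypothetical (the second one merely borrows the name of the contains_unsafe_paths metadata field), and the exact rules Omniparse applies may differ.

```rust
use std::path::{Component, Path};

/// Returns true if an archive entry name could escape the extraction root.
pub fn is_unsafe_entry(name: &str) -> bool {
    // Treat backslashes as separators so Windows-style names are caught on any host.
    let normalized = name.replace('\\', "/");

    // Drive prefixes such as "C:/..." only parse as a prefix on Windows,
    // so check for them explicitly. This is deliberately conservative.
    let bytes = normalized.as_bytes();
    if bytes.len() >= 2 && bytes[0].is_ascii_alphabetic() && bytes[1] == b':' {
        return true;
    }

    // Any "..", leading "/", or platform prefix makes the entry unsafe.
    Path::new(&normalized).components().any(|c| match c {
        Component::ParentDir | Component::RootDir | Component::Prefix(_) => true,
        Component::CurDir | Component::Normal(_) => false,
    })
}

/// Returns true if any entry name in the listing is unsafe.
pub fn contains_unsafe_paths<'a, I>(names: I) -> bool
where
    I: IntoIterator<Item = &'a str>,
{
    names.into_iter().any(is_unsafe_entry)
}
```

Flagging every ".." component, even one that would resolve inside the root, is stricter than necessary but avoids having to normalize paths before deciding.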
Installation
As a Library
Add Omniparse to your Cargo.toml:
[dependencies]
omniparse = "0.4"
For async support:
[dependencies]
omniparse = { version = "0.4", features = ["async"] }
For parallel processing:
[dependencies]
omniparse = { version = "0.4", features = ["parallel"] }
For broader PDF coverage (Lucidchart / Word print-to-PDF / linearized PDFs that the default lopdf-based tiers can't load):
[dependencies]
omniparse = { version = "0.4", features = ["pdf-extract"] }
To build without PDF support at all (smaller dependency tree):
[dependencies]
omniparse = { version = "0.4", default-features = false, features = ["markdown", "svg", "webp", "epub", "mp3"] }
Full feature reference: see Cargo features in this README or the table in cargo doc --open.
Cargo features
| Feature | Default | Purpose |
|---|---|---|
| pdf | on | PDF parsing via lopdf + lenient raw_scan fallback (Flate/LZW/ASCII85) |
| pdf-extract | off | 4th-tier PDF fallback via pdf-extract (linearized / Identity-H CMaps) |
| markdown | on | Markdown parser |
| svg | on | SVG parser |
| webp | on | WebP parser |
| epub | on | EPUB parser |
| mp3 | on | MP3 ID3v1/v2 parser |
| async | off | tokio-backed extract_from_path_async |
| parallel | off | rayon-backed process_files_parallel |
| ocr | off | Classical OCR pipeline (pure Rust, no external deps) |
| ocr-ml | off | ML OCR backend via ocrs + rten, models auto-downloaded |
| ocr-train | off | TTF/OTF → prototype trainer for the classical OCR pipeline |
| ocr-parallel | off | Parallel per-region OCR recognition (implies ocr + parallel) |
OCR — quickstart
Two backends; one env var selects which runs.
# ML backend (recommended for photos / screenshots / unknown typography)
OMNIPARSE_OCR=ml
# Classical backend (pure Rust, no downloads — clean printed scans only)
OMNIPARSE_OCR=classical
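In library code, selecting a backend from a single environment variable can be sketched as follows. This is an illustration only: `OcrBackend`, `parse_backend`, and `backend_from_env` are hypothetical names, and the trimming, case-insensitive matching, and treatment of unset or unknown values as "no backend" are assumptions rather than documented Omniparse behavior.

```rust
use std::env;

/// The two backends named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrBackend {
    Ml,
    Classical,
}

/// Map the raw variable value to a backend.
/// Returns None when the variable is unset or holds an unrecognized value.
pub fn parse_backend(value: Option<&str>) -> Option<OcrBackend> {
    match value.map(|v| v.trim().to_ascii_lowercase()).as_deref() {
        Some("ml") => Some(OcrBackend::Ml),
        Some("classical") => Some(OcrBackend::Classical),
        _ => None,
    }
}

/// Read OMNIPARSE_OCR from the process environment.
pub fn backend_from_env() -> Option<OcrBackend> {
    parse_backend(env::var("OMNIPARSE_OCR").ok().as_deref())
}
```

Splitting the pure parsing step from the environment read keeps the mapping testable without mutating process-wide state.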
Prefer a container? docker run --rm -p 3000:3000 ghcr.io/sirhco/omniparse-web:latest
launches the Axum web service with ML models baked in (see
Dockerfile and examples/WEB_SERVICE_GUIDE.md).
📖 Full OCR Guide (OCR_GUIDE.md): backend chooser, model-cache CLI, training custom prototypes, tuning, debugging, library API, FAQ.
As a CLI Tool
Install using Cargo:
Or build from source:
The binary will be available at target/release/omniparse.
Library Usage
Basic Extraction
use extract_from_path;
Extract from Bytes
use extract_from_bytes;
Async Extraction
use extract_from_path_async;
async
Check Supported Formats
use ;
Batch Processing
use Extractor;
use process_files_parallel;
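To show the shape of parallel batch processing without depending on the crate, here is a standard-library sketch that splits a batch across scoped threads and returns results in input order. It is a stand-in, not the rayon-backed process_files_parallel named above: `process_in_parallel` and `extract_word_count` are hypothetical, and the per-item work is a toy word count over in-memory strings rather than real file extraction.

```rust
use std::thread;

// Hypothetical per-item work: count whitespace-separated words.
fn extract_word_count(content: &str) -> usize {
    content.split_whitespace().count()
}

/// Process `inputs` on up to `workers` threads, preserving input order.
pub fn process_in_parallel(inputs: &[&str], workers: usize) -> Vec<usize> {
    if inputs.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every input lands in exactly one chunk.
    let chunk_size = (inputs.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn one scoped thread per chunk; scoped threads may borrow `inputs`.
        let handles: Vec<_> = inputs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|s| extract_word_count(s))
                        .collect::<Vec<usize>>()
                })
            })
            .collect();

        // Join in spawn order so results line up with the inputs.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<usize>>()
    })
}
```

A work-stealing pool such as rayon balances uneven files better than fixed chunks; the fixed split here is chosen only to keep the example dependency-free.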
CLI Usage
Basic Extraction
# Extract from a single file
# Extract from multiple files
Output Formats
# JSON output
# YAML output
# Save to file
Metadata Only
# Extract only metadata, no content
Type Detection Only
# Detect file type without extraction
Parallel Processing
# Process multiple files in parallel
Verbose Output
# Enable verbose logging
Combined Options
# Metadata only, JSON format, parallel processing
Format-Specific Examples
# Extract from HTML files (web pages)
# Extract from CSS files (stylesheets)
# Extract from RTF files (rich text)
# Extract from spreadsheets (Excel and OpenDocument)
# Extract from presentations (PowerPoint and OpenDocument)
# Extract from legacy Office files (DOC, XLS, PPT)
# Mixed format batch processing
Error Handling
Omniparse provides detailed error types for different failure scenarios:
use ;
match extract_from_path
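As an illustration of matching on distinct failure classes, the sketch below defines a small error enum and handles each case differently. The type and its variants are hypothetical and chosen for the example; Omniparse's actual error type and variant names are not shown in this README and may differ.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error type with one variant per failure class.
#[derive(Debug)]
pub enum ExtractError {
    Io(io::Error),
    UnsupportedFormat(String),
    Parse { format: String, message: String },
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ExtractError::Io(e) => write!(f, "could not read input: {}", e),
            ExtractError::UnsupportedFormat(kind) => write!(f, "unsupported format: {}", kind),
            ExtractError::Parse { format, message } => {
                write!(f, "{} parse failed: {}", format, message)
            }
        }
    }
}

impl Error for ExtractError {}

// Lets `?` convert I/O failures into ExtractError automatically.
impl From<io::Error> for ExtractError {
    fn from(e: io::Error) -> Self {
        ExtractError::Io(e)
    }
}

/// Turn an extraction result into a short status line.
pub fn describe(result: Result<String, ExtractError>) -> String {
    match result {
        Ok(text) => format!("extracted {} bytes", text.len()),
        // Unsupported input is often worth skipping rather than aborting a batch.
        Err(ExtractError::UnsupportedFormat(kind)) => {
            format!("skipped: unsupported format {}", kind)
        }
        // Everything else is reported through the Display impl.
        Err(other) => format!("failed: {}", other),
    }
}
```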
New Format Support
Omniparse has recently added support for 10 additional formats:
Web Formats
- HTML: Extract visible text and metadata from web pages
- CSS: Analyze stylesheets with rule and selector counting
Office Formats
- XLSX/XLS: Extract data from Excel spreadsheets (modern and legacy)
- PPTX/PPT: Extract text from PowerPoint presentations (modern and legacy)
- DOC: Extract content from legacy Word documents
OpenDocument Formats
- ODS: Extract data from OpenDocument spreadsheets
- ODP: Extract text from OpenDocument presentations
Rich Text
- RTF: Extract plain text from Rich Text Format files
See SUPPORTED_FORMATS.md for detailed information about each format.
Performance
Omniparse is designed for performance:
- Streaming: Large files are processed using streaming to limit memory usage
- Parallel Processing: Batch operations can leverage multiple CPU cores
- Pure Rust: No FFI overhead or external process spawning
- Efficient Detection: Magic byte detection is fast and accurate
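Magic-byte detection is cheap because it only compares a short prefix of the input against known signatures, falling back to the file extension when no signature matches. A minimal sketch of that order of checks follows. `FileKind`, `detect_by_magic`, `detect_by_extension`, and `detect` are hypothetical names, only four well-known signatures are covered, and the content-analysis step mentioned under Features is omitted.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileKind {
    Pdf,
    Png,
    Jpeg,
    Zip,
    PlainText,
    Unknown,
}

// Compare the leading bytes against well-known signatures.
// Note: ZIP-based formats (DOCX, EPUB, ...) share the ZIP signature and
// would need a look inside the container to be told apart.
fn detect_by_magic(bytes: &[u8]) -> Option<FileKind> {
    if bytes.starts_with(b"%PDF-") {
        Some(FileKind::Pdf)
    } else if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(FileKind::Png)
    } else if bytes.starts_with(b"\xFF\xD8\xFF") {
        Some(FileKind::Jpeg)
    } else if bytes.starts_with(b"PK\x03\x04") {
        Some(FileKind::Zip)
    } else {
        None
    }
}

// Fallback: map a case-insensitive file extension to a kind.
fn detect_by_extension(file_name: &str) -> Option<FileKind> {
    let ext = Path::new(file_name).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(FileKind::Pdf),
        "png" => Some(FileKind::Png),
        "jpg" | "jpeg" => Some(FileKind::Jpeg),
        "zip" => Some(FileKind::Zip),
        "txt" => Some(FileKind::PlainText),
        _ => None,
    }
}

/// Magic bytes win over the extension; the extension is only a fallback.
pub fn detect(bytes: &[u8], file_name: Option<&str>) -> FileKind {
    detect_by_magic(bytes)
        .or_else(|| file_name.and_then(detect_by_extension))
        .unwrap_or(FileKind::Unknown)
}
```

Letting the signature override the extension means a PNG renamed to photo.jpg is still reported as PNG.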
Performance targets on standard hardware, with measured results in parentheses where available:
- Text files (10 MB): < 100ms
- HTML files (1 MB): < 100ms (actual: ~0.6ms)
- PDF documents: 200-500ms depending on size
- XLSX files (10K cells): < 500ms (actual: ~0.9ms for small files)
- PPTX files (100 slides): < 1000ms (actual: ~0.6ms for small files)
- Image metadata: < 50ms
All performance targets met or exceeded. See FINAL_PERFORMANCE_SUMMARY.md for comprehensive benchmark results.
Architecture
Omniparse follows a modular architecture:
┌─────────────────┐
│ CLI / API │
└────────┬────────┘
│
┌────────▼────────┐
│ Extractor │
└────┬───────┬────┘
│ │
┌────▼───┐ ┌▼──────────┐
│Detector│ │ Registry │
└────────┘ └─────┬──────┘
│
┌───────┴───────┐
│ Parsers │
├───────────────┤
│ Text │
│ Document │
│ Image │
│ Archive │
└───────────────┘
- Extractor: Orchestrates detection and parsing
- Detector: Identifies file types using multiple methods
- Registry: Manages available parsers
- Parsers: Format-specific extraction implementations
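The relationship between these four components can be sketched in a few lines. This is a toy model of the layering, not Omniparse's real types: the `Parser` trait, the string type labels, and every method name below are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Parsers: format-specific extraction behind a common interface.
pub trait Parser {
    fn parse(&self, bytes: &[u8]) -> Result<String, String>;
}

struct PlainTextParser;

impl Parser for PlainTextParser {
    fn parse(&self, bytes: &[u8]) -> Result<String, String> {
        String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
    }
}

/// Registry: maps a detected type label to the parser that handles it.
#[derive(Default)]
pub struct Registry {
    parsers: HashMap<&'static str, Box<dyn Parser>>,
}

impl Registry {
    pub fn register(&mut self, kind: &'static str, parser: Box<dyn Parser>) {
        self.parsers.insert(kind, parser);
    }
}

// Detector: a toy rule that recognizes the PDF signature and
// treats everything else as plain text.
fn detect(bytes: &[u8]) -> &'static str {
    if bytes.starts_with(b"%PDF-") {
        "application/pdf"
    } else {
        "text/plain"
    }
}

/// Extractor: orchestrates detection, registry lookup, and parsing.
pub struct Extractor {
    registry: Registry,
}

impl Extractor {
    pub fn with_defaults() -> Self {
        let mut registry = Registry::default();
        registry.register("text/plain", Box::new(PlainTextParser));
        Extractor { registry }
    }

    pub fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        let kind = detect(bytes);
        let parser = self
            .registry
            .parsers
            .get(kind)
            .ok_or_else(|| format!("no parser registered for {}", kind))?;
        parser.parse(bytes)
    }
}
```

Because the extractor only talks to the registry, supporting a new format in this model means registering one more parser rather than changing the orchestration code.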
Documentation
Version 0.4 (current)
- RELEASE_NOTES_v0.4.0.md - omniparse models CLI, unified OMNIPARSE_OCR env var, Dockerfile + GHCR image, production Cloud Run example
- OCR_GUIDE.md - Single canonical OCR reference: backend chooser, model-cache CLI, training, tuning, debugging
- examples/WEB_SERVICE_GUIDE.md - Web service guide: minimal demo + production example + Cloud Run deploy
- CHANGELOG.md - Full changelog
Version 0.3
- RELEASE_NOTES_v0.3.0.md - v0.3.0 enhancements, feature flags, env var reference
- MIGRATION_v0.3.0.md - Upgrade guide from v0.2.x
General
- SUPPORTED_FORMATS.md - Complete list of supported formats
- examples/ - Working code examples for all formats and OCR modes
- API Documentation - Run cargo doc --open --features "ocr-ml ocr-train" for full API docs
Historical
- CLI_NEW_FORMATS_GUIDE.md - v0.2 CLI guide for initially-added formats
- MIGRATION_GUIDE.md - v0.2 migration guide
Contributing
Contributions are welcome! Areas for contribution:
- Adding support for new file formats
- Improving type detection accuracy
- Performance optimizations
- Documentation improvements
- Bug fixes
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Acknowledgments
Inspired by Apache Tika, the Java-based content analysis toolkit.
Core dependencies
Pure-Rust crates carrying the heavy lifting. Every license below is permissive and compatible with omniparse's MIT/Apache-2.0 dual license. The only copyleft is cssparser's MPL-2.0, which is weak/file-level — it covers only that crate's own files and does not affect omniparse's license. A deny.toml policy enforces this (no GPL/AGPL allowed); see Dependency licensing below.
| Crate | Used for | License |
|---|---|---|
| lopdf | Strict-tier PDF parsing (xref / trailer / object dictionary, embedded-image extraction for OCR) | MIT |
| pdf-extract | 4th-tier PDF fallback for linearized / Identity-H + /ToUnicode CMap PDFs (Lucidchart, Word print-to-PDF). Behind the pdf-extract feature | MIT |
| weezl | LZWDecode stream filter in the raw_scan PDF fallback | MIT / Apache-2.0 |
| ascii85 | ASCII85Decode stream filter in the raw_scan PDF fallback | MIT / Apache-2.0 |
| ocrs + rten | ML OCR backend (text-detection + text-recognition models) | MIT / Apache-2.0 |
| image | Image decode | MIT / Apache-2.0 |
| kamadak-exif | EXIF metadata | BSD-2-Clause |
| calamine | XLSX / XLS / ODS parsing | MIT |
| scraper + cssparser | HTML + CSS parsing | ISC (scraper) / MPL-2.0 (cssparser) |
| rbook | EPUB 2/3 OPF metadata + reading-order text | Apache-2.0 |
| id3 | MP3 ID3v1/v2 tags | MIT |
| zip | ZIP / Office / EPUB container walking | MIT |
| tar, flate2 | TAR walking + deflate | MIT / Apache-2.0 |
Dependency licensing
omniparse is MIT OR Apache-2.0. To keep the dependency tree free of
unexpected copyleft, the repo ships a deny.toml policy
enforced in CI (cargo deny):
- Only permissive licenses are allowed; GPL / AGPL / standalone LGPL are rejected. A dependency carrying one fails the build.
- Weak file-level MPL-2.0 (e.g. cssparser) is allowed only for the specific crates known to use it, via scoped exceptions; a new copyleft dependency still trips the check.
- The check runs with all features enabled, so optional stacks (PDF, OCR) are covered too.
This was added after the epub crate (GPL-3.0) was found shipping in the
default feature set; it has since been replaced by rbook (Apache-2.0).