Omniparse
A Rust toolkit for detecting and extracting metadata, text, and content from hundreds of different file formats. Omniparse provides both a command-line interface and a library API, serving as a Rust equivalent to Apache Tika.
Features
- Automatic Type Detection: Identifies file types using magic bytes, content analysis, and extension fallback
- Multiple Format Support: Extracts content from 25+ formats across text, document, image, audio, and archive categories
- Rich Metadata Extraction: Full EXIF for JPEG/TIFF, OpenGraph / Twitter / canonical for HTML, ID3 for MP3, OPF for EPUB, version/encryption/forms/annotations for PDF, and more
- OCR Subsystem (v0.3): Optional classical and ML OCR pipelines for images and scanned PDFs. Pure Rust. Models download on first use for the ML backend; classical backend has no external dependencies.
- Dual Interface: Use as a CLI tool or integrate as a library in your Rust applications
- Pure Rust Implementation: Minimal dependencies, no external system libraries required
- Async Support: Optional async API for non-blocking operations
- Parallel Processing: Batch process multiple files in parallel for better performance
- Streaming Support: Memory-efficient processing of large files
- Security Hardening: ZIP-bomb detection, XML entity limits, archive path-traversal detection, strict prototype validation
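To make the detection order in the first feature concrete, here is a minimal sketch of magic-byte sniffing with an extension fallback. It is not Omniparse's actual detector: the `Kind` enum and the `sniff` and `detect` functions are hypothetical names invented for illustration, only a handful of well-known signatures are covered, and the content-analysis step is omitted.

```rust
/// File kinds recognised by this sketch (a tiny illustrative subset).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Kind {
    Pdf,
    Png,
    Jpeg,
    Zip,
    Json,
    Unknown,
}

/// Step 1: look for a well-known signature at the start of the data.
pub fn sniff(data: &[u8]) -> Option<Kind> {
    if data.starts_with(b"%PDF-") {
        Some(Kind::Pdf)
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(Kind::Png)
    } else if data.starts_with(b"\xFF\xD8\xFF") {
        Some(Kind::Jpeg)
    } else if data.starts_with(b"PK\x03\x04") {
        Some(Kind::Zip)
    } else {
        None
    }
}

/// Step 2: if no signature matched, fall back to the file extension.
pub fn detect(data: &[u8], file_name: Option<&str>) -> Kind {
    if let Some(kind) = sniff(data) {
        return kind;
    }
    let ext = file_name
        .and_then(|name| name.rsplit_once('.'))
        .map(|(_, ext)| ext.to_ascii_lowercase());
    match ext.as_deref() {
        Some("pdf") => Kind::Pdf,
        Some("png") => Kind::Png,
        Some("jpg") | Some("jpeg") => Kind::Jpeg,
        Some("zip") => Kind::Zip,
        Some("json") => Kind::Json,
        _ => Kind::Unknown,
    }
}
```

Checking the signature before the extension means a PDF saved as report.txt is still reported as a PDF in this sketch, while formats without a signature (such as JSON) rely on the extension.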
Supported Formats
Text Formats
- Plain Text (TXT)
- JSON
- CSV/TSV
- XML
- HTML (OpenGraph, Twitter Card, canonical URL, viewport, heading counts)
- CSS
- RTF (Rich Text Format)
- Markdown (via pulldown-cmark, optional markdown feature, default on)
Document Formats
- Microsoft Word (DOCX, DOC)
- Microsoft Excel (XLSX, XLS)
- Microsoft PowerPoint (PPTX, PPT)
- OpenDocument Text (ODT)
- OpenDocument Spreadsheet (ODS)
- OpenDocument Presentation (ODP)
Document Formats (added)
- EPUB (OPF metadata, spine walk, chapter text; optional epub feature)
Image Formats
- JPEG (full EXIF via kamadak-exif, optional OCR)
- PNG (text chunks including decompressed zTXt/iTXt, optional OCR)
- TIFF (EXIF via shared helper, optional OCR)
- SVG (title, desc, viewBox, text nodes, element counts; optional svg feature)
- WebP (dimensions, EXIF, optional OCR; optional webp feature)
Audio Formats
- MP3 (ID3v1/v2 tags: title, artist, album, genre, year, track, duration; optional mp3 feature)
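As an illustration of what ID3 tag extraction involves, the sketch below reads an ID3v1 tag, which is a fixed 128-byte block at the end of the file starting with the ASCII marker TAG. This is not Omniparse's parser: the `Id3v1` struct and `parse_id3v1` function are hypothetical, ID3v2 is not handled, and text is decoded leniently as UTF-8 even though real-world tags are often Latin-1.

```rust
/// Fields of an ID3v1 tag (hypothetical struct for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct Id3v1 {
    pub title: String,
    pub artist: String,
    pub album: String,
    pub year: String,
    pub genre: u8,
}

/// Decode a fixed-width field that is padded with NUL bytes or spaces.
fn field(bytes: &[u8]) -> String {
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    String::from_utf8_lossy(&bytes[..end]).trim_end().to_string()
}

/// Parse the ID3v1 tag from the last 128 bytes of `file`, if present.
pub fn parse_id3v1(file: &[u8]) -> Option<Id3v1> {
    if file.len() < 128 {
        return None;
    }
    let tag = &file[file.len() - 128..];
    if !tag.starts_with(b"TAG") {
        return None;
    }
    // Layout: "TAG" (3), title (30), artist (30), album (30),
    // year (4), comment (30), genre (1).
    Some(Id3v1 {
        title: field(&tag[3..33]),
        artist: field(&tag[33..63]),
        album: field(&tag[63..93]),
        year: field(&tag[93..97]),
        genre: tag[127],
    })
}
```

Track number and duration are not part of this basic layout, so a fuller implementation would also need ID3v1.1 or ID3v2 frames and the audio stream itself.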
Archive Formats
- ZIP (with path-traversal detection via contains_unsafe_paths metadata)
- TAR (with path-traversal detection)
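The path-traversal check mentioned for ZIP and TAR can be sketched as follows. This is an assumption about how such a check might look, not the crate's code: `is_unsafe_entry` is a hypothetical helper, and `contains_unsafe_paths` is used here as a function name only because it echoes the metadata key above.

```rust
/// Returns true if an archive entry name could escape the extraction root.
/// The check is deliberately conservative: any `..` segment is rejected,
/// even when the path would still resolve inside the root.
pub fn is_unsafe_entry(name: &str) -> bool {
    // Treat Windows separators like Unix ones so `..\x` is caught too.
    let normalised = name.replace('\\', "/");
    let bytes = normalised.as_bytes();

    let absolute = normalised.starts_with('/');
    let drive_prefix =
        bytes.len() >= 2 && bytes[0].is_ascii_alphabetic() && bytes[1] == b':';
    let climbs = normalised.split('/').any(|segment| segment == "..");

    absolute || drive_prefix || climbs
}

/// Returns true if any entry name in the archive listing is unsafe.
pub fn contains_unsafe_paths<'a, I>(names: I) -> bool
where
    I: IntoIterator<Item = &'a str>,
{
    names.into_iter().any(is_unsafe_entry)
}
```

Reporting the result as metadata, rather than failing outright, leaves the decision about whether to extract such an archive to the caller.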
Installation
As a Library
Add Omniparse to your Cargo.toml:
[dependencies]
omniparse = "0.3"
For async support:
[dependencies]
omniparse = { version = "0.3", features = ["async"] }
For parallel processing:
[dependencies]
omniparse = { version = "0.3", features = ["parallel"] }
Two OCR backends
v0.3 ships two optional OCR backends. Pick one based on your inputs.
Full OCR guide: see OCR_GUIDE.md for training, tuning, debugging, and API examples.
Classical — pure-algorithm pipeline. No ML runtime, no downloads.
[dependencies]
omniparse = { version = "0.3", features = ["ocr"] }
OCR is runtime-opt-in — set OMNIPARSE_OCR=1 (or configure the engine
explicitly) to activate it. See examples/ocr_basic.rs.
The bundled recognizer ships with 7×9 bitmap prototypes suitable only for
matching clean synthetic text. For real-world photos or documents, train a
prototype set from the actual typeface using the ocr-train feature:
[dependencies]
omniparse = { version = "0.3", features = ["ocr-train"] }
# generate prototypes from a font at a specific pixel size
# use them at runtime
OMNIPARSE_OCR=1 OMNIPARSE_OCR_PROTOTYPES=prototypes.json \
Tune OMNIPARSE_OCR_MIN_CONFIDENCE=<0.0..=1.0> to trade noise for recall
(default 0.15): raising it discards more low-confidence matches, lowering it
keeps more of them.
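The effect of the confidence threshold can be sketched like this. The functions `parse_min_confidence` and `filter_by_confidence` are hypothetical, and falling back to the default for unparsable or out-of-range values is an assumption of this sketch; the crate may handle such input differently.

```rust
/// Default threshold stated in the text above.
pub const DEFAULT_MIN_CONFIDENCE: f32 = 0.15;

/// Parse a raw setting such as the value of OMNIPARSE_OCR_MIN_CONFIDENCE.
/// Assumption: anything missing, unparsable, or outside 0.0..=1.0
/// falls back to the default.
pub fn parse_min_confidence(raw: Option<&str>) -> f32 {
    raw.and_then(|s| s.trim().parse::<f32>().ok())
        .filter(|v| *v >= 0.0 && *v <= 1.0)
        .unwrap_or(DEFAULT_MIN_CONFIDENCE)
}

/// Keep only the recognised strings whose confidence reaches `min`.
pub fn filter_by_confidence<'a>(
    candidates: &[(&'a str, f32)],
    min: f32,
) -> Vec<&'a str> {
    candidates
        .iter()
        .filter(|(_, confidence)| *confidence >= min)
        .map(|(text, _)| *text)
        .collect()
}
```

In a real program the raw value would come from the environment, for example `std::env::var("OMNIPARSE_OCR_MIN_CONFIDENCE").ok()`, passed on as `raw.as_deref()`.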
For photographs where text is overlaid on images, switch the layout analyzer to the Stroke-Width Transform:
OMNIPARSE_OCR=1 OMNIPARSE_OCR_LAYOUT=swt \
OMNIPARSE_OCR_PROTOTYPES=prototypes.json \
Multi-scale training improves recognition across different rendered sizes:
ML OCR backend (ocr-ml)
For photographic inputs where the classical pipeline's shape-feature classifier can't recover text, enable the ML backend:
[dependencies]
omniparse = { version = "0.3", features = ["ocr-ml"] }
OMNIPARSE_OCR=1 OMNIPARSE_OCR_ML=1 \
Uses ocrs + rten (both pure Rust, MIT). Pre-trained detection +
recognition models download once (~30 MB) to the user cache directory.
Override the cache location with OMNIPARSE_OCR_MODELS=<path>. No models
are bundled in the crate.
As a CLI Tool
Install using Cargo:
Or build from source:
The binary will be available at target/release/omniparse.
Library Usage
Basic Extraction
use extract_from_path;
Extract from Bytes
use extract_from_bytes;
Async Extraction
use extract_from_path_async;
async
Check Supported Formats
use ;
Batch Processing
use Extractor;
use process_files_parallel;
CLI Usage
Basic Extraction
# Extract from a single file
# Extract from multiple files
Output Formats
# JSON output
# YAML output
# Save to file
Metadata Only
# Extract only metadata, no content
Type Detection Only
# Detect file type without extraction
Parallel Processing
# Process multiple files in parallel
Verbose Output
# Enable verbose logging
Combined Options
# Metadata only, JSON format, parallel processing
Format-Specific Examples
# Extract from HTML files (web pages)
# Extract from CSS files (stylesheets)
# Extract from RTF files (rich text)
# Extract from spreadsheets (Excel and OpenDocument)
# Extract from presentations (PowerPoint and OpenDocument)
# Extract from legacy Office files (DOC, XLS, PPT)
# Mixed format batch processing
Error Handling
Omniparse provides detailed error types for different failure scenarios:
use ;
match extract_from_path
New Format Support
Omniparse has recently added support for 9 additional document formats:
Web Formats
- HTML: Extract visible text and metadata from web pages
- CSS: Analyze stylesheets with rule and selector counting
Office Formats
- XLSX/XLS: Extract data from Excel spreadsheets (modern and legacy)
- PPTX/PPT: Extract text from PowerPoint presentations (modern and legacy)
- DOC: Extract content from legacy Word documents
OpenDocument Formats
- ODS: Extract data from OpenDocument spreadsheets
- ODP: Extract text from OpenDocument presentations
Rich Text
- RTF: Extract plain text from Rich Text Format files
See SUPPORTED_FORMATS.md for detailed information about each format.
Performance
Omniparse is designed for performance:
- Streaming: Large files are processed using streaming to limit memory usage
- Parallel Processing: Batch operations can leverage multiple CPU cores
- Pure Rust: No FFI overhead or external process spawning
- Efficient Detection: Magic byte detection is fast and accurate
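The idea behind spreading a batch across CPU cores can be illustrated with the standard library alone. The `process_parallel` function below is a hypothetical sketch using scoped threads (Rust 1.63 or later); it is not the crate's parallel API and says nothing about which threading library Omniparse uses internally.

```rust
use std::thread;

/// Apply `work` to every input using up to `workers` scoped threads.
/// Results are returned in the same order as the inputs.
pub fn process_parallel<T, R, F>(inputs: &[T], workers: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    if inputs.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every input lands in exactly one chunk.
    let chunk_len = (inputs.len() + workers - 1) / workers;
    let work = &work;

    thread::scope(|scope| {
        // Spawn one thread per chunk; each returns its own result vector.
        let handles: Vec<_> = inputs
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(work).collect::<Vec<R>>()))
            .collect();

        // Joining in spawn order preserves the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Splitting the input into contiguous chunks keeps the sketch simple; a work-stealing pool would balance better when individual files vary a lot in size.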
Performance targets on standard hardware (measured results, where available, in parentheses):
- Text files (10 MB): < 100ms
- HTML files (1 MB): < 100ms (actual: ~0.6ms)
- PDF documents: 200-500ms depending on size
- XLSX files (10K cells): < 500ms (actual: ~0.9ms for small files)
- PPTX files (100 slides): < 1000ms (actual: ~0.6ms for small files)
- Image metadata: < 50ms
All performance targets are met or exceeded. See FINAL_PERFORMANCE_SUMMARY.md for comprehensive benchmark results.
Architecture
Omniparse follows a modular architecture:
┌─────────────────┐
│    CLI / API    │
└────────┬────────┘
         │
┌────────▼────────┐
│    Extractor    │
└────┬───────┬────┘
     │       │
┌────▼───┐ ┌─▼─────────┐
│Detector│ │ Registry  │
└────────┘ └─────┬─────┘
                 │
         ┌───────┴───────┐
         │    Parsers    │
         ├───────────────┤
         │ Text          │
         │ Document      │
         │ Image         │
         │ Archive       │
         └───────────────┘
- Extractor: Orchestrates detection and parsing
- Detector: Identifies file types using multiple methods
- Registry: Manages available parsers
- Parsers: Format-specific extraction implementations
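The relationship between these four components can be sketched in a few lines. All names and signatures below (`Parser`, `Registry`, `Extractor`, `Extraction`, `detect`) are hypothetical and chosen to mirror the diagram; the crate's real types and error handling will differ.

```rust
use std::collections::HashMap;

/// Result of an extraction in this sketch.
#[derive(Debug)]
pub struct Extraction {
    pub mime: &'static str,
    pub text: String,
}

/// Parsers: format-specific extraction.
pub trait Parser {
    fn parse(&self, data: &[u8]) -> Result<String, String>;
}

/// A trivial parser for UTF-8 plain text.
pub struct PlainText;

impl Parser for PlainText {
    fn parse(&self, data: &[u8]) -> Result<String, String> {
        String::from_utf8(data.to_vec()).map_err(|e| e.to_string())
    }
}

/// Detector: maps raw bytes to a MIME type (two cases only).
pub fn detect(data: &[u8]) -> Option<&'static str> {
    if data.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if std::str::from_utf8(data).is_ok() {
        Some("text/plain")
    } else {
        None
    }
}

/// Registry: looks up the parser registered for a MIME type.
#[derive(Default)]
pub struct Registry {
    parsers: HashMap<&'static str, Box<dyn Parser>>,
}

impl Registry {
    pub fn register(&mut self, mime: &'static str, parser: Box<dyn Parser>) {
        self.parsers.insert(mime, parser);
    }

    pub fn get(&self, mime: &str) -> Option<&dyn Parser> {
        self.parsers.get(mime).map(|p| p.as_ref())
    }
}

/// Extractor: orchestrates detection, lookup, and parsing.
pub struct Extractor {
    registry: Registry,
}

impl Extractor {
    pub fn new(registry: Registry) -> Self {
        Self { registry }
    }

    pub fn extract(&self, data: &[u8]) -> Result<Extraction, String> {
        let mime = detect(data).ok_or_else(|| "unknown file type".to_string())?;
        let parser = self
            .registry
            .get(mime)
            .ok_or_else(|| format!("no parser registered for {mime}"))?;
        Ok(Extraction { mime, text: parser.parse(data)? })
    }
}
```

Keeping detection separate from the registry means a new format only needs a detection rule and a registered parser; the orchestration code stays unchanged.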
Documentation
Version 0.3 (current)
- RELEASE_NOTES_v0.3.0.md - Complete list of v0.3.0 enhancements, feature flags, env var reference
- MIGRATION_v0.3.0.md - Upgrade guide from v0.2.x with breaking change details
- OCR_GUIDE.md - Full OCR subsystem guide: classical vs ML, training, tuning, debugging
- CHANGELOG.md - Full changelog
General
- SUPPORTED_FORMATS.md - Complete list of supported formats
- examples/ - Working code examples for all formats and OCR modes
- API Documentation - Run cargo doc --open --features "ocr-ml ocr-train" for full API docs
Historical
- CLI_NEW_FORMATS_GUIDE.md - v0.2 CLI guide for initially-added formats
- MIGRATION_GUIDE.md - v0.2 migration guide
Contributing
Contributions are welcome! Areas for contribution:
- Adding support for new file formats
- Improving type detection accuracy
- Performance optimizations
- Documentation improvements
- Bug fixes
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Acknowledgments
Inspired by Apache Tika, the Java-based content analysis toolkit.