Omniparse
A Rust toolkit for detecting and extracting metadata, text, and content from hundreds of different file formats. Omniparse provides both a command-line interface and a library API, serving as a Rust equivalent to Apache Tika.
Features
- Automatic Type Detection: Identifies file types using magic bytes, content analysis, and extension fallback
- Multiple Format Support: Extracts content from text, document, image, and archive formats
- Rich Metadata Extraction: Retrieves format-specific metadata including title, author, dates, and more
- Dual Interface: Use as a CLI tool or integrate as a library in your Rust applications
- Pure Rust Implementation: Minimal dependencies, no external system libraries required
- Async Support: Optional async API for non-blocking operations
- Parallel Processing: Batch process multiple files in parallel for better performance
- Streaming Support: Memory-efficient processing of large files
Supported Formats
Text Formats
- Plain Text (TXT)
- JSON
- CSV/TSV
- XML
- HTML
- CSS
- RTF (Rich Text Format)
Document Formats
- Microsoft Word (DOCX, DOC)
- Microsoft Excel (XLSX, XLS)
- Microsoft PowerPoint (PPTX, PPT)
- OpenDocument Text (ODT)
- OpenDocument Spreadsheet (ODS)
- OpenDocument Presentation (ODP)
Image Formats
- JPEG (with EXIF metadata)
- PNG (with metadata chunks)
- TIFF (with tags)
Archive Formats
- ZIP
- TAR
Installation
As a Library
Add Omniparse to your Cargo.toml:
[]
= "0.1"
For async support:
[]
= { = "0.1", = ["async"] }
For parallel processing:
[]
= { = "0.1", = ["parallel"] }
As a CLI Tool
Install using Cargo:
Or build from source:
The binary will be available at target/release/omniparse.
Library Usage
Basic Extraction
use extract_from_path;
Extract from Bytes
use extract_from_bytes;
Async Extraction
use extract_from_path_async;
async
Check Supported Formats
use ;
Batch Processing
use Extractor;
use process_files_parallel;
CLI Usage
Basic Extraction
# Extract from a single file
# Extract from multiple files
Output Formats
# JSON output
# YAML output
# Save to file
Metadata Only
# Extract only metadata, no content
Type Detection Only
# Detect file type without extraction
Parallel Processing
# Process multiple files in parallel
Verbose Output
# Enable verbose logging
Combined Options
# Metadata only, JSON format, parallel processing
Format-Specific Examples
# Extract from HTML files (web pages)
# Extract from CSS files (stylesheets)
# Extract from RTF files (rich text)
# Extract from spreadsheets (Excel and OpenDocument)
# Extract from presentations (PowerPoint and OpenDocument)
# Extract from legacy Office files (DOC, XLS, PPT)
# Mixed format batch processing
Error Handling
Omniparse provides detailed error types for different failure scenarios:
use ;
match extract_from_path
New Format Support
Omniparse has recently added support for 9 additional document formats:
Web Formats
- HTML: Extract visible text and metadata from web pages
- CSS: Analyze stylesheets with rule and selector counting
Office Formats
- XLSX/XLS: Extract data from Excel spreadsheets (modern and legacy)
- PPTX/PPT: Extract text from PowerPoint presentations (modern and legacy)
- DOC: Extract content from legacy Word documents
OpenDocument Formats
- ODS: Extract data from OpenDocument spreadsheets
- ODP: Extract text from OpenDocument presentations
Rich Text
- RTF: Extract plain text from Rich Text Format files
See SUPPORTED_FORMATS.md for detailed information about each format.
Performance
Omniparse is designed for performance:
- Streaming: Large files are processed using streaming to limit memory usage
- Parallel Processing: Batch operations can leverage multiple CPU cores
- Pure Rust: No FFI overhead or external process spawning
- Efficient Detection: Magic byte detection is fast and accurate
Typical performance on standard hardware:
- Text files (10 MB): < 100ms
- HTML files (1 MB): < 100ms (actual: ~0.6ms)
- PDF documents: 200-500ms depending on size
- XLSX files (10K cells): < 500ms (actual: ~0.9ms for small files)
- PPTX files (100 slides): < 1000ms (actual: ~0.6ms for small files)
- Image metadata: < 50ms
All performance targets met or exceeded. See FINAL_PERFORMANCE_SUMMARY.md for comprehensive benchmark results.
Architecture
Omniparse follows a modular architecture:
┌─────────────────┐
│ CLI / API │
└────────┬────────┘
│
┌────────▼────────┐
│ Extractor │
└────┬───────┬────┘
│ │
┌────▼───┐ ┌▼──────────┐
│Detector│ │ Registry │
└────────┘ └─────┬──────┘
│
┌───────┴───────┐
│ Parsers │
├───────────────┤
│ Text │
│ Document │
│ Image │
│ Archive │
└───────────────┘
- Extractor: Orchestrates detection and parsing
- Detector: Identifies file types using multiple methods
- Registry: Manages available parsers
- Parsers: Format-specific extraction implementations
Documentation
- SUPPORTED_FORMATS.md - Complete list of supported formats with detailed information
- CLI_NEW_FORMATS_GUIDE.md - Comprehensive CLI guide for all newly added formats
- MIGRATION_GUIDE.md - Guide for upgrading to the latest version with new format support
- examples/ - Working code examples for all formats
- API Documentation - Run
cargo doc --openfor detailed API docs
Contributing
Contributions are welcome! Areas for contribution:
- Adding support for new file formats
- Improving type detection accuracy
- Performance optimizations
- Documentation improvements
- Bug fixes
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Acknowledgments
Inspired by Apache Tika, the Java-based content analysis toolkit.