Omniparse

A Rust toolkit for detecting and extracting metadata, text, and content from hundreds of different file formats. Omniparse provides both a command-line interface and a library API, serving as a Rust equivalent to Apache Tika.

Features

Automatic Type Detection: Identifies file types using magic bytes, content analysis, and extension fallback
Multiple Format Support: Extracts content from 25+ formats across text, document, image, audio, and archive categories
Rich Metadata Extraction: Full EXIF for JPEG/TIFF, OpenGraph / Twitter / canonical for HTML, ID3 for MP3, OPF for EPUB, version/encryption/forms/annotations for PDF, and more
OCR Subsystem (v0.3): Optional classical and ML OCR pipelines for images and scanned PDFs. Pure Rust. Models download on first use for the ML backend; classical backend has no external dependencies.
Dual Interface: Use as a CLI tool or integrate as a library in your Rust applications
Pure Rust Implementation: Minimal dependencies, no external system libraries required
Async Support: Optional async API for non-blocking operations
Parallel Processing: Batch process multiple files in parallel for better performance
Streaming Support: Memory-efficient processing of large files
Security Hardening: ZIP-bomb detection, XML entity limits, archive path-traversal detection, strict prototype validation

Supported Formats

Text Formats

Plain Text (TXT)
JSON
CSV/TSV
XML
HTML (OpenGraph, Twitter Card, canonical URL, viewport, heading counts)
CSS
RTF (Rich Text Format)
Markdown (via pulldown-cmark, optional markdown feature, default on)

Document Formats

PDF
Microsoft Word (DOCX, DOC)
Microsoft Excel (XLSX, XLS)
Microsoft PowerPoint (PPTX, PPT)
OpenDocument Text (ODT)
OpenDocument Spreadsheet (ODS)
OpenDocument Presentation (ODP)

Document Formats (added)

EPUB (OPF metadata, spine walk, chapter text — optional epub feature)

Image Formats

JPEG (full EXIF via kamadak-exif, optional OCR)
PNG (text chunks including decompressed zTXt/iTXt, optional OCR)
TIFF (EXIF via shared helper, optional OCR)
SVG (title, desc, viewBox, text nodes, element counts — optional svg feature)
WebP (dimensions, EXIF, optional OCR — optional webp feature)

Audio Formats

MP3 (ID3v1/v2 tags — title, artist, album, genre, year, track, duration — optional mp3 feature)

Archive Formats

ZIP (with path-traversal detection via contains_unsafe_paths metadata)
TAR (with path-traversal detection)

Installation

As a Library

Add Omniparse to your Cargo.toml:

[dependencies]
omniparse = "0.3"

For async support:

[dependencies]
omniparse = { version = "0.3", features = ["async"] }

For parallel processing:

[dependencies]
omniparse = { version = "0.3", features = ["parallel"] }

Two OCR backends

v0.3 ships two optional OCR backends. Pick one based on your inputs.

📖 Full OCR Guide → — training, tuning, debugging, API examples.

Classical — pure-algorithm pipeline. No ML runtime, no downloads.

[dependencies]
omniparse = { version = "0.3", features = ["ocr"] }

OCR is runtime-opt-in — set OMNIPARSE_OCR=1 (or configure the engine explicitly) to activate it. See examples/ocr_basic.rs.

The bundled recognizer ships with 7×9 bitmap prototypes suitable only for matching clean synthetic text. For real-world photos or documents, train a prototype set from the actual typeface using the ocr-train feature:

[dependencies]
omniparse = { version = "0.3", features = ["ocr-train"] }

# generate prototypes from a font at a specific pixel size
cargo run --features ocr-train --example train_prototypes -- \
    /path/to/Font.ttf prototypes.json 48

# use them at runtime
OMNIPARSE_OCR=1 OMNIPARSE_OCR_PROTOTYPES=prototypes.json \
    cargo run --features ocr --release -- image.jpg

Tune OMNIPARSE_OCR_MIN_CONFIDENCE=<0.0..=1.0> to trade noise for recall (default 0.15).

For photographs where text is overlaid on images, switch the layout analyzer to the Stroke-Width Transform:

OMNIPARSE_OCR=1 OMNIPARSE_OCR_LAYOUT=swt \
    OMNIPARSE_OCR_PROTOTYPES=prototypes.json \
    cargo run --features ocr --release -- photo.jpg

Multi-scale training improves recognition across different rendered sizes:

cargo run --features ocr-train --example train_prototypes -- \
    /path/to/Font.ttf prototypes.json 24,48,96

ML OCR backend (`ocr-ml`)

For photographic inputs where the classical pipeline's shape-feature classifier can't recover text, enable the ML backend:

[dependencies]
omniparse = { version = "0.3", features = ["ocr-ml"] }

OMNIPARSE_OCR=1 OMNIPARSE_OCR_ML=1 \
    cargo run --features ocr-ml --release -- photo.jpg

Uses ocrs + rten (both pure Rust, MIT). Pre-trained detection + recognition models download once (~30 MB) to the user cache directory. Override the cache location with OMNIPARSE_OCR_MODELS=<path>. No models are bundled in the crate.

As a CLI Tool

Install using Cargo:

cargo install omniparse

Or build from source:

git clone https://github.com/omniparse/omniparse
cd omniparse
cargo build --release

The binary will be available at target/release/omniparse.

Library Usage

Basic Extraction

use omniparse::extract_from_path;

fn main() -> Result<(), omniparse::Error> {
    // Extract from a file
    let result = extract_from_path("document.pdf")?;
    
    println!("MIME type: {}", result.mime_type);
    println!("Confidence: {:.2}", result.detection_confidence);
    
    // Access content
    if let omniparse::Content::Text(text) = result.content {
        println!("Text content: {}", text);
    }
    
    // Access metadata
    if let Some(title) = result.metadata.title() {
        println!("Title: {}", title);
    }
    if let Some(author) = result.metadata.author() {
        println!("Author: {}", author);
    }
    
    Ok(())
}

Extract from Bytes

use omniparse::extract_from_bytes;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read("file.json")?;
    
    // With automatic type detection
    let result = extract_from_bytes(&data, None)?;
    
    // Or with a MIME type hint
    let result = extract_from_bytes(&data, Some("application/json"))?;
    
    println!("Detected: {}", result.mime_type);
    Ok(())
}

Async Extraction

use omniparse::extract_from_path_async;

#[tokio::main]
async fn main() -> Result<(), omniparse::Error> {
    let result = extract_from_path_async("document.pdf").await?;
    println!("Extracted: {}", result.mime_type);
    Ok(())
}

Check Supported Formats

use omniparse::{supported_mime_types, is_mime_supported};

fn main() {
    // Get all supported MIME types
    let types = supported_mime_types();
    println!("Supported formats: {}", types.len());
    
    // Check if a specific format is supported
    if is_mime_supported("application/pdf") {
        println!("PDF is supported!");
    }
}

Batch Processing

use omniparse::core::Extractor;
use omniparse::utils::parallel::process_files_parallel;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = Extractor::new();
    let files = vec!["file1.pdf", "file2.docx", "file3.txt"];
    
    // Process files in parallel
    let results = process_files_parallel(&extractor, &files);
    
    for file_result in results {
        match file_result.result {
            Ok(extraction) => {
                println!("{}: {} (confidence: {:.2})",
                    file_result.path,
                    extraction.mime_type,
                    extraction.detection_confidence
                );
            }
            Err(e) => {
                eprintln!("{}: Error - {}", file_result.path, e);
            }
        }
    }
    
    Ok(())
}

CLI Usage

Basic Extraction

# Extract from a single file
omniparse document.pdf

# Extract from multiple files
omniparse file1.txt file2.docx file3.pdf

Output Formats

# JSON output
omniparse --format json document.pdf

# YAML output
omniparse --format yaml document.pdf

# Save to file
omniparse --output results.json --format json document.pdf

Metadata Only

# Extract only metadata, no content
omniparse --metadata-only document.pdf

Type Detection Only

# Detect file type without extraction
omniparse --detect-only unknown_file.bin

Parallel Processing

# Process multiple files in parallel
omniparse --parallel *.pdf

Verbose Output

# Enable verbose logging
omniparse --verbose file1.pdf file2.pdf file3.pdf

Combined Options

# Metadata only, JSON format, parallel processing
omniparse --metadata-only --format json --parallel --output metadata.json *.pdf

Format-Specific Examples

# Extract from HTML files (web pages)
omniparse webpage.html index.htm
omniparse --format json --metadata-only page.html

# Extract from CSS files (stylesheets)
omniparse styles.css theme.css
omniparse --format json stylesheet.css  # Get rule and selector counts

# Extract from RTF files (rich text)
omniparse document.rtf letter.rtf
omniparse --metadata-only report.rtf

# Extract from spreadsheets (Excel and OpenDocument)
omniparse data.xlsx spreadsheet.xls budget.ods
omniparse --format json --output data.json financial.xlsx
omniparse --parallel *.xlsx *.xls *.ods  # Process multiple spreadsheets

# Extract from presentations (PowerPoint and OpenDocument)
omniparse slides.pptx presentation.ppt deck.odp
omniparse --metadata-only quarterly-review.pptx  # Get slide count and metadata
omniparse --format json --output slides.json presentation.pptx

# Extract from legacy Office files (DOC, XLS, PPT)
omniparse document.doc old-report.doc
omniparse spreadsheet.xls data-2010.xls
omniparse presentation.ppt slides-archive.ppt

# Mixed format batch processing
omniparse --parallel --format json --output results.json *.html *.css *.rtf *.xlsx *.pptx

Error Handling

Omniparse provides detailed error types for different failure scenarios:

use omniparse::{extract_from_path, Error};

match extract_from_path("file.xyz") {
    Ok(result) => {
        println!("Success: {}", result.mime_type);
    }
    Err(Error::UnsupportedFormat(mime)) => {
        eprintln!("Format {} is not supported", mime);
    }
    Err(Error::Io(e)) => {
        eprintln!("IO error: {}", e);
    }
    Err(Error::CorruptedFile(msg)) => {
        eprintln!("File is corrupted: {}", msg);
    }
    Err(Error::PartialExtraction { message, partial_result }) => {
        eprintln!("Warning: {}", message);
        println!("Partial content available: {:?}", partial_result.content);
    }
    Err(e) => {
        eprintln!("Error: {}", e);
    }
}

New Format Support

Omniparse has recently added support for 9 additional document formats:

Web Formats

HTML: Extract visible text and metadata from web pages
CSS: Analyze stylesheets with rule and selector counting

Office Formats

XLSX/XLS: Extract data from Excel spreadsheets (modern and legacy)
PPTX/PPT: Extract text from PowerPoint presentations (modern and legacy)
DOC: Extract content from legacy Word documents

OpenDocument Formats

ODS: Extract data from OpenDocument spreadsheets
ODP: Extract text from OpenDocument presentations

Rich Text

RTF: Extract plain text from Rich Text Format files

See SUPPORTED_FORMATS.md for detailed information about each format.

Performance

Omniparse is designed for performance:

Streaming: Large files are processed using streaming to limit memory usage
Parallel Processing: Batch operations can leverage multiple CPU cores
Pure Rust: No FFI overhead or external process spawning
Efficient Detection: Magic byte detection is fast and accurate

Typical performance on standard hardware:

Text files (10 MB): < 100ms
HTML files (1 MB): < 100ms (actual: ~0.6ms)
PDF documents: 200-500ms depending on size
XLSX files (10K cells): < 500ms (actual: ~0.9ms for small files)
PPTX files (100 slides): < 1000ms (actual: ~0.6ms for small files)
Image metadata: < 50ms

All performance targets met or exceeded. See FINAL_PERFORMANCE_SUMMARY.md for comprehensive benchmark results.

Architecture

Omniparse follows a modular architecture:

┌─────────────────┐
│   CLI / API     │
└────────┬────────┘
         │
┌────────▼────────┐
│   Extractor     │
└────┬───────┬────┘
     │       │
┌────▼───┐ ┌▼──────────┐
│Detector│ │  Registry  │
└────────┘ └─────┬──────┘
                 │
         ┌───────┴───────┐
         │    Parsers    │
         ├───────────────┤
         │ Text          │
         │ Document      │
         │ Image         │
         │ Archive       │
         └───────────────┘

Extractor: Orchestrates detection and parsing
Detector: Identifies file types using multiple methods
Registry: Manages available parsers
Parsers: Format-specific extraction implementations

Documentation

Version 0.3 (current)

RELEASE_NOTES_v0.3.0.md - Complete list of v0.3.0 enhancements, feature flags, env var reference
MIGRATION_v0.3.0.md - Upgrade guide from v0.2.x with breaking change details
OCR_GUIDE.md - Full OCR subsystem guide: classical vs ML, training, tuning, debugging
CHANGELOG.md - Full changelog

General

SUPPORTED_FORMATS.md - Complete list of supported formats
examples/ - Working code examples for all formats and OCR modes
API Documentation - Run cargo doc --open --features "ocr-ml ocr-train" for full API docs

Historical

CLI_NEW_FORMATS_GUIDE.md - v0.2 CLI guide for initially-added formats
MIGRATION_GUIDE.md - v0.2 migration guide

Contributing

Contributions are welcome! Areas for contribution:

Adding support for new file formats
Improving type detection accuracy
Performance optimizations
Documentation improvements
Bug fixes

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Acknowledgments

Inspired by Apache Tika, the Java-based content analysis toolkit.

omniparse 0.3.0

Omniparse

Features

Supported Formats

Text Formats

Document Formats

Document Formats (added)

Image Formats

Audio Formats

Archive Formats

Installation

As a Library

Two OCR backends

ML OCR backend (ocr-ml)

As a CLI Tool

Library Usage

Basic Extraction

Extract from Bytes

Async Extraction

Check Supported Formats

Batch Processing

CLI Usage

Basic Extraction

Output Formats

Metadata Only

Type Detection Only

Parallel Processing

Verbose Output

Combined Options

Format-Specific Examples

Error Handling

New Format Support

Web Formats

Office Formats

OpenDocument Formats

Rich Text

Performance

Architecture

Documentation

Version 0.3 (current)

General

Historical

Contributing

License

Acknowledgments

ML OCR backend (`ocr-ml`)