omniparse 0.3.0

Omniparse

A Rust toolkit for detecting file types and extracting metadata, text, and content from more than 25 file formats. Omniparse provides both a command-line interface and a library API, and aims to be a Rust counterpart to Apache Tika.

Features

  • Automatic Type Detection: Identifies file types using magic bytes, content analysis, and extension fallback
  • Multiple Format Support: Extracts content from 25+ formats across text, document, image, audio, and archive categories
  • Rich Metadata Extraction: Full EXIF for JPEG/TIFF, OpenGraph / Twitter / canonical for HTML, ID3 for MP3, OPF for EPUB, version/encryption/forms/annotations for PDF, and more
  • OCR Subsystem (v0.3): Optional classical and ML OCR pipelines for images and scanned PDFs. Pure Rust. Models download on first use for the ML backend; classical backend has no external dependencies.
  • Dual Interface: Use as a CLI tool or integrate as a library in your Rust applications
  • Pure Rust Implementation: Minimal dependencies, no external system libraries required
  • Async Support: Optional async API for non-blocking operations
  • Parallel Processing: Batch process multiple files in parallel for better performance
  • Streaming Support: Memory-efficient processing of large files
  • Security Hardening: ZIP-bomb detection, XML entity limits, archive path-traversal detection, strict prototype validation
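
The detection order described above (magic bytes first, extension as a fallback) can be illustrated with a small standalone sketch. It uses only the standard library and is not Omniparse's implementation: the function name detect_mime, the handful of signatures, and the confidence values are all assumptions chosen for illustration.

```rust
use std::path::Path;

/// Hypothetical helper (not omniparse's API): guess a MIME type and a
/// confidence score from leading bytes, falling back to the extension.
fn detect_mime(data: &[u8], path: Option<&Path>) -> (&'static str, f32) {
    // 1. Magic bytes: well-known signatures at offset 0.
    if data.starts_with(b"%PDF-") {
        return ("application/pdf", 1.0);
    }
    if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        return ("image/png", 1.0);
    }
    if data.starts_with(b"\xFF\xD8\xFF") {
        return ("image/jpeg", 1.0);
    }
    if data.starts_with(b"PK\x03\x04") {
        // ZIP container; telling DOCX/XLSX/EPUB apart would need a look inside.
        return ("application/zip", 0.8);
    }
    // 2. Extension fallback, with a lower (illustrative) confidence.
    let ext = path
        .and_then(|p| p.extension())
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("json") => ("application/json", 0.5),
        Some("csv") => ("text/csv", 0.5),
        Some("txt") => ("text/plain", 0.5),
        _ => ("application/octet-stream", 0.0),
    }
}

fn main() {
    let (mime, confidence) = detect_mime(b"%PDF-1.7\n", None);
    println!("{mime} ({confidence:.2})");
    let (mime, confidence) = detect_mime(b"{}", Some(Path::new("data.json")));
    println!("{mime} ({confidence:.2})");
}
```

Checking signatures before the extension means a mislabeled file (for example a PDF saved as .txt) is still classified by its content.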

Supported Formats

Text Formats

  • Plain Text (TXT)
  • JSON
  • CSV/TSV
  • XML
  • HTML (OpenGraph, Twitter Card, canonical URL, viewport, heading counts)
  • CSS
  • RTF (Rich Text Format)
  • Markdown (via pulldown-cmark, optional markdown feature, default on)

Document Formats

  • PDF
  • Microsoft Word (DOCX, DOC)
  • Microsoft Excel (XLSX, XLS)
  • Microsoft PowerPoint (PPTX, PPT)
  • OpenDocument Text (ODT)
  • OpenDocument Spreadsheet (ODS)
  • OpenDocument Presentation (ODP)

Document Formats (added)

  • EPUB (OPF metadata, spine walk, chapter text — optional epub feature)

Image Formats

  • JPEG (full EXIF via kamadak-exif, optional OCR)
  • PNG (text chunks including decompressed zTXt/iTXt, optional OCR)
  • TIFF (EXIF via shared helper, optional OCR)
  • SVG (title, desc, viewBox, text nodes, element counts — optional svg feature)
  • WebP (dimensions, EXIF, optional OCR — optional webp feature)
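
To make the PNG text-chunk extraction above more concrete, here is a minimal sketch that walks the chunk list and collects uncompressed tEXt entries. It is a standalone illustration, not Omniparse's parser: the names png_text_chunks and chunk are hypothetical, CRCs are not verified, and the compressed zTXt/iTXt variants are skipped because they need a decompressor.

```rust
/// Hypothetical helper: collect (keyword, text) pairs from tEXt chunks.
fn png_text_chunks(data: &[u8]) -> Vec<(String, String)> {
    const SIG: &[u8; 8] = b"\x89PNG\r\n\x1a\n";
    let mut out = Vec::new();
    if !data.starts_with(SIG) {
        return out;
    }
    let mut pos = SIG.len();
    // Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= data.len() {
        let len = u32::from_be_bytes([data[pos], data[pos + 1], data[pos + 2], data[pos + 3]]) as usize;
        let kind = &data[pos + 4..pos + 8];
        let start = pos + 8;
        // Stop on truncated input instead of reading out of bounds.
        let Some(end) = start
            .checked_add(len)
            .filter(|e| e.checked_add(4).map_or(false, |c| c <= data.len()))
        else {
            break;
        };
        if kind == b"tEXt" {
            let body = &data[start..end];
            if let Some(nul) = body.iter().position(|&b| b == 0) {
                // tEXt is Latin-1, so each byte maps directly to a char.
                let key: String = body[..nul].iter().map(|&b| b as char).collect();
                let text: String = body[nul + 1..].iter().map(|&b| b as char).collect();
                out.push((key, text));
            }
        } else if kind == b"IEND" {
            break;
        }
        pos = end + 4; // skip the CRC
    }
    out
}

/// Builds one chunk with a zeroed CRC (fine here, since CRCs are ignored).
fn chunk(kind: &[u8; 4], body: &[u8]) -> Vec<u8> {
    let mut v = (body.len() as u32).to_be_bytes().to_vec();
    v.extend_from_slice(kind);
    v.extend_from_slice(body);
    v.extend_from_slice(&[0; 4]);
    v
}

fn main() {
    let mut png = b"\x89PNG\r\n\x1a\n".to_vec();
    png.extend(chunk(b"tEXt", b"Title\0Hello"));
    png.extend(chunk(b"IEND", b""));
    for (key, text) in png_text_chunks(&png) {
        println!("{key}: {text}");
    }
}
```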

Audio Formats

  • MP3 (ID3v1/v2 tags — title, artist, album, genre, year, track, duration — optional mp3 feature)
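
As an illustration of the simpler of the two tag formats, the sketch below reads an ID3v1 tag, which occupies the last 128 bytes of a file and starts with the ASCII marker TAG. It is not Omniparse's parser: the Id3v1 struct and parse_id3v1 function are hypothetical names, and ID3v2 (a variable-length header at the start of the file) is not covered.

```rust
/// Hypothetical container for a subset of ID3v1 fields.
#[derive(Debug, PartialEq)]
struct Id3v1 {
    title: String,
    artist: String,
    album: String,
    year: String,
}

/// Parses the fixed 128-byte ID3v1 trailer, if present.
fn parse_id3v1(data: &[u8]) -> Option<Id3v1> {
    if data.len() < 128 {
        return None;
    }
    let tag = &data[data.len() - 128..];
    if &tag[..3] != b"TAG" {
        return None;
    }
    // Fields are fixed-width, padded with NULs or spaces; bytes are
    // treated as Latin-1 here.
    let field = |range: std::ops::Range<usize>| -> String {
        let raw = &tag[range];
        let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
        raw[..end]
            .iter()
            .map(|&b| b as char)
            .collect::<String>()
            .trim_end()
            .to_string()
    };
    // Layout: "TAG"(3) title(30) artist(30) album(30) year(4) comment(30) genre(1).
    Some(Id3v1 {
        title: field(3..33),
        artist: field(33..63),
        album: field(63..93),
        year: field(93..97),
    })
}

fn main() {
    // Fake "file": some audio bytes followed by a hand-built tag.
    let mut file = vec![0xFFu8; 16];
    let mut tag = vec![0u8; 128];
    tag[..3].copy_from_slice(b"TAG");
    tag[3..8].copy_from_slice(b"Hello");
    tag[93..97].copy_from_slice(b"1999");
    file.extend_from_slice(&tag);
    println!("{:?}", parse_id3v1(&file));
}
```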

Archive Formats

  • ZIP (with path-traversal detection via contains_unsafe_paths metadata)
  • TAR (with path-traversal detection)
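
The kind of check behind a flag such as contains_unsafe_paths can be sketched as follows. This is an assumption about what counts as unsafe (parent-directory components and absolute paths), not a description of Omniparse's actual rules, and the function names are hypothetical.

```rust
use std::path::{Component, Path};

/// Hypothetical check: true if extracting `name` could escape the target directory.
fn is_unsafe_entry(name: &str) -> bool {
    // Treat backslashes as separators so Windows-style names are caught on Unix too.
    let normalized = name.replace('\\', "/");
    Path::new(&normalized).components().any(|c| {
        matches!(
            c,
            Component::ParentDir | Component::RootDir | Component::Prefix(_)
        )
    })
}

/// True if any entry name in the archive listing is unsafe.
fn contains_unsafe_paths<'a>(names: impl IntoIterator<Item = &'a str>) -> bool {
    names.into_iter().any(is_unsafe_entry)
}

fn main() {
    let listing = ["docs/readme.txt", "../../etc/passwd"];
    for name in listing {
        println!("{name}: unsafe = {}", is_unsafe_entry(name));
    }
    println!("archive flagged: {}", contains_unsafe_paths(listing));
}
```

Reporting the result as metadata, rather than refusing to parse, leaves the decision about whether to unpack the archive to the caller.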

Installation

As a Library

Add Omniparse to your Cargo.toml:

[dependencies]
omniparse = "0.3"

For async support:

[dependencies]
omniparse = { version = "0.3", features = ["async"] }

For parallel processing:

[dependencies]
omniparse = { version = "0.3", features = ["parallel"] }

Two OCR backends

v0.3 ships two optional OCR backends. Pick one based on your inputs.

See the Full OCR Guide for training, tuning, debugging, and API examples.

Classical — pure-algorithm pipeline. No ML runtime, no downloads.

[dependencies]
omniparse = { version = "0.3", features = ["ocr"] }

OCR is runtime-opt-in — set OMNIPARSE_OCR=1 (or configure the engine explicitly) to activate it. See examples/ocr_basic.rs.

The bundled recognizer ships with 7×9 bitmap prototypes suitable only for matching clean synthetic text. For real-world photos or documents, train a prototype set from the actual typeface using the ocr-train feature:

[dependencies]
omniparse = { version = "0.3", features = ["ocr-train"] }

# generate prototypes from a font at a specific pixel size
cargo run --features ocr-train --example train_prototypes -- \
    /path/to/Font.ttf prototypes.json 48

# use them at runtime
OMNIPARSE_OCR=1 OMNIPARSE_OCR_PROTOTYPES=prototypes.json \
    cargo run --features ocr --release -- image.jpg

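Tune OMNIPARSE_OCR_MIN_CONFIDENCE=<0.0..=1.0> to trade noise for recall (default 0.15): higher values drop more low-confidence matches, lower values keep more text at the cost of more false positives.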
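
The effect of the threshold can be shown with a short sketch. Only the variable name and the 0.15 default come from this README; the parsing rules (clamping out-of-range values, falling back to the default on invalid input) and the function name parse_min_confidence are assumptions, and Omniparse may treat bad values differently.

```rust
/// Hypothetical parser for the OMNIPARSE_OCR_MIN_CONFIDENCE value.
/// Assumption: invalid input falls back to the default, out-of-range
/// values are clamped into 0.0..=1.0.
fn parse_min_confidence(raw: Option<&str>) -> f32 {
    const DEFAULT: f32 = 0.15;
    match raw.and_then(|s| s.trim().parse::<f32>().ok()) {
        Some(v) if v.is_finite() => v.clamp(0.0, 1.0),
        _ => DEFAULT,
    }
}

fn main() {
    let raw = std::env::var("OMNIPARSE_OCR_MIN_CONFIDENCE").ok();
    let threshold = parse_min_confidence(raw.as_deref());

    // Made-up recognition candidates: (glyph, confidence).
    let candidates = [("A", 0.92_f32), ("l", 0.12), ("B", 0.40)];
    let kept: Vec<&str> = candidates
        .iter()
        .filter(|(_, confidence)| *confidence >= threshold)
        .map(|(glyph, _)| *glyph)
        .collect();
    println!("threshold {threshold:.2} keeps {kept:?}");
}
```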

For photographs where text is overlaid on images, switch the layout analyzer to the Stroke-Width Transform:

OMNIPARSE_OCR=1 OMNIPARSE_OCR_LAYOUT=swt \
    OMNIPARSE_OCR_PROTOTYPES=prototypes.json \
    cargo run --features ocr --release -- photo.jpg

Multi-scale training improves recognition across different rendered sizes:

cargo run --features ocr-train --example train_prototypes -- \
    /path/to/Font.ttf prototypes.json 24,48,96

ML OCR backend (ocr-ml)

For photographic inputs where the classical pipeline's shape-feature classifier can't recover text, enable the ML backend:

[dependencies]
omniparse = { version = "0.3", features = ["ocr-ml"] }

OMNIPARSE_OCR=1 OMNIPARSE_OCR_ML=1 \
    cargo run --features ocr-ml --release -- photo.jpg

Uses ocrs + rten (both pure Rust, MIT). Pre-trained detection + recognition models download once (~30 MB) to the user cache directory. Override the cache location with OMNIPARSE_OCR_MODELS=<path>. No models are bundled in the crate.

As a CLI Tool

Install using Cargo:

cargo install omniparse

Or build from source:

git clone https://github.com/omniparse/omniparse
cd omniparse
cargo build --release

The binary will be available at target/release/omniparse.

Library Usage

Basic Extraction

use omniparse::extract_from_path;

fn main() -> Result<(), omniparse::Error> {
    // Extract from a file
    let result = extract_from_path("document.pdf")?;
    
    println!("MIME type: {}", result.mime_type);
    println!("Confidence: {:.2}", result.detection_confidence);
    
    // Access content
    if let omniparse::Content::Text(text) = &result.content {
        println!("Text content: {}", text);
    }
    
    // Access metadata
    if let Some(title) = result.metadata.title() {
        println!("Title: {}", title);
    }
    if let Some(author) = result.metadata.author() {
        println!("Author: {}", author);
    }
    
    Ok(())
}

Extract from Bytes

use omniparse::extract_from_bytes;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read("file.json")?;
    
    // With automatic type detection
    let result = extract_from_bytes(&data, None)?;
    
    // Or with a MIME type hint
    let result = extract_from_bytes(&data, Some("application/json"))?;
    
    println!("Detected: {}", result.mime_type);
    Ok(())
}

Async Extraction

use omniparse::extract_from_path_async;

#[tokio::main]
async fn main() -> Result<(), omniparse::Error> {
    let result = extract_from_path_async("document.pdf").await?;
    println!("Extracted: {}", result.mime_type);
    Ok(())
}

Check Supported Formats

use omniparse::{supported_mime_types, is_mime_supported};

fn main() {
    // Get all supported MIME types
    let types = supported_mime_types();
    println!("Supported formats: {}", types.len());
    
    // Check if a specific format is supported
    if is_mime_supported("application/pdf") {
        println!("PDF is supported!");
    }
}

Batch Processing

use omniparse::core::Extractor;
use omniparse::utils::parallel::process_files_parallel;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = Extractor::new();
    let files = vec!["file1.pdf", "file2.docx", "file3.txt"];
    
    // Process files in parallel
    let results = process_files_parallel(&extractor, &files);
    
    for file_result in results {
        match file_result.result {
            Ok(extraction) => {
                println!("{}: {} (confidence: {:.2})",
                    file_result.path,
                    extraction.mime_type,
                    extraction.detection_confidence
                );
            }
            Err(e) => {
                eprintln!("{}: Error - {}", file_result.path, e);
            }
        }
    }
    
    Ok(())
}
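
For readers curious about what a parallel batch helper does in general, the following standard-library sketch splits a slice of inputs across scoped threads and returns results in input order. It is not how process_files_parallel is implemented (that is not documented here); map_parallel is a hypothetical name, and the closure stands in for a real extraction call.

```rust
use std::thread;

/// Hypothetical helper: apply `f` to every item using up to `workers`
/// threads, preserving input order in the output.
fn map_parallel<T, R, F>(items: &[T], workers: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    if items.is_empty() {
        return Vec::new();
    }
    let workers = workers.clamp(1, items.len());
    // Ceiling division so every item lands in exactly one chunk.
    let chunk_size = (items.len() + workers - 1) / workers;
    let f = &f;
    thread::scope(|scope| {
        let handles: Vec<_> = items
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(f).collect::<Vec<R>>()))
            .collect();
        // Joining in spawn order keeps results aligned with the inputs.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<R>>()
    })
}

fn main() {
    let files = ["file1.pdf", "file2.docx", "file3.txt"];
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Stand-in for real extraction: report each file name's length.
    let results = map_parallel(&files, workers, |name| (name.to_string(), name.len()));
    for (name, len) in results {
        println!("{name}: {len}");
    }
}
```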

CLI Usage

Basic Extraction

# Extract from a single file
omniparse document.pdf

# Extract from multiple files
omniparse file1.txt file2.docx file3.pdf

Output Formats

# JSON output
omniparse --format json document.pdf

# YAML output
omniparse --format yaml document.pdf

# Save to file
omniparse --output results.json --format json document.pdf

Metadata Only

# Extract only metadata, no content
omniparse --metadata-only document.pdf

Type Detection Only

# Detect file type without extraction
omniparse --detect-only unknown_file.bin

Parallel Processing

# Process multiple files in parallel
omniparse --parallel *.pdf

Verbose Output

# Enable verbose logging
omniparse --verbose file1.pdf file2.pdf file3.pdf

Combined Options

# Metadata only, JSON format, parallel processing
omniparse --metadata-only --format json --parallel --output metadata.json *.pdf

Format-Specific Examples

# Extract from HTML files (web pages)
omniparse webpage.html index.htm
omniparse --format json --metadata-only page.html

# Extract from CSS files (stylesheets)
omniparse styles.css theme.css
omniparse --format json stylesheet.css  # Get rule and selector counts

# Extract from RTF files (rich text)
omniparse document.rtf letter.rtf
omniparse --metadata-only report.rtf

# Extract from spreadsheets (Excel and OpenDocument)
omniparse data.xlsx spreadsheet.xls budget.ods
omniparse --format json --output data.json financial.xlsx
omniparse --parallel *.xlsx *.xls *.ods  # Process multiple spreadsheets

# Extract from presentations (PowerPoint and OpenDocument)
omniparse slides.pptx presentation.ppt deck.odp
omniparse --metadata-only quarterly-review.pptx  # Get slide count and metadata
omniparse --format json --output slides.json presentation.pptx

# Extract from legacy Office files (DOC, XLS, PPT)
omniparse document.doc old-report.doc
omniparse spreadsheet.xls data-2010.xls
omniparse presentation.ppt slides-archive.ppt

# Mixed format batch processing
omniparse --parallel --format json --output results.json *.html *.css *.rtf *.xlsx *.pptx

Error Handling

Omniparse provides detailed error types for different failure scenarios:

use omniparse::{extract_from_path, Error};

fn main() {
    match extract_from_path("file.xyz") {
        Ok(result) => {
            println!("Success: {}", result.mime_type);
        }
        Err(Error::UnsupportedFormat(mime)) => {
            eprintln!("Format {} is not supported", mime);
        }
        Err(Error::Io(e)) => {
            eprintln!("IO error: {}", e);
        }
        Err(Error::CorruptedFile(msg)) => {
            eprintln!("File is corrupted: {}", msg);
        }
        // Partial results are still usable; report the problem and continue.
        Err(Error::PartialExtraction { message, partial_result }) => {
            eprintln!("Warning: {}", message);
            println!("Partial content available: {:?}", partial_result.content);
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}

New Format Support

Omniparse has recently added support for 10 additional formats:

Web Formats

  • HTML: Extract visible text and metadata from web pages
  • CSS: Analyze stylesheets with rule and selector counting

Office Formats

  • XLSX/XLS: Extract data from Excel spreadsheets (modern and legacy)
  • PPTX/PPT: Extract text from PowerPoint presentations (modern and legacy)
  • DOC: Extract content from legacy Word documents

OpenDocument Formats

  • ODS: Extract data from OpenDocument spreadsheets
  • ODP: Extract text from OpenDocument presentations

Rich Text

  • RTF: Extract plain text from Rich Text Format files

See SUPPORTED_FORMATS.md for detailed information about each format.

Performance

Omniparse is designed for performance:

  • Streaming: Large files are processed using streaming to limit memory usage
  • Parallel Processing: Batch operations can leverage multiple CPU cores
  • Pure Rust: No FFI overhead or external process spawning
  • Efficient Detection: Magic byte detection is fast and accurate
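
The streaming point above can be illustrated with a minimal sketch that processes input through a fixed-size buffer, so memory use stays bounded regardless of file size. It is a generic standard-library example rather than Omniparse's code; the 64 KiB buffer size and the function name count_bytes_and_lines are arbitrary choices.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Counts bytes and newline characters without loading the whole input.
fn count_bytes_and_lines<R: Read>(reader: R) -> io::Result<(u64, u64)> {
    // At most 64 KiB is held in memory at any time.
    let mut reader = BufReader::with_capacity(64 * 1024, reader);
    let (mut bytes, mut lines) = (0u64, 0u64);
    loop {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            break; // end of input
        }
        let n = buf.len();
        lines += buf.iter().filter(|&&b| b == b'\n').count() as u64;
        bytes += n as u64;
        // Mark the buffered bytes as processed so the next fill reads more.
        reader.consume(n);
    }
    Ok((bytes, lines))
}

fn main() -> io::Result<()> {
    // Any `Read` works: a file, a socket, or an in-memory slice as here.
    let input: &[u8] = b"first line\nsecond line\n";
    let (bytes, lines) = count_bytes_and_lines(input)?;
    println!("{bytes} bytes, {lines} lines");
    Ok(())
}
```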

Performance targets on standard hardware, with measured values in parentheses where available:

  • Text files (10 MB): < 100ms
  • HTML files (1 MB): < 100ms (actual: ~0.6ms)
  • PDF documents: 200-500ms depending on size
  • XLSX files (10K cells): < 500ms (actual: ~0.9ms for small files)
  • PPTX files (100 slides): < 1000ms (actual: ~0.6ms for small files)
  • Image metadata: < 50ms

All performance targets met or exceeded. See FINAL_PERFORMANCE_SUMMARY.md for comprehensive benchmark results.

Architecture

Omniparse follows a modular architecture:

┌─────────────────┐
│   CLI / API     │
└────────┬────────┘
         │
┌────────▼────────┐
│   Extractor     │
└────┬───────┬────┘
     │       │
┌────▼───┐ ┌─▼──────────┐
│Detector│ │  Registry  │
└────────┘ └─────┬──────┘
                 │
         ┌───────┴───────┐
         │    Parsers    │
         ├───────────────┤
         │ Text          │
         │ Document      │
         │ Image         │
         │ Archive       │
         └───────────────┘

  • Extractor: Orchestrates detection and parsing
  • Detector: Identifies file types using multiple methods
  • Registry: Manages available parsers
  • Parsers: Format-specific extraction implementations
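
How these four pieces fit together can be sketched in a few lines. The trait and type names below (Parser, Registry, PlainText) and the free functions detect and extract are hypothetical and far simpler than the real crate; the sketch only shows the control flow of detect, look up, then parse.

```rust
use std::collections::HashMap;

/// Hypothetical parser interface: bytes in, extracted text out.
trait Parser {
    fn parse(&self, data: &[u8]) -> Result<String, String>;
}

struct PlainText;

impl Parser for PlainText {
    fn parse(&self, data: &[u8]) -> Result<String, String> {
        String::from_utf8(data.to_vec()).map_err(|e| e.to_string())
    }
}

/// Registry: maps MIME types to parser implementations.
struct Registry {
    parsers: HashMap<&'static str, Box<dyn Parser>>,
}

impl Registry {
    fn new() -> Self {
        Self { parsers: HashMap::new() }
    }

    fn register(&mut self, mime: &'static str, parser: Box<dyn Parser>) {
        self.parsers.insert(mime, parser);
    }

    fn get(&self, mime: &str) -> Option<&dyn Parser> {
        self.parsers.get(mime).map(|p| p.as_ref())
    }
}

/// Detector: a toy version that knows one signature plus a UTF-8 check.
fn detect(data: &[u8]) -> &'static str {
    if data.starts_with(b"%PDF-") {
        "application/pdf"
    } else if std::str::from_utf8(data).is_ok() {
        "text/plain"
    } else {
        "application/octet-stream"
    }
}

/// Extractor: detect the type, look up a parser, run it.
fn extract(registry: &Registry, data: &[u8]) -> Result<(String, String), String> {
    let mime = detect(data);
    let parser = registry
        .get(mime)
        .ok_or_else(|| format!("unsupported format: {mime}"))?;
    Ok((mime.to_string(), parser.parse(data)?))
}

fn main() {
    let mut registry = Registry::new();
    registry.register("text/plain", Box::new(PlainText));
    println!("{:?}", extract(&registry, b"hello"));
    println!("{:?}", extract(&registry, b"%PDF-1.7"));
}
```

Keeping parsers behind a registry means a new format can be added by registering one more implementation, without touching detection or orchestration.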

Documentation

Version 0.3 (current)

General

  • SUPPORTED_FORMATS.md - Complete list of supported formats
  • examples/ - Working code examples for all formats and OCR modes
  • API Documentation - Run cargo doc --open --features "ocr-ml ocr-train" for full API docs

Contributing

Contributions are welcome! Areas for contribution:

  • Adding support for new file formats
  • Improving type detection accuracy
  • Performance optimizations
  • Documentation improvements
  • Bug fixes

License

Licensed under either of:

at your option.

Acknowledgments

Inspired by Apache Tika, the Java-based content analysis toolkit.