omniparse 0.4.1

A Rust toolkit for detecting and extracting metadata, text, and content from various file formats
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
# Omniparse

A Rust toolkit for detecting and extracting metadata, text, and content from hundreds of different file formats. Omniparse provides both a command-line interface and a library API, serving as a Rust equivalent to Apache Tika.

## Features

- **Automatic Type Detection**: Identifies file types using magic bytes, content analysis, and extension fallback
- **Multiple Format Support**: Extracts content from 25+ formats across text, document, image, audio, and archive categories
- **Rich Metadata Extraction**: Full EXIF for JPEG/TIFF, OpenGraph / Twitter / canonical for HTML, ID3 for MP3, OPF for EPUB, version/encryption/forms/annotations for PDF, and more
- **OCR Subsystem**: Optional classical and ML OCR pipelines for images and scanned PDFs. Pure Rust. Models download on first use for the ML backend (or pre-fetch via `omniparse models download`); classical backend has no external dependencies.
- **Production Web Service**: Ship-ready Axum example (`examples/web_service_prod.rs`) with Cloud Logging JSON, Prometheus `/metrics`, liveness/readiness probes, request limits, panic catcher, graceful shutdown — baked into the published Docker image and one-command deployable to Google Cloud Run via `deploy/cloud-run/deploy.sh`.
- **Robust PDF parsing**: Four-tier fallback chain (strict via `lopdf` → trailing-junk repair → raw stream-byte scan with FlateDecode/LZWDecode/ASCII85Decode → optional `pdf-extract` for linearized / Identity-H + /ToUnicode CMap PDFs). Real-world inputs from Lucidchart, Word print-to-PDF, browser print-to-PDF, truncated downloads — all yield text instead of `"Invalid file trailer"`. A `pdf_parse_strategy` metadata field surfaces which tier ran.
- **Dual Interface**: Use as a CLI tool or integrate as a library in your Rust applications
- **Pure Rust Implementation**: Minimal dependencies, no external system libraries required
- **Async Support**: Optional async API for non-blocking operations
- **Parallel Processing**: Batch process multiple files in parallel for better performance
- **Streaming Support**: Memory-efficient processing of large files
- **Security Hardening**: ZIP-bomb detection, XML entity limits, archive path-traversal detection, strict prototype validation

## Supported Formats

### Text Formats
- Plain Text (TXT)
- JSON
- CSV/TSV
- XML
- HTML (OpenGraph, Twitter Card, canonical URL, viewport, heading counts)
- CSS
- RTF (Rich Text Format)
- Markdown (via `pulldown-cmark`, optional `markdown` feature, default on)

### Document Formats
- PDF
- Microsoft Word (DOCX, DOC)
- Microsoft Excel (XLSX, XLS)
- Microsoft PowerPoint (PPTX, PPT)
- OpenDocument Text (ODT)
- OpenDocument Spreadsheet (ODS)
- OpenDocument Presentation (ODP)

### Document Formats (added)
- EPUB (OPF metadata, spine walk, chapter text — optional `epub` feature)

### Image Formats
- JPEG (full EXIF via `kamadak-exif`, optional OCR)
- PNG (text chunks including decompressed zTXt/iTXt, optional OCR)
- TIFF (EXIF via shared helper, optional OCR)
- SVG (title, desc, viewBox, text nodes, element counts — optional `svg` feature)
- WebP (dimensions, EXIF, optional OCR — optional `webp` feature)

### Audio Formats
- MP3 (ID3v1/v2 tags — title, artist, album, genre, year, track, duration — optional `mp3` feature)

### Archive Formats
- ZIP (with path-traversal detection via `contains_unsafe_paths` metadata)
- TAR (with path-traversal detection)

## Installation

### As a Library

Add Omniparse to your `Cargo.toml`:

```toml
[dependencies]
omniparse = "0.4"
```

For async support:

```toml
[dependencies]
omniparse = { version = "0.4", features = ["async"] }
```

For parallel processing:

```toml
[dependencies]
omniparse = { version = "0.4", features = ["parallel"] }
```

For broader PDF coverage (Lucidchart / Word print-to-PDF / linearized
PDFs that the default lopdf-based tiers can't load):

```toml
[dependencies]
omniparse = { version = "0.4", features = ["pdf-extract"] }
```

To build without PDF support at all (smaller dependency tree):

```toml
[dependencies]
omniparse = { version = "0.4", default-features = false, features = ["markdown", "svg", "webp", "epub", "mp3"] }
```

Full feature reference: see [Cargo features](#cargo-features) in this
README or the table in `cargo doc --open`.

### Cargo features

| Feature        | Default | Purpose                                                                  |
| -------------- | ------- | ------------------------------------------------------------------------ |
| `pdf`          | **on**  | PDF parsing via `lopdf` + lenient raw_scan fallback (Flate/LZW/ASCII85)  |
| `pdf-extract`  | off     | 4th-tier PDF fallback via `pdf-extract` (linearized / Identity-H CMaps)  |
| `markdown`     | **on**  | Markdown parser                                                          |
| `svg`          | **on**  | SVG parser                                                               |
| `webp`         | **on**  | WebP parser                                                              |
| `epub`         | **on**  | EPUB parser                                                              |
| `mp3`          | **on**  | MP3 ID3v1/v2 parser                                                      |
| `async`        | off     | tokio-backed `extract_from_path_async`                                   |
| `parallel`     | off     | rayon-backed `process_files_parallel`                                    |
| `ocr`          | off     | Classical OCR pipeline (pure Rust, no external deps)                     |
| `ocr-ml`       | off     | ML OCR backend via `ocrs` + `rten`, models auto-downloaded               |
| `ocr-train`    | off     | TTF/OTF → prototype trainer for the classical OCR pipeline               |
| `ocr-parallel` | off     | Parallel per-region OCR recognition (implies `ocr` + `parallel`)         |

### OCR — quickstart

Two backends; one env var selects which runs.

```sh
# ML backend (recommended for photos / screenshots / unknown typography)
cargo install omniparse --features ocr-ml
omniparse models download              # one-time, ~12 MB to user cache
OMNIPARSE_OCR=ml omniparse photo.jpg

# Classical backend (pure Rust, no downloads — clean printed scans only)
cargo install omniparse --features ocr
OMNIPARSE_OCR=classical omniparse scan.png
```

Prefer a container? `docker run --rm -p 3000:3000 ghcr.io/sirhco/omniparse-web:latest`
launches the Axum web service with ML models baked in (see
[`Dockerfile`](Dockerfile) and [`examples/WEB_SERVICE_GUIDE.md`](examples/WEB_SERVICE_GUIDE.md)).

📖 **[Full OCR Guide →](OCR_GUIDE.md)** — backend chooser, model-cache CLI,
training custom prototypes, tuning, debugging, library API, FAQ.

### As a CLI Tool

Install using Cargo:

```bash
cargo install omniparse
```

Or build from source:

```bash
git clone https://github.com/omniparse/omniparse
cd omniparse
cargo build --release
```

The binary will be available at `target/release/omniparse`.

## Library Usage

### Basic Extraction

```rust
use omniparse::extract_from_path;

fn main() -> Result<(), omniparse::Error> {
    // Extract from a file
    let result = extract_from_path("document.pdf")?;
    
    println!("MIME type: {}", result.mime_type);
    println!("Confidence: {:.2}", result.detection_confidence);
    
    // Access content
    if let omniparse::Content::Text(text) = result.content {
        println!("Text content: {}", text);
    }
    
    // Access metadata
    if let Some(title) = result.metadata.title() {
        println!("Title: {}", title);
    }
    if let Some(author) = result.metadata.author() {
        println!("Author: {}", author);
    }
    
    Ok(())
}
```

### Extract from Bytes

```rust
use omniparse::extract_from_bytes;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read("file.json")?;
    
    // With automatic type detection
    let result = extract_from_bytes(&data, None)?;
    
    // Or with a MIME type hint
    let result = extract_from_bytes(&data, Some("application/json"))?;
    
    println!("Detected: {}", result.mime_type);
    Ok(())
}
```

### Async Extraction

```rust
use omniparse::extract_from_path_async;

#[tokio::main]
async fn main() -> Result<(), omniparse::Error> {
    let result = extract_from_path_async("document.pdf").await?;
    println!("Extracted: {}", result.mime_type);
    Ok(())
}
```

### Check Supported Formats

```rust
use omniparse::{supported_mime_types, is_mime_supported};

fn main() {
    // Get all supported MIME types
    let types = supported_mime_types();
    println!("Supported formats: {}", types.len());
    
    // Check if a specific format is supported
    if is_mime_supported("application/pdf") {
        println!("PDF is supported!");
    }
}
```

### Batch Processing

```rust
use omniparse::core::Extractor;
use omniparse::utils::parallel::process_files_parallel;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = Extractor::new();
    let files = vec!["file1.pdf", "file2.docx", "file3.txt"];
    
    // Process files in parallel
    let results = process_files_parallel(&extractor, &files);
    
    for file_result in results {
        match file_result.result {
            Ok(extraction) => {
                println!("{}: {} (confidence: {:.2})",
                    file_result.path,
                    extraction.mime_type,
                    extraction.detection_confidence
                );
            }
            Err(e) => {
                eprintln!("{}: Error - {}", file_result.path, e);
            }
        }
    }
    
    Ok(())
}
```

## CLI Usage

### Basic Extraction

```bash
# Extract from a single file
omniparse document.pdf

# Extract from multiple files
omniparse file1.txt file2.docx file3.pdf
```

### Output Formats

```bash
# JSON output
omniparse --format json document.pdf

# YAML output
omniparse --format yaml document.pdf

# Save to file
omniparse --output results.json --format json document.pdf
```

### Metadata Only

```bash
# Extract only metadata, no content
omniparse --metadata-only document.pdf
```

### Type Detection Only

```bash
# Detect file type without extraction
omniparse --detect-only unknown_file.bin
```

### Parallel Processing

```bash
# Process multiple files in parallel
omniparse --parallel *.pdf
```

### Verbose Output

```bash
# Enable verbose logging
omniparse --verbose file1.pdf file2.pdf file3.pdf
```

### Combined Options

```bash
# Metadata only, JSON format, parallel processing
omniparse --metadata-only --format json --parallel --output metadata.json *.pdf
```

### Format-Specific Examples

```bash
# Extract from HTML files (web pages)
omniparse webpage.html index.htm
omniparse --format json --metadata-only page.html

# Extract from CSS files (stylesheets)
omniparse styles.css theme.css
omniparse --format json stylesheet.css  # Get rule and selector counts

# Extract from RTF files (rich text)
omniparse document.rtf letter.rtf
omniparse --metadata-only report.rtf

# Extract from spreadsheets (Excel and OpenDocument)
omniparse data.xlsx spreadsheet.xls budget.ods
omniparse --format json --output data.json financial.xlsx
omniparse --parallel *.xlsx *.xls *.ods  # Process multiple spreadsheets

# Extract from presentations (PowerPoint and OpenDocument)
omniparse slides.pptx presentation.ppt deck.odp
omniparse --metadata-only quarterly-review.pptx  # Get slide count and metadata
omniparse --format json --output slides.json presentation.pptx

# Extract from legacy Office files (DOC, XLS, PPT)
omniparse document.doc old-report.doc
omniparse spreadsheet.xls data-2010.xls
omniparse presentation.ppt slides-archive.ppt

# Mixed format batch processing
omniparse --parallel --format json --output results.json *.html *.css *.rtf *.xlsx *.pptx
```

## Error Handling

Omniparse provides detailed error types for different failure scenarios:

```rust
use omniparse::{extract_from_path, Error};

match extract_from_path("file.xyz") {
    Ok(result) => {
        println!("Success: {}", result.mime_type);
    }
    Err(Error::UnsupportedFormat(mime)) => {
        eprintln!("Format {} is not supported", mime);
    }
    Err(Error::Io(e)) => {
        eprintln!("IO error: {}", e);
    }
    Err(Error::CorruptedFile(msg)) => {
        eprintln!("File is corrupted: {}", msg);
    }
    Err(Error::PartialExtraction { message, partial_result }) => {
        eprintln!("Warning: {}", message);
        println!("Partial content available: {:?}", partial_result.content);
    }
    Err(e) => {
        eprintln!("Error: {}", e);
    }
}
```

## New Format Support

Omniparse has recently added support for 9 additional document formats:

### Web Formats
- **HTML**: Extract visible text and metadata from web pages
- **CSS**: Analyze stylesheets with rule and selector counting

### Office Formats
- **XLSX/XLS**: Extract data from Excel spreadsheets (modern and legacy)
- **PPTX/PPT**: Extract text from PowerPoint presentations (modern and legacy)
- **DOC**: Extract content from legacy Word documents

### OpenDocument Formats
- **ODS**: Extract data from OpenDocument spreadsheets
- **ODP**: Extract text from OpenDocument presentations

### Rich Text
- **RTF**: Extract plain text from Rich Text Format files

See [SUPPORTED_FORMATS.md](SUPPORTED_FORMATS.md) for detailed information about each format.

## Performance

Omniparse is designed for performance:

- **Streaming**: Large files are processed using streaming to limit memory usage
- **Parallel Processing**: Batch operations can leverage multiple CPU cores
- **Pure Rust**: No FFI overhead or external process spawning
- **Efficient Detection**: Magic byte detection is fast and accurate

Typical performance on standard hardware:
- Text files (10 MB): < 100ms
- HTML files (1 MB): < 100ms (actual: ~0.6ms)
- PDF documents: 200-500ms depending on size
- XLSX files (10K cells): < 500ms (actual: ~0.9ms for small files)
- PPTX files (100 slides): < 1000ms (actual: ~0.6ms for small files)
- Image metadata: < 50ms

**All performance targets met or exceeded.** See [FINAL_PERFORMANCE_SUMMARY.md](FINAL_PERFORMANCE_SUMMARY.md) for comprehensive benchmark results.

## Architecture

Omniparse follows a modular architecture:

```
┌─────────────────┐
│   CLI / API     │
└────────┬────────┘
┌────────▼────────┐
│   Extractor     │
└────┬───────┬────┘
     │       │
┌────▼───┐ ┌▼──────────┐
│Detector│ │  Registry  │
└────────┘ └─────┬──────┘
         ┌───────┴───────┐
         │    Parsers    │
         ├───────────────┤
         │ Text          │
         │ Document      │
         │ Image         │
         │ Archive       │
         └───────────────┘
```

- **Extractor**: Orchestrates detection and parsing
- **Detector**: Identifies file types using multiple methods
- **Registry**: Manages available parsers
- **Parsers**: Format-specific extraction implementations

## Documentation

### Version 0.4 (current)

- **[RELEASE_NOTES_v0.4.0.md]RELEASE_NOTES_v0.4.0.md** - `omniparse models` CLI, unified `OMNIPARSE_OCR` env var, Dockerfile + GHCR image, production Cloud Run example
- **[OCR_GUIDE.md]OCR_GUIDE.md** - Single canonical OCR reference: backend chooser, model-cache CLI, training, tuning, debugging
- **[examples/WEB_SERVICE_GUIDE.md]examples/WEB_SERVICE_GUIDE.md** - Web service guide: minimal demo + production example + Cloud Run deploy
- **[CHANGELOG.md]CHANGELOG.md** - Full changelog

### Version 0.3

- **[RELEASE_NOTES_v0.3.0.md]RELEASE_NOTES_v0.3.0.md** - v0.3.0 enhancements, feature flags, env var reference
- **[MIGRATION_v0.3.0.md]MIGRATION_v0.3.0.md** - Upgrade guide from v0.2.x

### General

- **[SUPPORTED_FORMATS.md]SUPPORTED_FORMATS.md** - Complete list of supported formats
- **[examples/]examples/** - Working code examples for all formats and OCR modes
- **API Documentation** - Run `cargo doc --open --features "ocr-ml ocr-train"` for full API docs

### Historical

- **[CLI_NEW_FORMATS_GUIDE.md]CLI_NEW_FORMATS_GUIDE.md** - v0.2 CLI guide for initially-added formats
- **[MIGRATION_GUIDE.md]MIGRATION_GUIDE.md** - v0.2 migration guide

## Contributing

Contributions are welcome! Areas for contribution:

- Adding support for new file formats
- Improving type detection accuracy
- Performance optimizations
- Documentation improvements
- Bug fixes

## License

Licensed under either of:

- Apache License, Version 2.0 ([LICENSE-APACHE]LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT]LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

## Acknowledgments

Inspired by [Apache Tika](https://tika.apache.org/), the Java-based content analysis toolkit.

### Core dependencies

Pure-Rust crates carrying the heavy lifting. Every license below is permissive and compatible with omniparse's MIT/Apache-2.0 dual license. The only copyleft is `cssparser`'s **MPL-2.0**, which is weak/file-level — it covers only that crate's own files and does not affect omniparse's license. A `deny.toml` policy enforces this (no GPL/AGPL allowed); see [Dependency licensing](#dependency-licensing) below.

| Crate                                                       | Used for                                                                | License             |
| ----------------------------------------------------------- | ----------------------------------------------------------------------- | ------------------- |
| [`lopdf`]https://crates.io/crates/lopdf                   | Strict-tier PDF parsing (xref / trailer / object dictionary, embedded-image extraction for OCR)            | MIT                 |
| [`pdf-extract`]https://crates.io/crates/pdf-extract       | 4th-tier PDF fallback for linearized / Identity-H + /ToUnicode CMap PDFs (Lucidchart, Word print-to-PDF). Behind the `pdf-extract` feature | MIT                 |
| [`weezl`]https://crates.io/crates/weezl                   | LZWDecode stream filter in the raw_scan PDF fallback                    | MIT / Apache-2.0    |
| [`ascii85`]https://crates.io/crates/ascii85               | ASCII85Decode stream filter in the raw_scan PDF fallback                | MIT / Apache-2.0    |
| [`ocrs`]https://crates.io/crates/ocrs + [`rten`]https://crates.io/crates/rten | ML OCR backend (text-detection + text-recognition models)         | MIT / Apache-2.0    |
| [`image`]https://crates.io/crates/image                   | Image decode                                                            | MIT / Apache-2.0    |
| [`kamadak-exif`]https://crates.io/crates/kamadak-exif     | EXIF metadata                                                           | BSD-2-Clause        |
| [`calamine`]https://crates.io/crates/calamine             | XLSX / XLS / ODS parsing                                                | MIT                 |
| [`scraper`]https://crates.io/crates/scraper + [`cssparser`]https://crates.io/crates/cssparser | HTML + CSS parsing                                              | ISC (scraper) / MPL-2.0 (cssparser) |
| [`rbook`]https://crates.io/crates/rbook                   | EPUB 2/3 OPF metadata + reading-order text                             | Apache-2.0          |
| [`id3`]https://crates.io/crates/id3                       | MP3 ID3v1/v2 tags                                                       | MIT                 |
| [`zip`]https://crates.io/crates/zip                       | ZIP / Office / EPUB container walking                                   | MIT                 |
| [`tar`]https://crates.io/crates/tar, [`flate2`]https://crates.io/crates/flate2 | TAR walking + deflate                                           | MIT / Apache-2.0    |

### Dependency licensing

omniparse is `MIT OR Apache-2.0`. To keep the dependency tree free of
unexpected copyleft, the repo ships a [`deny.toml`](deny.toml) policy
enforced in CI ([`cargo deny`](https://github.com/EmbarkStudios/cargo-deny)):

- Only permissive licenses are allowed; **GPL / AGPL / standalone LGPL are
  rejected.** A dependency carrying one fails the build.
- Weak file-level **MPL-2.0** (e.g. `cssparser`) is allowed only for the
  specific crates known to use it, via scoped exceptions — a *new* copyleft
  dependency still trips the check.
- The check runs with all features enabled, so optional stacks (PDF, OCR)
  are covered too.

This was added after the `epub` crate (GPL-3.0) was found shipping in the
default feature set; it has since been replaced by `rbook` (Apache-2.0).