omniparse 0.4.1

A Rust toolkit for detecting and extracting metadata, text, and content from various file formats
//! # Omniparse — Rust content extraction toolkit
//!
//! Apache-Tika-style detection and extraction for 25+ file formats. Pure
//! Rust, no system libraries, optional async / parallel / OCR.
//!
//! ## Supported formats
//!
//! - **Text**: Plain text, JSON, CSV/TSV, XML, HTML (OpenGraph, Twitter,
//!   canonical URL, heading counts), CSS, RTF, Markdown
//! - **Documents**: PDF (version, encryption, form-field / annotation /
//!   attachment counts), DOCX, DOC, XLSX, XLS, PPTX, PPT, ODT, ODS, ODP,
//!   EPUB
//! - **Images**: JPEG (full EXIF), PNG (decompressed zTXt/iTXt), TIFF,
//!   SVG, WebP. Optional OCR routes image text to `Content::Text`.
//! - **Audio**: MP3 (ID3v1/v2)
//! - **Archives**: ZIP, TAR (with path-traversal detection)
//!
//! See [`SUPPORTED_FORMATS.md`] for per-format metadata keys.
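//!
//! As a rough illustration of what path-traversal detection for archive
//! entries involves, the sketch below flags entry names that are absolute
//! or contain a `..` component. The helper name `is_path_traversal` is
//! hypothetical and this is not Omniparse's actual check, only one
//! conservative way to write it with the standard library.
//!
//! ```rust
//! use std::path::{Component, Path};
//!
//! /// Hypothetical helper: true when an archive entry name could escape the
//! /// extraction directory (absolute path, drive prefix, or any `..`).
//! fn is_path_traversal(entry_name: &str) -> bool {
//!     Path::new(entry_name).components().any(|c| {
//!         matches!(
//!             c,
//!             Component::ParentDir | Component::RootDir | Component::Prefix(_)
//!         )
//!     })
//! }
//! ```
//!
//! Under this rule `docs/readme.txt` passes, while `../etc/passwd` and
//! `/etc/passwd` are flagged. The rule is deliberately strict: `a/../b` is
//! flagged too, even though it resolves inside the root.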
//!
//! ## Cargo features
//!
//! | Feature         | Default | Purpose                               |
//! | --------------- | ------- | ------------------------------------- |
//! | `async`         | off     | Tokio-based async extraction          |
//! | `parallel`      | off     | Rayon-based batch processing          |
//! | `markdown`      | **on**  | Markdown parser                       |
//! | `svg`           | **on**  | SVG parser                            |
//! | `webp`          | **on**  | WebP parser                           |
//! | `epub`          | **on**  | EPUB parser                           |
//! | `mp3`           | **on**  | MP3 parser                            |
//! | `pdf`           | **on**  | PDF parser via `lopdf` + lenient fallback (`weezl` / `ascii85`) |
//! | `pdf-extract`   | off     | 4th-tier PDF fallback via `pdf-extract` (linearized / Identity-H PDFs) |
//! | `ocr`           | off     | Classical OCR pipeline                |
//! | `ocr-train`     | off     | TTF → prototype trainer               |
//! | `ocr-parallel`  | off     | Parallel per-region recognition       |
//! | `ocr-ml`        | off     | ML OCR backend (ocrs + rten)          |
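//!
//! A possible `Cargo.toml` entry that opts into some of the non-default
//! features from the table above. It assumes the crate is pulled from
//! crates.io under the name `omniparse`; the feature selection is only an
//! example.
//!
//! ```toml
//! [dependencies]
//! # Default features stay enabled; `async` and `pdf-extract` are added on top.
//! omniparse = { version = "0.4.1", features = ["async", "pdf-extract"] }
//! ```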
//!
//! ## Acknowledgments
//!
//! Omniparse stands on the shoulders of several pure-Rust libraries. The
//! PDF tier specifically uses:
//!
//! - [`lopdf`](https://crates.io/crates/lopdf) — strict-tier PDF parser
//!   (xref / trailer / object dictionary parse, embedded-image extraction
//!   for the OCR path). MIT licensed.
//! - [`weezl`](https://crates.io/crates/weezl) — LZWDecode filter
//!   support in the raw_scan fallback. MIT/Apache-2.0.
//! - [`ascii85`](https://crates.io/crates/ascii85) — ASCII85Decode filter
//!   support in the raw_scan fallback. MIT/Apache-2.0.
//! - [`pdf-extract`](https://crates.io/crates/pdf-extract) (optional, behind
//!   the `pdf-extract` feature) — 4th-tier text extraction for PDFs that
//!   lopdf can't load. MIT licensed.
//!
//! EPUB support uses:
//!
//! - [`rbook`](https://crates.io/crates/rbook) (behind the `epub` feature) —
//!   EPUB 2/3 OPF metadata + reading-order text extraction. Apache-2.0.
//!
//! See `Cargo.toml` for the full dependency tree and per-crate version
//! pins. A `deny.toml` policy (enforced in CI via `cargo deny`) keeps the
//! dependency tree free of GPL/AGPL copyleft.
//!
//! ## PDF parsing tiers
//!
//! Real-world PDFs are messy — truncated downloads, linearized exports,
//! Identity-H + /ToUnicode CMaps, and appended HTTP-chunk garbage all
//! defeat strict parsers. Omniparse's PDF parser is a four-tier fallback
//! chain so the caller almost always gets text:
//!
//! 1. **strict** — `lopdf::Document::load_mem`. Full metadata + per-page
//!    text. Most well-formed PDFs.
//! 2. **repaired_xref** — truncate trailing bytes after the last `%%EOF`,
//!    retry strict load. Catches HTTP-chunk leftovers / double-`%%EOF`.
//! 3. **raw_scan** — walk `stream`/`endstream` byte ranges, decode
//!    FlateDecode / LZWDecode / ASCII85Decode / uncompressed payloads,
//!    regex-extract `Tj` / `TJ` operators. Recovers text from PDFs
//!    lopdf can't load. Output gated by a "looks-like-text" heuristic
//!    so glyph-index / encrypted bytes don't reach the caller.
//! 4. **pdf_extract** (only with `--features pdf-extract`) — re-parse via
//!    [`pdf-extract`](https://crates.io/crates/pdf-extract). Tolerates
//!    linearized PDFs + Identity-H + /ToUnicode CMaps (Lucidchart, Word
//!    print-to-PDF, browser print-to-PDF).
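//!
//! The core ideas behind tiers 2 and 3 can be sketched with the standard
//! library alone. This is a simplified illustration, not the crate's
//! implementation: the function names are hypothetical, and the `Tj` scan
//! uses a hand-written loop where the real tier is described as regex-based.
//! It handles only literal `(...) Tj` strings in an already-decoded stream.
//!
//! ```rust
//! /// Tier 2 idea: drop any bytes that follow the last `%%EOF` marker.
//! /// Returns `None` when the marker is absent so the caller can fall through.
//! fn truncate_after_last_eof(data: &[u8]) -> Option<&[u8]> {
//!     const MARKER: &[u8] = b"%%EOF";
//!     let start = data.windows(MARKER.len()).rposition(|w| w == MARKER)?;
//!     Some(&data[..start + MARKER.len()])
//! }
//!
//! /// Tier 3 idea: collect literal strings that are followed by a `Tj`
//! /// operator. Hex strings and `TJ` arrays are ignored, and an escaped byte
//! /// is copied verbatim (so `\)` yields `)` but `\n` yields `n`).
//! fn extract_tj_strings(stream: &[u8]) -> Vec<String> {
//!     let mut out = Vec::new();
//!     let mut i = 0;
//!     while i < stream.len() {
//!         if stream[i] != b'(' {
//!             i += 1;
//!             continue;
//!         }
//!         // Read one literal string, tracking nested parentheses.
//!         let mut depth = 1;
//!         let mut text = Vec::new();
//!         i += 1;
//!         while i < stream.len() && depth > 0 {
//!             match stream[i] {
//!                 b'\\' if i + 1 < stream.len() => {
//!                     text.push(stream[i + 1]);
//!                     i += 1;
//!                 }
//!                 b'(' => {
//!                     depth += 1;
//!                     text.push(b'(');
//!                 }
//!                 b')' => {
//!                     depth -= 1;
//!                     if depth > 0 {
//!                         text.push(b')');
//!                     }
//!                 }
//!                 other => text.push(other),
//!             }
//!             i += 1;
//!         }
//!         // Keep the string only if the next token is the `Tj` operator.
//!         let rest = &stream[i..];
//!         let skip = rest
//!             .iter()
//!             .position(|b| !b.is_ascii_whitespace())
//!             .unwrap_or(rest.len());
//!         if rest[skip..].starts_with(b"Tj") {
//!             out.push(String::from_utf8_lossy(&text).into_owned());
//!         }
//!     }
//!     out
//! }
//! ```
//!
//! A caller would try `truncate_after_last_eof` before retrying the strict
//! load, and fall back to scanning decoded streams only if that retry fails.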
//!
//! Every successful response carries a `pdf_parse_strategy` metadata
//! field (`"strict"` / `"repaired_xref"` / `"raw_scan"` / `"pdf_extract"`).
//! Tiers 2–4 also set `pdf_parse_partial = true` and
//! `pdf_parse_error = "<original lopdf error>"`. Tier 4 is the most
//! important opt-in for shops processing Lucidchart or Word-print
//! exports.
//!
//! ```no_run
//! # #[cfg(feature = "pdf")] {
//! let result = omniparse::extract_from_path("document.pdf")?;
//! if let Some(strategy) = result.metadata.get("pdf_parse_strategy") {
//!     println!("PDF parsed via tier: {strategy:?}");
//! }
//! # }
//! # Ok::<(), omniparse::Error>(())
//! ```
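//!
//! Downstream code may want to branch on which tier produced the text. The
//! sketch below maps the four documented `pdf_parse_strategy` values to an
//! enum and flags tiers 2 to 4 as partial. `PdfTier`, `parse_tier`, and
//! `is_partial` are hypothetical names, and the sketch works on a plain
//! `&str`; how to obtain that string from a `MetadataValue` is not shown
//! here.
//!
//! ```rust
//! /// Hypothetical mirror of the documented `pdf_parse_strategy` values.
//! #[derive(Debug, Clone, Copy, PartialEq, Eq)]
//! enum PdfTier {
//!     Strict,
//!     RepairedXref,
//!     RawScan,
//!     PdfExtract,
//! }
//!
//! /// Parse the metadata string; unknown values yield `None`.
//! fn parse_tier(strategy: &str) -> Option<PdfTier> {
//!     match strategy {
//!         "strict" => Some(PdfTier::Strict),
//!         "repaired_xref" => Some(PdfTier::RepairedXref),
//!         "raw_scan" => Some(PdfTier::RawScan),
//!         "pdf_extract" => Some(PdfTier::PdfExtract),
//!         _ => None,
//!     }
//! }
//!
//! /// Every tier other than `strict` is documented as a partial parse.
//! fn is_partial(tier: PdfTier) -> bool {
//!     tier != PdfTier::Strict
//! }
//! ```
//!
//! An indexing pipeline could, for instance, store text from partial tiers
//! with a lower trust score or queue those files for manual review.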
//!
//! ## Web service example
//!
//! See `examples/web_service_prod.rs` for a Cloud Run-ready Axum service
//! that wraps this library: Cloud Logging JSON output, Prometheus
//! `/metrics`, `/live` + `/ready` probes, body-size + timeout +
//! concurrency limits, panic catcher, graceful shutdown, and a
//! `--healthcheck` mode for distroless containers. The published
//! Docker image uses this binary as its `ENTRYPOINT`.
//!
//! ## Quickstart
//!
//! ```no_run
//! use omniparse::extract_from_path;
//!
//! let result = extract_from_path("document.pdf")?;
//! println!("MIME type: {}", result.mime_type);
//! println!("Content: {:?}", result.content);
//! # Ok::<(), omniparse::Error>(())
//! ```
//!
//! ## Extract from HTML
//!
//! ```no_run
//! use omniparse::extract_from_path;
//!
//! let result = extract_from_path("webpage.html")?;
//! if let Some(title) = result.metadata.get("title") {
//!     println!("Page title: {:?}", title);
//! }
//! // Since v0.3: OpenGraph, Twitter, canonical URL, and heading-count keys are also available.
//! if let Some(og_title) = result.metadata.get("og_title") {
//!     println!("og:title = {:?}", og_title);
//! }
//! # Ok::<(), omniparse::Error>(())
//! ```
//!
//! ## Extract from spreadsheets
//!
//! ```no_run
//! use omniparse::extract_from_path;
//!
//! // Works with XLSX, XLS, and ODS
//! let result = extract_from_path("data.xlsx")?;
//! if let Some(sheet_count) = result.metadata.get("sheet_count") {
//!     println!("Number of sheets: {:?}", sheet_count);
//! }
//! # Ok::<(), omniparse::Error>(())
//! ```
//!
//! ## Extract from bytes with MIME type hint
//!
//! ```no_run
//! use omniparse::extract_from_bytes;
//!
//! let data = std::fs::read("file.json")?;
//! let result = extract_from_bytes(&data, Some("application/json"))?;
//! # Ok::<(), omniparse::Error>(())
//! ```
//!
//! ## Check supported formats
//!
//! ```
//! use omniparse::{supported_mime_types, is_mime_supported};
//!
//! let types = supported_mime_types();
//! println!("Supported types: {:?}", types);
//!
//! if is_mime_supported("application/pdf") {
//!     println!("PDF is supported!");
//! }
//! ```
//!
//! ## OCR
//!
//! Off by default. One env var selects the backend at runtime:
//!
//! - `OMNIPARSE_OCR=classical` — pure-Rust classical pipeline (`ocr` feature)
//! - `OMNIPARSE_OCR=ml` — ML backend via `ocrs` + `rten` (`ocr-ml` feature)
//! - `OMNIPARSE_OCR=off` / unset — OCR disabled (image parsers extract EXIF only)
//!
//! When `OMNIPARSE_OCR` selects a backend, the image and PDF parsers route
//! through OCR automatically and populate the `ocr_status`,
//! `ocr_confidence`, and `ocr_applied` metadata fields.
//!
//! ```no_run
//! # #[cfg(feature = "ocr")] {
//! // OMNIPARSE_OCR=classical (or =ml) activates OCR for image parsers.
//! let result = omniparse::extract_from_path("photo.jpg")?;
//! if let Some(status) = result.metadata.get("ocr_status") {
//!     println!("ocr_status = {status:?}");
//! }
//! # }
//! # Ok::<(), omniparse::Error>(())
//! ```
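//!
//! The backend selection described above amounts to a small mapping from the
//! variable's value to a backend. The sketch below is an illustration only:
//! `OcrBackend`, `backend_from_value`, and `backend_from_env` are
//! hypothetical names, and treating unrecognized values as "off" is an
//! assumption rather than documented behavior.
//!
//! ```rust
//! /// Hypothetical mirror of the three documented `OMNIPARSE_OCR` settings.
//! #[derive(Debug, PartialEq, Eq)]
//! enum OcrBackend {
//!     Off,
//!     Classical,
//!     Ml,
//! }
//!
//! /// Map the raw variable value to a backend. Unset and `off` disable OCR;
//! /// this sketch also maps any unrecognized value to `Off`.
//! fn backend_from_value(value: Option<&str>) -> OcrBackend {
//!     match value.map(str::trim) {
//!         Some("classical") => OcrBackend::Classical,
//!         Some("ml") => OcrBackend::Ml,
//!         _ => OcrBackend::Off,
//!     }
//! }
//!
//! /// Read the setting from the process environment.
//! fn backend_from_env() -> OcrBackend {
//!     backend_from_value(std::env::var("OMNIPARSE_OCR").ok().as_deref())
//! }
//! ```
//!
//! Splitting the pure mapping from the environment read keeps the mapping
//! easy to unit-test without mutating process-wide state.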
//!
//! Direct library use of the classical engine:
//!
//! ```no_run
//! # #[cfg(feature = "ocr")] {
//! use omniparse::ocr::OcrEngine;
//! let engine = OcrEngine::new();
//! let image = image::open("page.png").unwrap();
//! let output = engine.recognize(image)?;
//! println!("{}", output.text);
//! # }
//! # Ok::<(), omniparse::Error>(())
//! ```
//!
//! ML backend (requires `ocr-ml` feature; pre-trained models are downloaded
//! and SHA-256-verified on first use, or pre-fetched via the CLI
//! `omniparse models download`):
//!
//! ```no_run
//! # #[cfg(feature = "ocr-ml")] {
//! let engine = omniparse::ocr::ml::MlOcrEngine::new()?;
//! let image = image::open("photo.jpg").unwrap();
//! let output = engine.recognize(image)?;
//! println!("{}", output.text);
//! # }
//! # Ok::<(), omniparse::Error>(())
//! ```
//!
//! See [`OCR_GUIDE.md`] for the model-cache CLI, training custom
//! prototypes, tuning, debugging, and the full env-var reference.
//!
//! [`SUPPORTED_FORMATS.md`]: https://github.com/sirhco/omniparse/blob/main/SUPPORTED_FORMATS.md
//! [`OCR_GUIDE.md`]: https://github.com/sirhco/omniparse/blob/main/OCR_GUIDE.md

pub mod core;
pub mod detection;
#[cfg(feature = "ocr")]
pub mod ocr;
pub mod parsers;
pub mod utils;

use std::path::Path;

// Re-export core types for convenience
pub use core::{Error, Result};
pub use core::result::{Content, ExtractionResult, Metadata, MetadataValue};

/// Extract text and metadata from a file at the specified path.
///
/// This function automatically detects the file type using magic bytes and content analysis,
/// then routes the file to the appropriate parser for extraction.
///
/// # Arguments
///
/// * `path` - Path to the file to extract content from
///
/// # Returns
///
/// Returns an `ExtractionResult` containing:
/// - The detected MIME type
/// - Extracted content (text or binary)
/// - Metadata fields (title, author, dates, etc.)
/// - Detection confidence score
///
/// # Errors
///
/// Returns an error if:
/// - The file cannot be read (IO error)
/// - The file format is not supported
/// - The file is corrupted or malformed
/// - Parsing fails for any reason
///
/// # Examples
///
/// ```no_run
/// use omniparse::extract_from_path;
///
/// let result = extract_from_path("document.pdf")?;
/// println!("Detected type: {}", result.mime_type);
///
/// // Borrow the content so `result` stays fully usable afterwards.
/// if let omniparse::Content::Text(text) = &result.content {
///     println!("Extracted text: {}", text);
/// }
///
/// if let Some(title) = result.metadata.title() {
///     println!("Title: {}", title);
/// }
/// # Ok::<(), omniparse::Error>(())
/// ```
pub fn extract_from_path(path: impl AsRef<Path>) -> Result<ExtractionResult> {
    let extractor = core::Extractor::new();
    extractor.extract_from_path(path)
}

/// Extract text and metadata from a byte slice.
///
/// This function allows extraction from in-memory data. You can optionally provide
/// a MIME type hint to skip type detection and use a specific parser directly.
///
/// # Arguments
///
/// * `data` - Byte slice containing the file data
/// * `mime_hint` - Optional MIME type hint (e.g., "application/pdf")
///
/// # Returns
///
/// Returns an `ExtractionResult` with extracted content and metadata.
///
/// # Errors
///
/// Returns an error if:
/// - The format is not supported
/// - The data is corrupted or malformed
/// - Parsing fails for any reason
///
/// # Examples
///
/// ```no_run
/// use omniparse::extract_from_bytes;
///
/// // Extract with automatic type detection
/// let data = std::fs::read("file.json")?;
/// let result = extract_from_bytes(&data, None)?;
///
/// // Extract with MIME type hint
/// let result = extract_from_bytes(&data, Some("application/json"))?;
/// # Ok::<(), omniparse::Error>(())
/// ```
pub fn extract_from_bytes(data: &[u8], mime_hint: Option<&str>) -> Result<ExtractionResult> {
    let extractor = core::Extractor::new();
    extractor.extract_from_bytes(data, mime_hint)
}

/// Get a list of all supported MIME types.
///
/// This function returns all MIME types that have registered parsers in the system.
/// You can use this to check what formats are supported before attempting extraction.
///
/// # Returns
///
/// A vector of MIME type strings (e.g., "application/pdf", "text/plain")
///
/// # Examples
///
/// ```
/// use omniparse::supported_mime_types;
///
/// let types = supported_mime_types();
/// println!("Omniparse supports {} formats", types.len());
/// for mime_type in types {
///     println!("  - {}", mime_type);
/// }
/// ```
pub fn supported_mime_types() -> Vec<String> {
    parsers::default_registry().supported_types()
}

/// Check if a specific MIME type is supported.
///
/// This is a convenience function to quickly check if a format can be processed
/// without needing to iterate through all supported types.
///
/// # Arguments
///
/// * `mime_type` - The MIME type to check (e.g., "application/pdf")
///
/// # Returns
///
/// `true` if the MIME type is supported, `false` otherwise
///
/// # Examples
///
/// ```
/// use omniparse::is_mime_supported;
///
/// if is_mime_supported("application/pdf") {
///     println!("PDF files are supported!");
/// }
///
/// if !is_mime_supported("application/x-custom") {
///     println!("Custom format is not supported");
/// }
/// ```
pub fn is_mime_supported(mime_type: &str) -> bool {
    parsers::default_registry().get_parser(mime_type).is_some()
}

/// Extract text and metadata from a file asynchronously.
///
/// This is the async version of [`extract_from_path`]. Only the file read is
/// asynchronous (via Tokio); type detection and parsing then run synchronously
/// on the calling task, so a large or complex input can still occupy the
/// executor thread while it is being parsed.
///
/// **Note:** This function is only available when the `async` feature is enabled.
///
/// # Arguments
///
/// * `path` - Path to the file to extract content from
///
/// # Returns
///
/// Returns an `ExtractionResult` containing the extracted content and metadata.
///
/// # Errors
///
/// Returns an error if:
/// - The file cannot be read (IO error)
/// - The file format is not supported
/// - The file is corrupted or malformed
/// - Parsing fails for any reason
///
/// # Examples
///
/// ```no_run
/// # #[cfg(feature = "async")]
/// # async fn example() -> Result<(), omniparse::Error> {
/// use omniparse::extract_from_path_async;
///
/// let result = extract_from_path_async("document.pdf").await?;
/// println!("Detected type: {}", result.mime_type);
/// # Ok(())
/// # }
/// ```
#[cfg(feature = "async")]
pub async fn extract_from_path_async(path: impl AsRef<Path>) -> Result<ExtractionResult> {
    let path = path.as_ref();

    // Read the whole file into memory using Tokio's async file I/O.
    let buffer = tokio::fs::read(path).await?;

    // Detection and parsing are synchronous: detection takes the path,
    // parsing runs on the bytes buffered above.
    let extractor = core::Extractor::new();
    let detection = extractor.detector.detect_from_path(path)?;

    // Look up the parser registered for the detected MIME type.
    let parser = extractor
        .registry
        .get_parser(&detection.mime_type)
        .ok_or_else(|| Error::UnsupportedFormat(detection.mime_type.clone()))?;

    let mut result = parser.parse(&buffer, &detection.mime_type)?;
    result.detection_confidence = detection.confidence;
    result.mime_type = detection.mime_type;

    Ok(result)
}