omniparse 0.4.1

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project

Omniparse is a Rust library + CLI for detecting file types and extracting text/metadata from 25+ formats (text, Office/OpenDocument, PDF, images, archives). Pure Rust, no FFI. Apache Tika inspired.

Published crate: `omniparse` on crates.io. Edition 2024.

## Common commands

```bash
# Build
cargo build                  # debug
cargo build --release        # optimized binary at target/release/omniparse

# Features (off by default)
cargo build --features async     # enables tokio-based extract_from_path_async
cargo build --features parallel  # enables rayon-based process_files_parallel
cargo build --all-features

# Tests
cargo test                              # all tests
cargo test --all-features               # includes async/parallel paths
cargo test --test integration_test      # single test file in tests/
cargo test parser_tests::test_pdf       # single test by name
cargo test -- --nocapture               # show println! output

# Lint / format
cargo fmt
cargo clippy --all-targets --all-features -- -D warnings

# Docs
cargo doc --open

# Examples (see examples/ — note Cargo is not configured with [[example]] entries,
# so run via: cargo run --example <name> only if declared; otherwise compile ad-hoc)
cargo run --release -- <file>                  # run CLI
cargo run --release -- --format json --parallel *.pdf
```

## Architecture

Extraction pipeline: **detect → dispatch → parse**. Single orchestrator (`core::Extractor`) wires a `TypeDetector` to a `ParserRegistry`.

- `src/lib.rs` — public API (`extract_from_path`, `extract_from_bytes`, `extract_from_path_async`, `supported_mime_types`, `is_mime_supported`). Each call constructs a fresh `Extractor` + `ParserRegistry::default()` — registry build is cheap but not cached; callers doing many extractions should hold their own `Extractor`.
- `src/core/` — `Extractor` (orchestrator), `Error`/`Result` (thiserror-based; note `Error::PartialExtraction` carries a `partial_result`), `ExtractionResult` + `Content` enum + `Metadata` map.
- `src/detection/` — `TypeDetector` combines magic bytes, content heuristics, and extension fallback. Returns `DetectionResult { mime_type, confidence, method }` where `confidence` ∈ [0,1] and reflects method reliability (magic > content > extension).
- `src/parsers/` — one module per category (`text`, `document`, `image`, `archive`). All implement the `Parser` trait (`supported_types`, `parse`, optional `parse_stream`, `name`). Must be `Send + Sync`.
- `src/parsers/mod.rs` — `ParserRegistry::default()` is the single source of truth for which parsers ship. **To add a new format**: implement `Parser`, add the module, then register the parser in `ParserRegistry::default()`. Nothing else discovers parsers.
- `src/cli/` — clap-derived `Cli` args + output formatters (json/yaml/text via serde_yaml/serde_json).
- `src/utils/parallel.rs` — `process_files_parallel` (requires `parallel` feature / rayon).
- `src/utils/streaming.rs` — stream helpers; parsers can override `parse_stream` for large files.

### Async caveat

`extract_from_path_async` in `lib.rs` reads the whole file into memory via tokio, then runs the sync parser. It's async I/O, not async parsing. Detection is run against the path (not bytes) — so the file must have an extension/magic the path-based detector recognizes.

### MIME dispatch

`Parser::supported_types()` returns string slices; `ParserRegistry::register` indexes them into a `HashMap<String, usize>`. Later registrations silently overwrite earlier ones for the same MIME. Legacy Office formats (DOC/XLS/PPT) have distinct parsers from modern (DOCX/XLSX/PPTX) — keep them separate.

## Tests

Tests live in `tests/` (integration style) — each file is a separate crate. Fixtures generated by `examples/create_*_fixtures.rs` into `test_data/`. If adding a new parser, add cases to `tests/parser_tests.rs` and `tests/parser_registration_test.rs` (the latter guards that `ParserRegistry::default()` actually includes it).

Performance tests in `tests/performance_test.rs` — targets documented in `PERFORMANCE_BENCHMARK_REPORT.md`.

## Conventions

- Errors: return `crate::core::Error` variants. Use `Error::UnsupportedFormat(mime)` when registry misses; `Error::CorruptedFile(msg)` for malformed input; `Error::PartialExtraction { message, partial_result }` when some content extracted before failure.
- Metadata: use `Metadata::set(key, MetadataValue::...)` with well-known keys (`title`, `author`, etc. — helpers on `Metadata`). Format-specific keys (`sheet_count`, `slide_count`) are free-form strings.
- Confidence score must be populated on `ExtractionResult` — `Extractor` overwrites parser's value with detector's, so parsers can set any placeholder.

## Docs index

`README.md`, `SUPPORTED_FORMATS.md`, `CHANGELOG.md`, `RELEASE_NOTES_v0.2.0.md`, `PERFORMANCE_BENCHMARK_REPORT.md`, `PERFORMANCE_DOCUMENTATION_INDEX.md`, `examples/README.md`, `examples/WEB_SERVICE_GUIDE.md`.