deformat 0.1.0

# deformat

Extract plain text from HTML, PDF, and other document formats.

NER engines, LLM pipelines, and search indexers need plain text. `deformat`
sits upstream: it takes formatted documents and returns clean text. No network
I/O -- it operates on `&str` and `&[u8]` inputs.

## Quick start

```rust
use deformat::{extract, Format};

// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");
```

## Feature flags

All features are opt-in. The default build depends only on `once_cell` and
`regex`; heavier dependencies are pulled in only by the features below.

| Feature | Crate | What it adds |
|---------|-------|-------------|
| `readability` | `dom_smoothie` | Mozilla Readability article extraction |
| `html2text` | `html2text` | DOM-based HTML-to-text with layout awareness |
| `pdf` | `pdf-extract` | PDF text extraction |

```toml
[dependencies]
deformat = { version = "0.1", features = ["readability", "html2text"] }
```

## HTML extraction

Three strategies, from simplest to most capable:

1. **`html::strip_to_text`** (always available) -- fast regex/char-based tag
   stripping with HTML entity decoding, semantic element filtering (`<nav>`,
   `<header>`, `<footer>`, `<aside>`), and Wikipedia boilerplate removal.

2. **`extract_html2text`** (feature `html2text`) -- DOM-based conversion that
   preserves layout structure (tables, lists, indentation).

3. **`extract_readable`** (feature `readability`) -- Mozilla Readability
   algorithm that extracts the main article content, stripping navigation,
   sidebars, and boilerplate. Falls back to `strip_to_text` if extraction
   produces insufficient content.
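
To make the first strategy more concrete, here is a minimal, standard-library-only sketch of char-based tag stripping with entity decoding and semantic element filtering. It is not `deformat`'s implementation: the names `strip_tags_sketch`, `decode_entities`, and `SKIPPED` are hypothetical, only a handful of entities are decoded, and comments, `<script>` bodies, and block-level spacing are ignored.

```rust
// Hypothetical sketch, NOT deformat's code: a tiny tag stripper.

/// Elements whose content is dropped entirely (assumed set, taken from the list above).
const SKIPPED: [&str; 4] = ["nav", "header", "footer", "aside"];

/// Decode a handful of named entities; anything unrecognized is left as-is.
fn decode_entities(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&nbsp;", " ")
        .replace("&amp;", "&") // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

fn strip_tags_sketch(html: &str) -> String {
    let mut out = String::new();
    let mut skip_depth = 0usize; // nesting depth inside skipped elements
    let mut rest = html;

    while let Some(open) = rest.find('<') {
        // Keep the text before the tag unless we are inside a skipped element.
        if skip_depth == 0 {
            out.push_str(&rest[..open]);
        }
        let after = &rest[open + 1..];
        let Some(close) = after.find('>') else {
            // Unterminated tag: drop the remainder.
            rest = "";
            break;
        };
        let tag = &after[..close];
        let is_closing = tag.starts_with('/');
        // Tag name: strip a leading '/', stop at whitespace or a trailing '/'.
        let name = tag
            .trim_start_matches('/')
            .split(|c: char| c.is_whitespace() || c == '/')
            .next()
            .unwrap_or("")
            .to_ascii_lowercase();
        if SKIPPED.contains(&name.as_str()) {
            if is_closing {
                skip_depth = skip_depth.saturating_sub(1);
            } else if !tag.ends_with('/') {
                skip_depth += 1;
            }
        }
        rest = &after[close + 1..];
    }
    if skip_depth == 0 {
        out.push_str(rest);
    }

    // Collapse whitespace runs, then decode entities.
    let collapsed = out.split_whitespace().collect::<Vec<_>>().join(" ");
    decode_entities(&collapsed)
}

fn main() {
    let html = "<nav>Home | About</nav><p>Fish &amp; chips</p>";
    println!("{}", strip_tags_sketch(html)); // prints "Fish & chips"
}
```

A stripper like this is cheap and dependency-free, but it has no notion of document structure, which is the gap the DOM-based and Readability strategies are meant to fill.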

## Format detection

```rust
use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;

// Predicate check on string content
assert!(is_html("<!DOCTYPE html><html>..."));

// Detect the format from string content
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);

// Detect the format from a file path
assert_eq!(detect_path("report.pdf"), Format::Pdf);
```
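
For intuition, content-based detection of this kind can be approximated with a few cheap checks. The sketch below is an assumption about how such detection might work, not `deformat`'s actual logic: `SketchFormat` and the `*_sketch` functions are hypothetical names. It relies on the `%PDF-` header that PDF files start with, a simple "looks like a tag" test for HTML, and the file extension for paths.

```rust
// Hypothetical sketch, NOT deformat's code: heuristic format detection.
use std::path::Path;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SketchFormat {
    Html,
    Pdf,
    PlainText,
}

/// Treat input as HTML if it starts (after whitespace) with `<` followed by
/// a letter or `!`, e.g. `<p>`, `<html>`, `<!DOCTYPE html>`.
fn detect_str_sketch(s: &str) -> SketchFormat {
    let mut chars = s.trim_start().chars();
    match (chars.next(), chars.next()) {
        (Some('<'), Some(c)) if c.is_ascii_alphabetic() || c == '!' => SketchFormat::Html,
        _ => SketchFormat::PlainText,
    }
}

/// PDF files begin with the `%PDF-` header; otherwise fall back to the
/// string heuristic when the bytes are valid UTF-8.
fn detect_bytes_sketch(bytes: &[u8]) -> SketchFormat {
    if bytes.starts_with(b"%PDF-") {
        return SketchFormat::Pdf;
    }
    match std::str::from_utf8(bytes) {
        Ok(s) => detect_str_sketch(s),
        Err(_) => SketchFormat::PlainText,
    }
}

/// Guess the format from the file extension alone, without reading the file.
fn detect_path_sketch(path: &str) -> SketchFormat {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("pdf") => SketchFormat::Pdf,
        Some("html") | Some("htm") => SketchFormat::Html,
        _ => SketchFormat::PlainText,
    }
}

fn main() {
    println!("{:?}", detect_bytes_sketch(b"%PDF-1.7\n"));
    println!("{:?}", detect_str_sketch("<p>Hello</p>"));
    println!("{:?}", detect_path_sketch("report.pdf"));
}
```

Checking the byte header before attempting UTF-8 decoding matters because PDF bodies usually contain binary data that is not valid UTF-8.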

## License

MIT OR Apache-2.0