# deformat
Extract plain text from HTML, PDF, and other document formats.
NER engines, LLM pipelines, and search indexers need plain text. `deformat`
sits upstream: it takes formatted documents and returns clean text. No network
I/O -- it operates on `&str` and `&[u8]` inputs.
## Quick start
```rust
use deformat::{extract, Format};
// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);
// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");
```
## Feature flags
All features are opt-in. The default build has one dependency: `memchr`.

| Feature | Dependency | Description |
|---------|------------|-------------|
| `readability` | `dom_smoothie` | Mozilla Readability article extraction |
| `html2text` | `html2text` | DOM-based HTML-to-text with layout awareness |
| `pdf` | `pdf-extract` | PDF text extraction |

```toml
[dependencies]
deformat = { version = "0.4", features = ["readability", "html2text"] }
```
## HTML extraction
Three strategies, from simplest to most capable:
1. **`html::strip_to_text`** (always available) -- fast byte-level tag stripping
with ~300 named HTML entities (ISO-8859-1, Latin Extended-A for Central/Eastern
European names, Greek, math, typography), Windows-1252 C1 range mapping,
CJK ruby annotation stripping, semantic element filtering, image alt text
extraction, and Wikipedia boilerplate removal.
2. **`extract_html2text`** (feature `html2text`) -- DOM-based conversion that
preserves layout structure (tables, lists, indentation).
3. **`extract_readable`** (feature `readability`) -- Mozilla Readability
algorithm that extracts the main article content, stripping navigation,
sidebars, and boilerplate. Falls back to `strip_to_text` if extraction
produces insufficient content.
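
To give a feel for what the simplest strategy does, here is a deliberately naive sketch of tag stripping. It is not the crate's implementation: `strip_tags_sketch` is a hypothetical name, it walks `char`s rather than bytes, and it skips everything the real `strip_to_text` is described as handling (entities, ruby annotations, semantic filtering, alt text, boilerplate removal).

```rust
/// Hypothetical helper, not part of `deformat`: drops everything between
/// `<` and `>` and collapses runs of whitespace into single spaces.
fn strip_tags_sketch(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    // Set when whitespace was seen since the last emitted character.
    let mut pending_space = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            // Tag names and attributes are discarded.
            _ if in_tag => {}
            c if c.is_whitespace() => pending_space = true,
            c => {
                // Emit at most one space between words, none at the start.
                if pending_space && !out.is_empty() {
                    out.push(' ');
                }
                pending_space = false;
                out.push(c);
            }
        }
    }
    out
}
```

With this sketch, `strip_tags_sketch("<p>Hello <b>world</b>!</p>")` yields `Hello world!`. Note that adjacent block elements such as `<p>a</p><p>b</p>` would run together as `ab`, which is one reason a production stripper needs to know about element semantics.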
### Entity decoding
```rust
// Standalone entity decoding (useful for attribute values, etc.)
assert_eq!(deformat::html::decode_entities("Caf&eacute;"), "Caf\u{00E9}");
assert_eq!(deformat::html::decode_entities("&copy; 2026"), "\u{00A9} 2026");
```
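
The Windows-1252 C1 range mapping mentioned under HTML extraction can be illustrated with a small standalone sketch. Numeric references such as `&#150;` point into the C1 control range (0x80 to 0x9F), and the HTML standard says to reinterpret them as Windows-1252 characters. The names `C1_TO_CP1252` and `numeric_ref_to_char` are hypothetical and are not taken from the crate. Unlike a full HTML parser, this sketch returns `None` for invalid code points instead of substituting U+FFFD.

```rust
/// Windows-1252 interpretations of code points 0x80..=0x9F, following the
/// HTML standard's numeric character reference table. `None` marks the five
/// positions that Windows-1252 leaves undefined.
const C1_TO_CP1252: [Option<char>; 32] = [
    Some('\u{20AC}'), None, Some('\u{201A}'), Some('\u{0192}'), // 0x80-0x83
    Some('\u{201E}'), Some('\u{2026}'), Some('\u{2020}'), Some('\u{2021}'), // 0x84-0x87
    Some('\u{02C6}'), Some('\u{2030}'), Some('\u{0160}'), Some('\u{2039}'), // 0x88-0x8B
    Some('\u{0152}'), None, Some('\u{017D}'), None, // 0x8C-0x8F
    None, Some('\u{2018}'), Some('\u{2019}'), Some('\u{201C}'), // 0x90-0x93
    Some('\u{201D}'), Some('\u{2022}'), Some('\u{2013}'), Some('\u{2014}'), // 0x94-0x97
    Some('\u{02DC}'), Some('\u{2122}'), Some('\u{0161}'), Some('\u{203A}'), // 0x98-0x9B
    Some('\u{0153}'), None, Some('\u{017E}'), Some('\u{0178}'), // 0x9C-0x9F
];

/// Hypothetical helper: resolves the number inside `&#NNN;` to a character.
fn numeric_ref_to_char(code: u32) -> Option<char> {
    if (0x80..=0x9F).contains(&code) {
        // Use the Windows-1252 character when one exists; otherwise keep
        // the raw C1 control character.
        C1_TO_CP1252[(code - 0x80) as usize].or_else(|| char::from_u32(code))
    } else {
        // Surrogates and values above U+10FFFF yield `None`.
        char::from_u32(code)
    }
}
```

Under this mapping, `&#150;` (0x96) becomes an en dash (U+2013) and `&#128;` (0x80) becomes the euro sign, which matches how browsers render legacy pages authored on Windows.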
## Format detection
```rust
use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;
assert!(is_html("<!DOCTYPE html><html>..."));
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);
assert_eq!(detect_path("report.pdf"), Format::Pdf);
```
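
As a rough illustration of how content-based detection can work, the sketch below checks for the PDF magic bytes and a couple of common HTML openers. It is an assumption-laden stand-in, not the crate's logic: `SniffedFormat` and `sniff_bytes` are hypothetical names, and the real detector is evidently more permissive, since the quick start shows a bare `<p>` fragment being detected as HTML.

```rust
/// Hypothetical stand-in for `deformat::Format`, reduced to three variants.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SniffedFormat {
    Html,
    Pdf,
    PlainText,
}

/// Hypothetical helper: guesses a format from the first bytes of the input.
fn sniff_bytes(input: &[u8]) -> SniffedFormat {
    // PDF files begin with the "%PDF-" magic bytes.
    if input.starts_with(b"%PDF-") {
        return SniffedFormat::Pdf;
    }
    // Skip leading whitespace, then lowercase a short prefix so the
    // comparison below is case-insensitive.
    let start = input
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(input.len());
    let head: Vec<u8> = input[start..]
        .iter()
        .take(64)
        .map(u8::to_ascii_lowercase)
        .collect();
    if head.starts_with(b"<!doctype html") || head.starts_with(b"<html") {
        return SniffedFormat::Html;
    }
    SniffedFormat::PlainText
}
```

Inspecting only a bounded prefix keeps the check cheap on large inputs, at the cost of missing HTML fragments that do not start with a doctype or an `<html>` tag.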
## License
MIT OR Apache-2.0