De-format: extract plain text from HTML, PDF, and other document formats.
NER engines, LLM pipelines, and search indexers need plain text.
deformat sits upstream: it takes formatted documents and returns clean
text. No I/O – it operates on &str and &[u8] inputs.
§Quick start
use deformat::{extract, Format};
// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);
// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);
§Feature flags
All features are opt-in. The default build depends only on once_cell and
regex (used for HTML entity decoding and boilerplate removal).
| Feature | Crate | What it adds |
|---|---|---|
| readability | dom_smoothie | Mozilla Readability article extraction |
| html2text | html2text | DOM-based HTML-to-text with layout awareness |
| pdf | pdf-extract | PDF text extraction from file paths |
Re-exports§
Modules§
- detect
- Format detection from content bytes, strings, and file extensions.
- error
- Error types for extraction failures.
- html
- HTML-to-text extraction.
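To illustrate what "format detection from content bytes, strings, and file extensions" can mean in practice, here is a minimal, std-only sketch. It is not the detect module's implementation: `SketchFormat`, `sniff_bytes`, `sniff_extension`, and `looks_like_tag` are hypothetical names invented for this example, and the heuristics (the `%PDF-` magic prefix, a leading doctype or tag) are assumptions about one reasonable approach.

```rust
use std::path::Path;

/// Hypothetical stand-in for a format tag; NOT the crate's `Format` enum.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SketchFormat {
    Html,
    Pdf,
    PlainText,
}

/// True if `s` begins with something shaped like an opening tag, e.g. `<p>`.
fn looks_like_tag(s: &str) -> bool {
    let mut chars = s.chars();
    chars.next() == Some('<')
        && chars.next().map_or(false, |c| c.is_ascii_alphabetic())
        && s.contains('>')
}

/// Guess a format from the leading bytes of the content (assumed heuristics).
fn sniff_bytes(bytes: &[u8]) -> SketchFormat {
    // PDF files begin with the magic bytes "%PDF-".
    if bytes.starts_with(b"%PDF-") {
        return SketchFormat::Pdf;
    }
    // Inspect only a short prefix, lowercased, for HTML markers.
    let prefix_len = bytes.len().min(512);
    let head = String::from_utf8_lossy(&bytes[..prefix_len]).to_ascii_lowercase();
    let trimmed = head.trim_start();
    if trimmed.starts_with("<!doctype html") || looks_like_tag(trimmed) {
        return SketchFormat::Html;
    }
    SketchFormat::PlainText
}

/// Guess a format from a file name's extension; `None` if unrecognized.
fn sniff_extension(path: &str) -> Option<SketchFormat> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "html" | "htm" => Some(SketchFormat::Html),
        "pdf" => Some(SketchFormat::Pdf),
        "txt" => Some(SketchFormat::PlainText),
        _ => None,
    }
}
```

Content sniffing and extension lookup are kept as separate functions here so a caller can decide which signal to trust when they disagree; whether the crate makes the same choice is not stated in this page.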
Structs§
- Extracted
- Extracted text with metadata about the source document.
Functions§
- extract
- Extract plain text from content, auto-detecting the format.
- extract_as
- Extract plain text with an explicit format override.