De-format: extract plain text from HTML, PDF, and other document formats.
NER engines, LLM pipelines, and search indexers need plain text.
deformat sits upstream: it takes formatted documents and returns clean
text. No I/O – it operates on &str and &[u8] inputs.
§Quick start
use deformat::{extract, Format};
// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);
// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);
§Feature flags
All features are opt-in. The default build has one dependency: memchr
(SIMD-accelerated byte scanning).
| Feature | Crate | What it adds |
|---|---|---|
| readability | dom_smoothie | Mozilla Readability article extraction |
| html2text | html2text | DOM-based HTML-to-text with layout awareness |
| pdf | pdf-extract | PDF text extraction from file paths |
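Because every feature is opt-in, enabling one should only require listing it in Cargo.toml. A minimal sketch follows; the feature names come from the table above, while the `"*"` version requirement is a placeholder assumption and should be replaced with the version you actually depend on.

```toml
[dependencies]
# "*" is a placeholder version requirement, not a recommendation.
deformat = { version = "*", features = ["readability", "pdf"] }
```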
Re-exports§
Modules§
- detect
- Format detection from content bytes, strings, and file extensions.
- error
- Error types for extraction failures.
- html
- HTML-to-text extraction.
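The detect module is described as detecting formats from content bytes, strings, and file extensions. To illustrate what that kind of sniffing can look like, here is a standalone sketch. It is not the crate's implementation or API: `SketchFormat`, `sniff_bytes`, and `sniff_extension` are hypothetical names, and the heuristics (the `%PDF-` magic header, a leading tag-like `<`) are assumptions chosen for illustration.

```rust
/// Hypothetical format tag for this sketch; not the crate's `Format` enum.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SketchFormat {
    Html,
    Pdf,
    PlainText,
}

/// Guess a format from the leading bytes of a document (hypothetical helper).
fn sniff_bytes(bytes: &[u8]) -> SketchFormat {
    // PDF files begin with the magic header "%PDF-".
    if bytes.starts_with(b"%PDF-") {
        return SketchFormat::Pdf;
    }
    // Skip leading ASCII whitespace before looking for markup.
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let rest = &bytes[start..];
    // Treat "<" followed by a letter, "!" or "/" as HTML-like markup.
    let looks_like_tag = rest.len() >= 2
        && rest[0] == b'<'
        && (rest[1].is_ascii_alphabetic() || rest[1] == b'!' || rest[1] == b'/');
    if looks_like_tag {
        SketchFormat::Html
    } else {
        SketchFormat::PlainText
    }
}

/// Guess a format from a file name's extension (hypothetical helper).
/// Returns `None` when there is no extension or it is not recognized.
fn sniff_extension(path: &str) -> Option<SketchFormat> {
    let ext = path.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "html" | "htm" => Some(SketchFormat::Html),
        "pdf" => Some(SketchFormat::Pdf),
        "txt" => Some(SketchFormat::PlainText),
        _ => None,
    }
}
```

Both helpers take borrowed inputs and perform no I/O, in keeping with the crate's stated design. A real detector would likely need more signals than this, such as a byte-order mark or a doctype further into the input.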
Structs§
- Extracted
- Extracted text with metadata about the source document.
Functions§
- extract
- Extract plain text from content, auto-detecting the format.
- extract_as
- Extract plain text with an explicit format override.