deformat
Extract plain text from HTML, PDF, and other document formats.
NER engines, LLM pipelines, and search indexers need plain text. deformat
sits upstream: it takes formatted documents and returns clean text. No network
I/O -- it operates on &str and &[u8] inputs.
Quick start
use ;
// Auto-detect format and extract text
let result = extract;
assert_eq!;
assert_eq!;
// Plain text passes through unchanged
let result = extract;
assert_eq!;
Feature flags
All features are opt-in. The default build has zero heavy dependencies
(only once_cell and regex).
| Feature | Crate | What it adds |
|---|---|---|
| readability | dom_smoothie | Mozilla Readability article extraction |
| html2text | html2text | DOM-based HTML-to-text with layout awareness |
| pdf | pdf-extract | PDF text extraction |
[dependencies]
deformat = { version = "0.1", features = ["readability", "html2text"] }
HTML extraction
Three strategies, from simplest to most capable:
- html::strip_to_text (always available) -- fast regex/char-based tag stripping with HTML entity decoding, semantic element filtering (<nav>, <header>, <footer>, <aside>), and Wikipedia boilerplate removal.
- extract_html2text (feature html2text) -- DOM-based conversion that preserves layout structure (tables, lists, indentation).
- extract_readable (feature readability) -- Mozilla Readability algorithm that extracts the main article content, stripping navigation, sidebars, and boilerplate. Falls back to strip_to_text if extraction produces insufficient content.
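
To make the simplest strategy and the fallback rule more concrete, here is a minimal std-only sketch of char-based tag stripping with entity decoding, plus a "use the article text unless it is too short" fallback. It is not deformat's implementation: the names strip_tags_sketch, decode_entities_sketch, and readable_or_strip_sketch and the min_chars threshold are hypothetical, and the sketch skips semantic element filtering and boilerplate removal.

```rust
/// Hypothetical helper (not deformat's API): decode a few common entities.
/// `&amp;` is decoded last so that "&amp;lt;" becomes "&lt;" and not "<".
fn decode_entities_sketch(text: &str) -> String {
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&nbsp;", " ")
        .replace("&amp;", "&")
}

/// Hypothetical helper (not deformat's API): drop everything between '<' and
/// '>', decode entities, then collapse runs of whitespace to single spaces.
fn strip_tags_sketch(html: &str) -> String {
    let mut text = String::with_capacity(html.len());
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => text.push(ch),
            _ => {} // character inside a tag: skip it
        }
    }
    let decoded = decode_entities_sketch(&text);
    decoded.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical fallback rule: keep the article text when it has at least
/// `min_chars` characters (an assumed threshold), otherwise strip the raw HTML.
fn readable_or_strip_sketch(article: Option<&str>, html: &str, min_chars: usize) -> String {
    match article {
        Some(text) if text.trim().chars().count() >= min_chars => text.trim().to_string(),
        _ => strip_tags_sketch(html),
    }
}
```

Entities are decoded only after tags are removed, so an escaped `&lt;b&gt;` in the text survives as a literal `<b>` instead of being treated as markup. The sketch treats every `<` as the start of a tag and does not add a break between adjacent block elements, which is the kind of edge case a real extractor has to handle.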
Format detection
use ;
use Format;
assert!;
assert_eq!;
assert_eq!;
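
For intuition about what auto-detection on a byte buffer can look like, here is a small std-only sketch that checks the PDF magic bytes and then sniffs a short prefix for HTML markers. The heuristics deformat actually uses are not described here, so SketchFormat, detect_sketch, the 512-byte window, and the tag list are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for a format enum; not deformat's `Format` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchFormat {
    Pdf,
    Html,
    PlainText,
}

/// Hypothetical detector (not deformat's API): guess the format from raw bytes.
fn detect_sketch(bytes: &[u8]) -> SketchFormat {
    // PDF files begin with the magic bytes "%PDF-".
    if bytes.starts_with(b"%PDF-") {
        return SketchFormat::Pdf;
    }

    // Inspect only a short prefix (assumed window of 512 bytes), case-insensitively.
    let prefix = &bytes[..bytes.len().min(512)];
    let lowered = String::from_utf8_lossy(prefix).to_ascii_lowercase();
    let head = lowered.trim_start();

    // A doctype, an <html> root, or a few common tags count as HTML.
    let looks_like_html = head.starts_with("<!doctype html")
        || head.starts_with("<html")
        || ["<head", "<body", "<div", "<p>"]
            .iter()
            .any(|tag| head.contains(*tag));

    if looks_like_html {
        SketchFormat::Html
    } else {
        // Anything unrecognized is treated as plain text and passed through.
        SketchFormat::PlainText
    }
}
```

Falling back to plain text for anything unrecognized matches the pass-through behavior shown in the quick start, and checking only a prefix keeps detection cheap on large inputs.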
License
MIT OR Apache-2.0