deformat
Extracts plain text from HTML, PDF, and other document formats. Operates on
&str and &[u8] inputs -- no network I/O, no filesystem access (except
PDF file extraction).
Supported formats
| Format | Input | Feature flag | Extractor |
|---|---|---|---|
| HTML (tag strip) | &str | (none -- always available) | html::strip_to_text |
| HTML (layout-aware) | &str | html2text | extract_html2text |
| HTML (article) | &str | readability | extract_readable |
| PDF | &Path or &[u8] | pdf | pdf::extract_file, pdf::extract_bytes |
| Plain text / Markdown | &str | (none) | passthrough |
The default build depends only on memchr.
Install
[dependencies]
deformat = { version = "0.4.1", features = ["readability", "html2text"] }
Usage
Auto-detect and extract
use ;
let result = extract;
assert_eq!;
assert_eq!;
// Plain text passes through unchanged
let result = extract;
assert_eq!;
assert_eq!;
All extraction functions return an Extracted struct.
HTML strategies
// 1. Tag stripping (always available, fast)
let text = strip_to_text;
assert_eq!;
// Standalone entity decoding
assert_eq!;
// 2. Layout-aware DOM conversion (feature: html2text)
let result = extract_html2text;
// 3. Article extraction via Mozilla Readability (feature: readability)
// Falls back to tag stripping if content is too short (< 50 chars).
let result = extract_readable;
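The fallback rule mentioned for article extraction (use the tag-stripped text when the article result is shorter than 50 characters) can be illustrated with a small standalone sketch. This is not the crate's implementation: `prefer_or_fallback` is a hypothetical helper, and measuring length in trimmed Unicode scalar values rather than bytes is an assumption.

```rust
/// Hypothetical helper, not part of deformat: keep the primary extraction
/// result only if it has at least `min_chars` characters after trimming,
/// otherwise run the fallback extractor. The returned flag is `true` when
/// the fallback was used.
fn prefer_or_fallback<F>(primary: Option<String>, min_chars: usize, fallback: F) -> (String, bool)
where
    F: FnOnce() -> String,
{
    match primary {
        // Assumption: length is counted in chars, ignoring surrounding whitespace.
        Some(text) if text.trim().chars().count() >= min_chars => (text, false),
        // Primary result missing or too short: compute the fallback lazily.
        _ => (fallback(), true),
    }
}
```

Taking the fallback as a closure means the cheaper tag-stripping pass only runs when the primary result is rejected.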
PDF extraction
// From file path (feature: pdf)
let result = extract_file?;
// From bytes in memory
let result = extract_bytes?;
Format detection
use ;
use Format;
assert!;
assert_eq!;
assert_eq!;
assert_eq!;
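To give a feel for what content-based format detection involves, here is a minimal standalone sketch. It does not reproduce the crate's detection logic: `SniffedFormat` and `sniff_format` are hypothetical names, and the heuristics (the `%PDF-` magic prefix, a few HTML markers in the first 1024 bytes) are assumptions chosen for illustration.

```rust
/// Hypothetical result type for this sketch; not the crate's `Format`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SniffedFormat {
    Pdf,
    Html,
    PlainText,
}

/// Guess the format of a byte buffer from its leading content.
fn sniff_format(bytes: &[u8]) -> SniffedFormat {
    // PDF files start with the magic bytes "%PDF-".
    if bytes.starts_with(b"%PDF-") {
        return SniffedFormat::Pdf;
    }
    // Inspect only a bounded prefix so large inputs stay cheap to classify.
    let head = &bytes[..bytes.len().min(1024)];
    let head = String::from_utf8_lossy(head).to_ascii_lowercase();
    let head = head.trim_start();
    // Assumed HTML markers: a doctype, or an <html>/<head>/<body> tag.
    if head.starts_with("<!doctype html")
        || head.starts_with("<html")
        || head.contains("<head")
        || head.contains("<body")
    {
        SniffedFormat::Html
    } else {
        // Everything else, including Markdown, is treated as plain text.
        SniffedFormat::PlainText
    }
}
```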
HTML tag stripping details
html::strip_to_text handles: tag removal, script/style/noscript content removal,
semantic element filtering (<nav>, <header>, <footer>, <aside>, <form>,
etc.), ~300 named HTML entities (Latin, Greek, math, typography), numeric/hex character
references, Windows-1252 C1 range mapping, CJK ruby annotation stripping, Wikipedia
boilerplate removal, reference marker stripping ([1], [edit]), image alt text
extraction, and whitespace collapsing.
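Of the steps listed above, numeric/hex character references and the Windows-1252 C1 range mapping are the least obvious, so a standalone sketch may help. It follows the general HTML convention that references in the range 0x80..=0x9F are interpreted as Windows-1252 characters. `decode_numeric_ref` and `C1_MAP` are hypothetical names, and this is not the crate's code.

```rust
/// Windows-1252 replacements for code points 0x80..=0x9F. Bytes that
/// Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to
/// the corresponding C1 control character.
const C1_MAP: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

/// Decode the body of a numeric character reference, i.e. the text between
/// "&#" and ";" (for example "8212" or "x2014"). Returns `None` when the
/// body is not a valid decimal or hex number that fits in a u32.
fn decode_numeric_ref(body: &str) -> Option<char> {
    let hex = body.strip_prefix('x').or_else(|| body.strip_prefix('X'));
    let code = match hex {
        Some(digits) => {
            if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
                return None;
            }
            u32::from_str_radix(digits, 16).ok()?
        }
        None => {
            if body.is_empty() || !body.bytes().all(|b| b.is_ascii_digit()) {
                return None;
            }
            body.parse::<u32>().ok()?
        }
    };
    match code {
        // NUL is replaced with U+FFFD.
        0 => Some('\u{FFFD}'),
        // C1 range: reinterpret as Windows-1252.
        0x80..=0x9F => Some(C1_MAP[(code - 0x80) as usize]),
        // Surrogates and values above U+10FFFF are not valid chars.
        _ => Some(char::from_u32(code).unwrap_or('\u{FFFD}')),
    }
}
```

With this mapping, a legacy reference such as `&#150;` decodes to an en dash (U+2013) instead of an invisible control character.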
License
MIT OR Apache-2.0