deformat

Extracts plain text from HTML, PDF, and other document formats. It operates on in-memory &str and &[u8] inputs and performs no network I/O. The only filesystem access is pdf::extract_file, which reads a PDF from a path.

Supported formats

Format	Input	Feature flag	Extractor
HTML (tag strip)	`&str`	(none -- always available)	`html::strip_to_text`
HTML (layout-aware)	`&str`	`html2text`	`extract_html2text`
HTML (article)	`&str`	`readability`	`extract_readable`
PDF	`&Path` or `&[u8]`	`pdf`	`pdf::extract_file`, `pdf::extract_bytes`
Plain text / Markdown	`&str`	(none)	passthrough

The default build depends only on memchr.

Install

cargo add deformat                                        # minimal
cargo add deformat --features readability,html2text,pdf   # all extractors

[dependencies]
deformat = { version = "0.4.1", features = ["readability", "html2text"] }

Usage

Auto-detect and extract

use deformat::{extract, Format};

let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);

All extraction functions return an Extracted struct:

pub struct Extracted {
    pub text: String,
    pub format: Format,
    pub metadata: HashMap<String, String>,  // e.g. "extractor", "title", "excerpt"
}

HTML strategies

// 1. Tag stripping (always available, fast)
let text = deformat::html::strip_to_text("<p>Hello <b>world</b>!</p>");
assert_eq!(text, "Hello world!");

// Standalone entity decoding
// &eacute; decodes to the single precomposed code point U+00E9 (é)
assert_eq!(deformat::html::decode_entities("Caf&eacute;"), "Caf\u{00E9}");

// 2. Layout-aware DOM conversion (feature: html2text)
let result = deformat::extract_html2text("<table><tr><td>A</td></tr></table>", 80);

// 3. Article extraction via Mozilla Readability (feature: readability)
//    Falls back to tag stripping if the extracted content is too short (< 50 chars).
//    `html` is a &str holding the page source.
let result = deformat::extract_readable(html, Some("https://example.com/article"));
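
The fallback described in strategy 3 can be pictured as a simple length check. The sketch below is not the crate's code: the function name article_or_fallback, the returned labels, and the choice to count trimmed characters rather than bytes are all assumptions made for illustration. It uses only the standard library.

```rust
/// Hypothetical helper: keep the article text when it is long enough,
/// otherwise run the fallback extractor.
/// Returns the chosen text plus a label naming which path produced it.
pub fn article_or_fallback<F>(
    article: Option<String>,
    min_chars: usize,
    fallback: F,
) -> (String, &'static str)
where
    F: FnOnce() -> String,
{
    match article {
        // Assumption: length is measured in Unicode scalar values after trimming.
        Some(text) if text.trim().chars().count() >= min_chars => (text, "article"),
        // Too short, or nothing extracted: the closure runs only on this path.
        _ => (fallback(), "fallback"),
    }
}
```

Taking the fallback as a closure means the cheaper tag-stripping pass is only executed when the article extraction result is rejected.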

PDF extraction

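// Both calls need the pdf feature and are fallible (note the `?`),
// so they belong in a function that can propagate errors.

// From a file path
let result = deformat::pdf::extract_file(std::path::Path::new("report.pdf"))?;

// From bytes already in memory (pdf_bytes is loaded by the caller)
let result = deformat::pdf::extract_bytes(&pdf_bytes)?;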

Format detection

use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;

assert!(is_html("<!DOCTYPE html><html>..."));
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);
assert_eq!(detect_bytes(b"%PDF-1.4 ..."), Format::Pdf);
assert_eq!(detect_path("report.pdf"), Format::Pdf);
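
For intuition, byte-level detection of this kind usually comes down to checking a few leading signatures. The following is a minimal sketch under stated assumptions, not the crate's implementation: SketchFormat and sniff_bytes are hypothetical names, the 512-byte prefix is an arbitrary choice, and the real detect module may look at more signals.

```rust
/// Hypothetical stand-in for the crate's Format enum; it mirrors only the
/// variants named in this README.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SketchFormat {
    Html,
    Pdf,
    PlainText,
}

/// Guess a format from the leading bytes of a document.
/// Assumption: PDFs start with the "%PDF-" magic number, and HTML is
/// recognized by a doctype, an <html> tag, or any leading tag-like token.
pub fn sniff_bytes(bytes: &[u8]) -> SketchFormat {
    if bytes.starts_with(b"%PDF-") {
        return SketchFormat::Pdf;
    }
    // Inspect only a small prefix so large inputs stay cheap to classify.
    let prefix = &bytes[..bytes.len().min(512)];
    let lowered = String::from_utf8_lossy(prefix).to_ascii_lowercase();
    let head = lowered.trim_start();

    // "<" followed by an ASCII letter, with a ">" somewhere after it.
    let looks_like_tag = {
        let mut chars = head.chars();
        chars.next() == Some('<')
            && chars.next().map_or(false, |c| c.is_ascii_alphabetic())
            && head.contains('>')
    };

    if head.starts_with("<!doctype html") || head.starts_with("<html") || looks_like_tag {
        SketchFormat::Html
    } else {
        SketchFormat::PlainText
    }
}
```

A heuristic like this will misclassify some inputs, for example plain text that happens to begin with a tag-like token, which is one reason to keep the explicit per-format extractors available.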

HTML tag stripping details

html::strip_to_text handles: tag removal, script/style/noscript content removal, semantic element filtering (<nav>, <header>, <footer>, <aside>, <form>, etc.), ~300 named HTML entities (Latin, Greek, math, typography), numeric/hex character references, Windows-1252 C1 range mapping, CJK ruby annotation stripping, Wikipedia boilerplate removal, reference marker stripping ([1], [edit]), image alt text extraction, and whitespace collapsing.
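
To make the numeric reference and Windows-1252 C1 items concrete: the HTML standard maps numeric character references in the range 0x80 to 0x9F to the characters Windows-1252 assigns to those bytes, so that &#150; yields an en dash rather than a control character. The sketch below shows that mapping using only the standard library. The names cp1252_c1 and decode_numeric_ref are hypothetical, the parsing is more lenient than a real HTML tokenizer, and this is not taken from the crate's source.

```rust
/// Map a code point in the C1 range (0x80..=0x9F) to the character that
/// Windows-1252 assigns to that byte. Bytes that Windows-1252 leaves
/// unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) return None.
pub fn cp1252_c1(code: u32) -> Option<char> {
    // One entry per byte from 0x80 to 0x9F; 0 marks an unassigned byte.
    const TABLE: [u32; 32] = [
        0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0,
        0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178,
    ];
    if !(0x80..=0x9F).contains(&code) {
        return None;
    }
    match TABLE[(code - 0x80) as usize] {
        0 => None,
        mapped => char::from_u32(mapped),
    }
}

/// Decode the body of a numeric character reference, i.e. the text between
/// "&#" and ";" (for example "233", "x2014", or "150").
/// Returns None for unparsable input and for invalid code points.
pub fn decode_numeric_ref(body: &str) -> Option<char> {
    let code = match body.strip_prefix('x').or_else(|| body.strip_prefix('X')) {
        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
        None => body.parse::<u32>().ok()?,
    };
    // Prefer the Windows-1252 remapping, then fall back to the raw code point.
    cp1252_c1(code).or_else(|| char::from_u32(code))
}
```

With this sketch, decode_numeric_ref("150") gives U+2013 (en dash) and decode_numeric_ref("x80") gives U+20AC (euro sign), while an ordinary reference such as "233" passes through as U+00E9.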

License

MIT OR Apache-2.0

deformat 0.4.2