deformat

Extract plain text from HTML, PDF, and other document formats.

NER engines, LLM pipelines, and search indexers need plain text. deformat sits upstream: it takes formatted documents and returns clean text. No network I/O -- it operates on &str and &[u8] inputs.

Quick start

use deformat::{extract, Format};

// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");

Feature flags

All features are opt-in. The default build depends only on once_cell and regex; the heavier dependencies listed below are pulled in only when the matching feature is enabled.

Feature	Crate	What it adds
`readability`	`dom_smoothie`	Mozilla Readability article extraction
`html2text`	`html2text`	DOM-based HTML-to-text with layout awareness
`pdf`	`pdf-extract`	PDF text extraction

[dependencies]
deformat = { version = "0.1", features = ["readability", "html2text"] }

HTML extraction

Three strategies, from simplest to most capable:

1. html::strip_to_text (always available) -- fast regex/char-based tag stripping with HTML entity decoding, semantic element filtering (<nav>, <header>, <footer>, <aside>), and Wikipedia boilerplate removal.
2. extract_html2text (feature html2text) -- DOM-based conversion that preserves layout structure (tables, lists, indentation).
3. extract_readable (feature readability) -- Mozilla Readability algorithm that extracts the main article content, stripping navigation, sidebars, and boilerplate. Falls back to strip_to_text if extraction produces insufficient content.
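
To make the simplest strategy concrete, here is a minimal, self-contained sketch of char-based tag stripping with semantic element filtering and entity decoding. It is an illustration of the general technique only, not deformat's actual implementation: the name strip_tags_sketch is hypothetical, only a handful of entities are decoded, and comments, scripts, and Wikipedia boilerplate are not handled.

```rust
/// Hypothetical helper (not part of deformat): drops tags, skips the content of
/// <nav>/<header>/<footer>/<aside>, then decodes a few common entities.
fn strip_tags_sketch(html: &str) -> String {
    const SKIPPED: [&str; 4] = ["nav", "header", "footer", "aside"];
    let mut out = String::new();
    let mut skip_depth = 0usize; // > 0 while inside a skipped element
    let mut rest = html;

    while let Some(open) = rest.find('<') {
        // Text before the tag is kept unless we are inside a skipped element.
        if skip_depth == 0 {
            out.push_str(&rest[..open]);
        }
        let after = &rest[open + 1..];
        let close = match after.find('>') {
            Some(i) => i,
            None => {
                // Unterminated tag: this sketch simply drops the remainder.
                rest = "";
                break;
            }
        };
        let tag = &after[..close];
        let (is_closing, body) = match tag.strip_prefix('/') {
            Some(b) => (true, b),
            None => (false, tag),
        };
        // Tag name = leading alphanumerics, compared case-insensitively.
        let name = body
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect::<String>()
            .to_ascii_lowercase();
        if SKIPPED.contains(&name.as_str()) {
            if is_closing {
                skip_depth = skip_depth.saturating_sub(1);
            } else {
                skip_depth += 1;
            }
        }
        rest = &after[close + 1..];
    }
    if skip_depth == 0 {
        out.push_str(rest);
    }

    // Decode "&amp;" last so that "&amp;lt;" becomes "&lt;" rather than "<".
    out.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}
```

Under these assumptions, strip_tags_sketch("<nav>Menu</nav><p>Hello <b>world</b>!</p>") yields "Hello world!". A single pass like this is cheap, but it cannot recover table or list layout, which is the gap the DOM-based strategies are meant to fill.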

Format detection

use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;

assert!(is_html("<!DOCTYPE html><html>..."));
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);
assert_eq!(detect_path("report.pdf"), Format::Pdf);
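
For readers curious how content and path sniffing of this kind can work, the following is a rough sketch under stated assumptions, not the crate's real detection logic. SketchFormat, sniff_bytes, sniff_path, and looks_like_html are hypothetical names; the sketch relies on the "%PDF-" header that PDF files begin with, a simple "does this contain something tag-shaped" heuristic for HTML, and the file extension for paths.

```rust
use std::path::Path;

/// Hypothetical stand-in for a format enum; only three cases are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchFormat {
    Html,
    Pdf,
    PlainText,
}

/// Heuristic: a '<' directly followed by a letter, '/' or '!', with a '>' later on.
fn looks_like_html(head: &str) -> bool {
    head.match_indices('<').any(|(i, _)| {
        let after = &head[i + 1..];
        let starts_tag = after
            .chars()
            .next()
            .map_or(false, |c| c.is_ascii_alphabetic() || c == '/' || c == '!');
        starts_tag && after.contains('>')
    })
}

/// Guess the format from the leading bytes of a document.
fn sniff_bytes(data: &[u8]) -> SketchFormat {
    // PDF files begin with the "%PDF-" header.
    if data.starts_with(b"%PDF-") {
        return SketchFormat::Pdf;
    }
    // Inspect only a prefix; lossy decoding tolerates a split multi-byte char.
    let head_len = data.len().min(512);
    let head = String::from_utf8_lossy(&data[..head_len]);
    if looks_like_html(&head) {
        SketchFormat::Html
    } else {
        SketchFormat::PlainText
    }
}

/// Guess the format from a file extension alone (the file is never opened).
fn sniff_path(path: &str) -> SketchFormat {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("pdf") => SketchFormat::Pdf,
        Some("html") | Some("htm") => SketchFormat::Html,
        _ => SketchFormat::PlainText,
    }
}
```

The tag-shaped check is deliberately conservative: text such as "a < b and c > d" stays plain text because the '<' is followed by a space. A production detector would likely weigh more signals than this.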

License