deformat 0.1.0

Extract plain text from HTML, PDF, and other document formats

NER engines, LLM pipelines, and search indexers need plain text. deformat sits upstream: it takes formatted documents and returns clean text. No network I/O -- it operates on &str and &[u8] inputs.

Quick start

use deformat::{extract, Format};

// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");

Feature flags

All features are opt-in. The default build depends only on once_cell and regex.

Feature      Crate         What it adds
readability  dom_smoothie  Mozilla Readability article extraction
html2text    html2text     DOM-based HTML-to-text with layout awareness
pdf          pdf-extract   PDF text extraction

[dependencies]
deformat = { version = "0.1", features = ["readability", "html2text"] }

HTML extraction

Three strategies, from simplest to most capable:

  1. html::strip_to_text (always available) -- fast regex/char-based tag stripping with HTML entity decoding, semantic element filtering (<nav>, <header>, <footer>, <aside>), and Wikipedia boilerplate removal.

  2. extract_html2text (feature html2text) -- DOM-based conversion that preserves layout structure (tables, lists, indentation).

  3. extract_readable (feature readability) -- Mozilla Readability algorithm that extracts the main article content, stripping navigation, sidebars, and boilerplate. Falls back to strip_to_text if extraction produces insufficient content.
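
To make the first strategy concrete, here is a minimal, std-only sketch of char-based tag stripping with entity decoding and semantic element filtering. It is not deformat's implementation: the helper name strip_tags, the BLOCK tag list, and the handful of decoded entities are assumptions chosen for illustration, and it ignores comments, scripts, and Wikipedia boilerplate.

```rust
/// Elements dropped together with their contents (the four named above).
const SKIPPED: [&str; 4] = ["nav", "header", "footer", "aside"];
/// Tags treated as word boundaries. This short list is an assumption.
const BLOCK: [&str; 6] = ["p", "div", "br", "li", "h1", "h2"];

/// Hypothetical helper: strips tags, then decodes a few common entities.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut skip_depth = 0usize; // > 0 while inside a skipped element
    let mut rest = html;

    while let Some(open) = rest.find('<') {
        if skip_depth == 0 {
            out.push_str(&rest[..open]);
        }
        // An unterminated tag swallows the remainder of the input.
        let close = match rest[open..].find('>') {
            Some(i) => i,
            None => {
                rest = "";
                break;
            }
        };
        let tag = &rest[open + 1..open + close];
        let closing = tag.starts_with('/');
        let name = tag
            .trim_start_matches('/')
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect::<String>()
            .to_ascii_lowercase();

        if SKIPPED.contains(&name.as_str()) {
            if closing {
                skip_depth = skip_depth.saturating_sub(1);
            } else if !tag.ends_with('/') {
                skip_depth += 1;
            }
        } else if skip_depth == 0 && BLOCK.contains(&name.as_str()) {
            out.push(' ');
        }
        rest = &rest[open + close + 1..];
    }
    if skip_depth == 0 {
        out.push_str(rest);
    }

    // Decode after stripping so an escaped "&lt;b&gt;" survives as text.
    // "&amp;" goes last so "&amp;lt;" becomes "&lt;" rather than "<".
    let decoded = out
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&nbsp;", " ")
        .replace("&amp;", "&");

    // Collapse whitespace runs left behind by removed markup.
    decoded.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With this sketch, strip_tags("<p>Hello <b>world</b>!</p>") yields "Hello world!", and anything inside <nav>...</nav> is dropped. Stripping before decoding is the important ordering: decoding first would turn escaped markup into real tags and then delete it.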

Format detection

use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;

assert!(is_html("<!DOCTYPE html><html>..."));
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);
assert_eq!(detect_path("report.pdf"), Format::Pdf);
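
For readers curious how this kind of detection can work, the following std-only sketch combines magic-byte sniffing with a file-extension lookup. It is an illustration under stated assumptions, not the crate's code: Kind, sniff_bytes, sniff_path, the 512-byte prefix, and the list of HTML markers are all hypothetical.

```rust
use std::path::Path;

/// Hypothetical stand-in for a format enum. Not the crate's `Format` type.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Kind {
    Html,
    Pdf,
    PlainText,
}

/// Content sniffing: PDF files begin with the magic bytes "%PDF-".
/// HTML is guessed from a doctype, an <html> tag, or a few common tags.
fn sniff_bytes(bytes: &[u8]) -> Kind {
    if bytes.starts_with(b"%PDF-") {
        return Kind::Pdf;
    }
    // Inspect only a short prefix so large inputs stay cheap to classify.
    let head = &bytes[..bytes.len().min(512)];
    let head = String::from_utf8_lossy(head).to_ascii_lowercase();
    let head = head.trim_start();

    // Assumed marker list. A real detector would be more thorough.
    let markers = ["<body", "<div", "<p>", "<p "];
    let looks_like_html = head.starts_with("<!doctype html")
        || head.starts_with("<html")
        || markers.iter().any(|m| head.contains(m));

    if looks_like_html {
        Kind::Html
    } else {
        Kind::PlainText
    }
}

/// Extension lookup: decides from the file name alone, without reading it.
fn sniff_path(path: &str) -> Kind {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("pdf") => Kind::Pdf,
        Some("html") | Some("htm") => Kind::Html,
        _ => Kind::PlainText,
    }
}
```

The two functions answer different questions. sniff_path trusts the name and never touches the contents, so it is cheap but can be fooled by a wrong extension. sniff_bytes looks at the data itself, which is the safer choice when the input arrives without a reliable file name.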

License

MIT OR Apache-2.0