Crate deformat

De-format: extract plain text from HTML, PDF, and other document formats.

NER engines, LLM pipelines, and search indexers need plain text. deformat sits upstream: it takes formatted documents and returns clean text. No I/O – it operates on &str and &[u8] inputs.

§Quick start

use deformat::{extract, Format};

// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);

§Feature flags

All features are opt-in. The default build has one dependency: memchr (SIMD-accelerated byte scanning).

Feature	Crate	What it adds
`readability`	`dom_smoothie`	Mozilla Readability article extraction
`html2text`	`html2text`	DOM-based HTML-to-text with layout awareness
`pdf`	`pdf-extract`	PDF text extraction from file paths

Re-exports§

pub use detect::Format;
pub use error::Error;

Modules§

detect: Format detection from content bytes, strings, and file extensions.
error: Error types for extraction failures.
html: HTML-to-text extraction.
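
The detect module is described above as detecting format from content bytes. As a rough illustration of what content sniffing of that kind can look like, here is a minimal std-only sketch. The `SketchFormat` enum and the `sniff` function are hypothetical names invented for this example. They are not part of deformat's API, and the heuristic is an assumption, not the crate's actual detection logic.

```rust
/// Hypothetical stand-in for a format tag; NOT the crate's `Format` enum.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SketchFormat {
    Html,
    Pdf,
    PlainText,
}

/// Guess a format from the leading bytes of the content.
/// Illustrative heuristic only (an assumption, not deformat's implementation).
fn sniff(bytes: &[u8]) -> SketchFormat {
    // PDF files begin with the magic bytes "%PDF-".
    if bytes.starts_with(b"%PDF-") {
        return SketchFormat::Pdf;
    }

    // Skip leading ASCII whitespace before looking for markup.
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let rest = &bytes[start..];

    // Treat content opening with '<' followed by a letter, '!' or '/' as HTML.
    match rest {
        [b'<', next, ..]
            if next.is_ascii_alphabetic() || *next == b'!' || *next == b'/' =>
        {
            SketchFormat::Html
        }
        _ => SketchFormat::PlainText,
    }
}

fn main() {
    assert_eq!(sniff(b"%PDF-1.7\n"), SketchFormat::Pdf);
    assert_eq!(sniff(b"<p>Hello <b>world</b>!</p>"), SketchFormat::Html);
    assert_eq!(sniff(b"Just plain text."), SketchFormat::PlainText);
}
```

A sketch like this only inspects the first few bytes, so it stays cheap on large inputs. A real detector would likely need more signals, such as the file-extension hints the detect module mentions.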

Structs§

Extracted: Extracted text with metadata about the source document.

Functions§

extract: Extract plain text from content, auto-detecting the format.
extract_as: Extract plain text with an explicit format override.
