Crate deformat

De-format: extract plain text from HTML, PDF, and other document formats.

NER engines, LLM pipelines, and search indexers need plain text. deformat sits upstream of them: it takes formatted documents and returns clean text. It performs no I/O; it operates on &str and &[u8] inputs.

Quick start

use deformat::{extract, Format};

// Auto-detect format and extract text
let result = extract("<p>Hello <b>world</b>!</p>");
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.");
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);

Feature flags

All features are opt-in. The default build has one dependency: memchr (SIMD-accelerated byte scanning).

Feature      | Crate        | What it adds
------------ | ------------ | --------------------------------------------
readability  | dom_smoothie | Mozilla Readability article extraction
html2text    | html2text    | DOM-based HTML-to-text with layout awareness
pdf          | pdf-extract  | PDF text extraction from file paths

Re-exports

pub use detect::Format;
pub use error::Error;

Modules

detect
Format detection from content bytes, strings, and file extensions.
error
Error types for extraction failures.
html
HTML-to-text extraction.
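
To make "format detection from content bytes" concrete, here is a minimal standalone sketch of the general idea: look at the leading bytes for a known signature. It is not the crate's implementation. `SniffedFormat` and `sniff` are hypothetical names invented for this illustration, and the heuristics (the `%PDF-` header, a leading `<` followed by a letter or `!`) are assumptions about what a simple detector might check.

```rust
/// Hypothetical stand-in for a format enum; this is NOT deformat's `Format`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SniffedFormat {
    Pdf,
    Html,
    PlainText,
}

/// Guess a format from the leading bytes of a document (illustrative only).
fn sniff(bytes: &[u8]) -> SniffedFormat {
    // PDF files begin with the signature "%PDF-".
    if bytes.starts_with(b"%PDF-") {
        return SniffedFormat::Pdf;
    }

    // Skip leading ASCII whitespace before looking for markup.
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let head = &bytes[start..];

    // Treat "<" followed by a letter or "!" (e.g. "<p>", "<!DOCTYPE") as HTML.
    let opens_tag = head.first() == Some(&b'<')
        && head
            .get(1)
            .map_or(false, |b| b.is_ascii_alphabetic() || *b == b'!');
    if opens_tag {
        return SniffedFormat::Html;
    }

    // Anything else is assumed to be plain text.
    SniffedFormat::PlainText
}
```

With this sketch, `sniff(b"%PDF-1.7\n")` yields `SniffedFormat::Pdf`, `sniff(b"<p>Hello</p>")` yields `SniffedFormat::Html`, and `sniff(b"a < b")` falls through to `SniffedFormat::PlainText`. A production detector would need to handle many more cases than this.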

Structs

Extracted
Extracted text with metadata about the source document.

Functions

extract
Extract plain text from content, auto-detecting the format.
extract_as
Extract plain text with an explicit format override.
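
The difference between auto-detection and an explicit override can be illustrated with a toy model. The sketch below does not call deformat and does not reproduce its signatures: `OverrideFormat`, `strip_tags`, and `extract_as_sketch` are hypothetical names, and the tag stripper is deliberately naive. It only shows the idea that, with an override, the caller's choice of format decides which extraction path runs.

```rust
/// Hypothetical format tag for this sketch; this is NOT deformat's `Format`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OverrideFormat {
    Html,
    PlainText,
}

/// Naive tag stripper: drops everything between '<' and '>'.
/// A real HTML extractor also deals with entities, scripts, and broken markup.
fn strip_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => out.push(ch),
            _ => {} // character inside a tag: skip it
        }
    }
    out
}

/// Toy dispatcher: the caller names the format, so no detection happens.
fn extract_as_sketch(content: &str, format: OverrideFormat) -> String {
    match format {
        OverrideFormat::Html => strip_tags(content),
        OverrideFormat::PlainText => content.to_string(),
    }
}
```

In this toy model, `extract_as_sketch("<p>Hello <b>world</b>!</p>", OverrideFormat::Html)` returns `"Hello world!"`, while passing `OverrideFormat::PlainText` returns the input unchanged, tags included. Whether the crate's own `extract_as` behaves the same way under a mismatched override is an assumption to check against its documentation.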