hanzo-extract
Content extraction library for Rust with built-in sanitization via hanzo-guard. Extract clean text from web pages and PDF documents with automatic PII redaction and safety filtering.
Features
- Web Extraction: Fetch and extract clean text from web pages with smart content detection
- PDF Extraction: Extract text from PDF files with metadata preservation
- Built-in Sanitization: Optional PII redaction and safety filtering via hanzo-guard
- Async/Await: Non-blocking I/O for high-performance applications
- Configurable: Timeout, redirect handling, content length limits, user agent
Quick Start
use ;
async
Extractors
Web Extractor
Extracts clean text from HTML web pages:
use ;
let config = ExtractorConfig ;
let extractor = new;
let result = extractor.extract.await?;
Features:
- Smart content area detection (article, main, content divs)
- Script/style tag removal
- Whitespace normalization
- Title extraction
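The cleanup steps listed above can be sketched with the standard library alone. This is an illustration of the general technique, not the crate's implementation: the function names `strip_element`, `strip_tags`, `normalize_whitespace`, `extract_title`, and `clean_html` are hypothetical, and a real extractor would rely on a proper HTML parser rather than string scanning.

```rust
/// Remove every `<tag ...>...</tag>` block, matching the tag name case-insensitively.
/// Note: this is a plain prefix match, so "style" would also match "<styles>".
fn strip_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(offset) = lower[pos..].find(&open) {
        let start = pos + offset;
        out.push_str(&html[pos..start]);
        match lower[start..].find(&close) {
            Some(end) => pos = start + end + close.len(),
            None => {
                // Unclosed element: drop everything after the opening tag.
                pos = html.len();
                break;
            }
        }
    }
    out.push_str(&html[pos..]);
    out
}

/// Drop all remaining tags, leaving a space where each tag ended.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Collapse any run of whitespace into a single space and trim both ends.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Return the contents of the first attribute-free `<title>` element, if any.
fn extract_title(html: &str) -> Option<String> {
    let lower = html.to_ascii_lowercase();
    let start = lower.find("<title>")? + "<title>".len();
    let end = start + lower[start..].find("</title>")?;
    Some(normalize_whitespace(&html[start..end]))
}

/// Hypothetical pipeline: remove scripts and styles, strip tags, normalize whitespace.
fn clean_html(html: &str) -> String {
    let without_scripts = strip_element(html, "script");
    let without_styles = strip_element(&without_scripts, "style");
    normalize_whitespace(&strip_tags(&without_styles))
}
```

Calling `clean_html` on a fragment such as `<p>Hello,\n   world</p><script>alert(1)</script>` leaves only the paragraph text with its whitespace collapsed. Content area detection is deliberately left out of this sketch.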
PDF Extractor
Extracts text from PDF documents:
use ;
let extractor = default;
// From file path
let result = extractor.extract.await?;
// From URL (requires 'web' feature)
let result = extractor.extract.await?;
println!;
println!;
Features:
- Page-by-page text extraction
- PDF metadata extraction (title, author)
- URL fetching support
- Whitespace normalization
Sanitized Extraction
Enable the sanitize feature for automatic PII redaction:
hanzo-extract = { version = "0.1", features = ["sanitize"] }
use ;
let extractor = default;
// Extract with automatic sanitization
let result = extractor.extract_sanitized.await?;
if result.sanitized
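To give a feel for what redaction does to extracted text, here is a minimal standard-library sketch that masks email-like tokens. It is an assumption-laden illustration only: the detection rule, the `[REDACTED_EMAIL]` placeholder, and the names `Sanitized`, `looks_like_email`, and `redact_emails` are hypothetical and say nothing about how hanzo-guard actually detects or replaces PII.

```rust
/// Result of the sketch: the rewritten text plus how many tokens were masked.
struct Sanitized {
    text: String,
    redactions: usize,
}

/// Very rough heuristic: `local@domain.tld` once surrounding punctuation is trimmed.
fn looks_like_email(token: &str) -> bool {
    let trimmed = token.trim_matches(|c: char| !c.is_alphanumeric());
    match trimmed.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty() && domain.contains('.') && !domain.starts_with('.')
        }
        None => false,
    }
}

/// Replace each email-like token with a placeholder.
/// Whitespace is collapsed to single spaces, and punctuation attached to a
/// redacted token is replaced along with it.
fn redact_emails(text: &str) -> Sanitized {
    let mut redactions = 0;
    let words: Vec<String> = text
        .split_whitespace()
        .map(|token| {
            if looks_like_email(token) {
                redactions += 1;
                "[REDACTED_EMAIL]".to_string()
            } else {
                token.to_string()
            }
        })
        .collect();
    Sanitized {
        text: words.join(" "),
        redactions,
    }
}
```

A caller can check `redactions > 0` to decide whether anything was changed, which plays a similar role to the `sanitized` flag on the extraction result.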
Configuration
use ExtractorConfig;
let config = ExtractorConfig ;
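The configurable options named under Features (timeout, redirect handling, content length limits, user agent) map naturally onto a struct with a `Default` implementation. The sketch below shows that pattern with a stand-in type. `SketchConfig`, its field names, and its default values are all assumptions made for illustration and may not match the real `ExtractorConfig`.

```rust
use std::time::Duration;

/// Hypothetical stand-in for an extractor configuration.
#[derive(Debug, Clone, PartialEq)]
struct SketchConfig {
    timeout: Duration,
    follow_redirects: bool,
    max_content_length: usize,
    user_agent: String,
}

impl Default for SketchConfig {
    fn default() -> Self {
        Self {
            // All defaults here are illustrative, not the crate's values.
            timeout: Duration::from_secs(30),
            follow_redirects: true,
            max_content_length: 10 * 1024 * 1024,
            user_agent: "sketch-extractor/0.1".to_string(),
        }
    }
}

/// Override two fields and take the rest from `Default` via struct update syntax.
fn strict_config() -> SketchConfig {
    SketchConfig {
        timeout: Duration::from_secs(5),
        max_content_length: 1024 * 1024,
        ..Default::default()
    }
}
```

Struct update syntax keeps call sites short and means new fields with sensible defaults can be added without touching existing callers.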
Feature Flags
| Feature | Default | Description |
|---|---|---|
| web | Yes | Web page extraction with reqwest |
| pdf | Yes | PDF extraction with lopdf |
| sanitize | Yes | PII redaction via hanzo-guard |
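# Web only
hanzo-extract = { version = "0.1", default-features = false, features = ["web"] }
# PDF only
hanzo-extract = { version = "0.1", default-features = false, features = ["pdf"] }
# No sanitization
hanzo-extract = { version = "0.1", default-features = false, features = ["web", "pdf"] }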
Extraction Result
Error Handling
use ;
match extractor.extract.await
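Matching on an extraction error usually means separating failures worth retrying from those that are not. The following sketch shows that pattern with a hypothetical error enum. `SketchError`, its variants, and the helpers `is_retryable` and `check_length` are assumptions for illustration and are not the crate's actual error type.

```rust
use std::fmt;

/// Hypothetical error type covering the failure modes an extractor might report.
#[derive(Debug)]
enum SketchError {
    Network(String),
    Timeout,
    ContentTooLarge { size: usize, limit: usize },
    Parse(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Network(msg) => write!(f, "network error: {}", msg),
            SketchError::Timeout => write!(f, "request timed out"),
            SketchError::ContentTooLarge { size, limit } => {
                write!(f, "content is {} bytes, limit is {}", size, limit)
            }
            SketchError::Parse(msg) => write!(f, "parse error: {}", msg),
        }
    }
}

impl std::error::Error for SketchError {}

/// Transient failures may succeed on a second attempt; the others will not.
fn is_retryable(err: &SketchError) -> bool {
    matches!(err, SketchError::Network(_) | SketchError::Timeout)
}

/// Enforce a content length limit, returning the size when it is acceptable.
fn check_length(size: usize, limit: usize) -> Result<usize, SketchError> {
    if size > limit {
        Err(SketchError::ContentTooLarge { size, limit })
    } else {
        Ok(size)
    }
}
```

Keeping the retry decision in one helper avoids repeating the same `match` arms at every call site.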
Performance
| Operation | Latency | Notes |
|---|---|---|
| Web fetch + extract | ~100-500ms | Network dependent |
| HTML parsing | ~1-5ms | Content size dependent |
| PDF extraction | ~10-50ms | Page count dependent |
| Sanitization | ~100μs | Via hanzo-guard |
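Since the figures above depend on network conditions, content size, and page count, it can be worth measuring on your own inputs. A minimal timing helper using only the standard library might look like this. The name `time_it` is hypothetical, and a single wall-clock sample is only a rough indication compared with a proper benchmark harness.

```rust
use std::time::{Duration, Instant};

/// Run a closure once and return its result together with the elapsed wall-clock time.
fn time_it<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}
```

Wrapping a call as `time_it(|| some_work())` returns both the value and a `Duration`, which can then be logged or compared against the approximate ranges in the table.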
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Related
- hanzo-guard - LLM I/O sanitization layer
- Zen Guard - ML-based safety classification