§Hanzo Extract
Content extraction with built-in sanitization via hanzo-guard.
This crate provides utilities for extracting text content from various sources (web pages, PDFs, etc.) and sanitizing the output for safe use with LLMs.
§Features
- Web Extraction: Fetch and extract clean text from web pages
- PDF Extraction: Extract text from PDF documents
- Sanitization: Automatic PII redaction and injection detection via hanzo-guard
§Example
use hanzo_extract::{WebExtractor, ExtractorConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a web extractor with the default configuration.
    let extractor = WebExtractor::new(ExtractorConfig::default());

    // Fetch the page and extract its text.
    let result = extractor.extract("https://example.com").await?;
    println!("Extracted: {}", result.text);
    Ok(())
}
§Architecture
┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Source    │ ──► │  Extractor   │ ──► │   Hanzo Guard   │
│  (URL/PDF)  │     │ (Text Parse) │     │ (Sanitization)  │
└─────────────┘     └──────────────┘     └─────────────────┘
                                                  │
                                                  ▼
                                         ┌─────────────────┐
                                         │  Clean Output   │
                                         │   (LLM-Ready)   │
                                         └─────────────────┘
Re-exports§
pub use config::ExtractorConfig;
pub use error::ExtractError;
pub use error::Result;
pub use result::ExtractResult;
pub use web::WebExtractor;
pub use pdf::PdfExtractor;
Modules§
- config
- Extractor configuration
- error
- Error types for content extraction
- pdf
- PDF document content extraction
- result
- Extraction result types
- sanitize
- Content sanitization using hanzo-guard
- web
- Web page content extraction
Traits§
- Extractor
- Common trait for all extractors
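
The extract-then-sanitize flow described under Architecture, together with the idea of one trait shared by all extractors, can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the crate's actual API: `SketchExtractor`, `SketchResult`, `SketchError`, `TagStripper`, and `sanitize` are all hypothetical names, the trait is synchronous (the real `extract` call is async, per the example), and the "sanitizer" is a naive placeholder that masks tokens containing `@` rather than anything hanzo-guard is known to do.

```rust
use std::fmt;

/// Hypothetical error type for this sketch (stands in for `ExtractError`).
#[derive(Debug, PartialEq)]
pub enum SketchError {
    EmptySource,
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::EmptySource => write!(f, "source is empty"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Hypothetical result type; `text` is the only field the docs show
/// on the real `ExtractResult`.
#[derive(Debug, PartialEq)]
pub struct SketchResult {
    pub text: String,
}

/// Placeholder sanitizer (assumption): masks whitespace-separated tokens
/// that contain '@'. Real PII redaction is far more involved.
pub fn sanitize(text: &str) -> String {
    text.split_whitespace()
        .map(|tok| if tok.contains('@') { "[REDACTED]" } else { tok })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Stand-in for a common extractor trait: implementors supply raw text,
/// and the default `extract` method runs it through the sanitizer.
pub trait SketchExtractor {
    fn extract_raw(&self, source: &str) -> Result<String, SketchError>;

    fn extract(&self, source: &str) -> Result<SketchResult, SketchError> {
        let raw = self.extract_raw(source)?;
        Ok(SketchResult { text: sanitize(&raw) })
    }
}

/// Toy extractor: treats `source` as markup and drops anything in `<...>`.
pub struct TagStripper;

impl SketchExtractor for TagStripper {
    fn extract_raw(&self, source: &str) -> Result<String, SketchError> {
        if source.trim().is_empty() {
            return Err(SketchError::EmptySource);
        }
        let mut out = String::new();
        let mut in_tag = false;
        for ch in source.chars() {
            match ch {
                '<' => in_tag = true,
                '>' => {
                    in_tag = false;
                    out.push(' '); // keep words on either side of a tag apart
                }
                c if !in_tag => out.push(c),
                _ => {} // character inside a tag: skip it
            }
        }
        Ok(out)
    }
}
```

Putting sanitization in the trait's default method means every extractor in this sketch passes through the same cleaning step, which mirrors the single Hanzo Guard stage in the diagram. Calling `TagStripper.extract("<p>Mail bob@example.com now</p>")` yields a result whose `text` has the tags removed and the address masked.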