Crate hanzo_extract

§Hanzo Extract

Content extraction with built-in sanitization via hanzo-guard.

This crate provides utilities for extracting text content from various sources (web pages, PDFs, etc.) and sanitizing the output for safe use with LLMs.

§Features

  • Web Extraction: Fetch and extract clean text from web pages
  • PDF Extraction: Extract text from PDF documents
  • Sanitization: Automatic PII redaction and injection detection via hanzo-guard

§Example

use hanzo_extract::{WebExtractor, ExtractorConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a web extractor with the default configuration.
    let extractor = WebExtractor::new(ExtractorConfig::default());
    // Fetch the page and extract its text; errors propagate via `?`.
    let result = extractor.extract("https://example.com").await?;
    println!("Extracted: {}", result.text);
    Ok(())
}

§Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Source    │ ──► │  Extractor   │ ──► │  Hanzo Guard    │
│ (URL/PDF)   │     │ (Text Parse) │     │ (Sanitization)  │
└─────────────┘     └──────────────┘     └─────────────────┘
                                                  │
                                                  ▼
                                         ┌─────────────────┐
                                         │  Clean Output   │
                                         │ (LLM-Ready)     │
                                         └─────────────────┘
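
The flow in the diagram can be illustrated with a minimal, std-only sketch. Everything here is an assumption made for illustration: `extract_text`, `sanitize`, and `looks_like_email` are hypothetical names, not part of hanzo_extract or hanzo-guard, and the tag stripping and email redaction are deliberately naive stand-ins for the real extractor and guard stages.

```rust
// Hypothetical sketch of Source -> Extractor -> Guard -> Clean Output.
// None of these functions come from hanzo_extract or hanzo-guard.

/// Extractor stage (naive): drop everything between '<' and '>',
/// then collapse runs of whitespace into single spaces.
fn extract_text(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' '); // keep words on either side of a tag apart
            }
            _ if !in_tag => out.push(ch),
            _ => {}
        }
    }
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Very rough email check: non-empty local part, and a dot in the domain.
fn looks_like_email(tok: &str) -> bool {
    match tok.split_once('@') {
        Some((local, domain)) => !local.is_empty() && domain.contains('.'),
        None => false,
    }
}

/// Guard stage (naive): replace email-like tokens with a placeholder.
fn sanitize(text: &str) -> String {
    text.split_whitespace()
        .map(|tok| if looks_like_email(tok) { "[REDACTED]" } else { tok })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let html = "<p>Contact <b>alice@example.com</b> today.</p>";
    let clean = sanitize(&extract_text(html));
    println!("{clean}");
}
```

Keeping the two stages as separate functions mirrors the diagram: the extractor only produces text, and sanitization is applied to that text before anything reaches the caller.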

Re-exports§

pub use config::ExtractorConfig;
pub use error::ExtractError;
pub use error::Result;
pub use result::ExtractResult;
pub use web::WebExtractor;
pub use pdf::PdfExtractor;

Modules§

config: Extractor configuration
error: Error types for content extraction
pdf: PDF document content extraction
result: Extraction result types
sanitize: Content sanitization using hanzo-guard
web: Web page content extraction

Traits§

Extractor: Common trait for all extractors
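
To show why a shared trait is useful, here is a small sketch of calling different extractors through one interface. The signature of the real `Extractor` trait is not shown on this page, so the `TextExtractor` trait below is a hypothetical, synchronous stand-in; the actual trait may well be async and return the crate's own `Result` type. `Lines`, `Markup`, and `run` are likewise invented for illustration.

```rust
/// Hypothetical stand-in for a common extractor trait.
/// This is NOT the signature of `hanzo_extract::Extractor`.
trait TextExtractor {
    fn name(&self) -> &'static str;
    fn extract(&self, input: &str) -> Result<String, String>;
}

/// Treats the input as plain text and joins its non-empty lines.
struct Lines;

/// Treats the input as markup and keeps only the text outside tags.
struct Markup;

impl TextExtractor for Lines {
    fn name(&self) -> &'static str {
        "lines"
    }

    fn extract(&self, input: &str) -> Result<String, String> {
        let lines: Vec<&str> = input
            .lines()
            .map(str::trim)
            .filter(|l| !l.is_empty())
            .collect();
        if lines.is_empty() {
            Err("empty input".to_string())
        } else {
            Ok(lines.join(" "))
        }
    }
}

impl TextExtractor for Markup {
    fn name(&self) -> &'static str {
        "markup"
    }

    fn extract(&self, input: &str) -> Result<String, String> {
        let mut parts = Vec::new();
        for (i, chunk) in input.split('<').enumerate() {
            // After the first chunk, each piece starts inside a tag;
            // keep only what follows the closing '>'.
            let text = if i == 0 {
                chunk
            } else {
                chunk.split_once('>').map_or("", |(_, rest)| rest)
            };
            if !text.trim().is_empty() {
                parts.push(text.trim());
            }
        }
        if parts.is_empty() {
            Err("no text found".to_string())
        } else {
            Ok(parts.join(" "))
        }
    }
}

/// Caller code depends only on the trait, not on a concrete extractor.
fn run(extractor: &dyn TextExtractor, input: &str) -> String {
    match extractor.extract(input) {
        Ok(text) => format!("[{}] {}", extractor.name(), text),
        Err(e) => format!("[{}] error: {}", extractor.name(), e),
    }
}

fn main() {
    let extractors: Vec<Box<dyn TextExtractor>> = vec![Box::new(Lines), Box::new(Markup)];
    for e in &extractors {
        println!("{}", run(e.as_ref(), "<h1>Title</h1><p>Body</p>"));
    }
}
```

Because `run` takes `&dyn TextExtractor`, a new source type only needs a new implementation of the trait; the calling code stays unchanged.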