Crate halldyll_core

§Halldyll Core

High-performance async web scraping engine designed for AI data collection.

§Features

  • Async HTTP Fetching: Connection pooling, compression, retries with exponential backoff
  • Crawl Management: URL normalization (RFC 3986), frontier scheduling, deduplication
  • Politeness: robots.txt (RFC 9309), adaptive rate limiting per domain
  • Content Extraction: Text, links, images, videos, structured data (JSON-LD, OpenGraph)
  • Security: SSRF protection, domain allowlists, resource limits
  • Storage: WARC (ISO 28500), snapshots with content hashing
  • Observability: Structured logging, metrics, distributed tracing
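The retry strategy named above (exponential backoff) can be sketched with the standard library alone. This is an illustrative model, not the fetch module's actual policy; the `backoff_delay` helper and its parameters are hypothetical:

```rust
use std::time::Duration;

/// Delay before the `attempt`-th retry (0-based): `base * 2^attempt`,
/// capped at `max`. Hypothetical helper; production backoff usually also
/// adds random jitter to avoid synchronized retry storms.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let exp = base.saturating_mul(2u32.saturating_pow(attempt));
    exp.min(max)
}

fn main() {
    let base = Duration::from_millis(250);
    let max = Duration::from_secs(10);
    for attempt in 0..6 {
        // 250ms, 500ms, 1s, 2s, 4s, 8s — then capped at 10s.
        println!("retry {attempt}: {:?}", backoff_delay(base, max, attempt));
    }
}
```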

§Example

use halldyll_core::{Orchestrator, Config};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build an orchestrator from the default configuration.
    let config = Config::default();
    let orchestrator = Orchestrator::new(config)?;

    // Fetch, parse, and extract a single page.
    let url = Url::parse("https://example.com")?;
    let result = orchestrator.scrape(&url).await?;

    println!("Title: {:?}", result.document.title);
    println!("Text length: {}", result.document.main_text.len());
    Ok(())
}
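SSRF protection (listed under Security) typically means refusing to fetch URLs whose hosts resolve to loopback, private, or link-local addresses. A minimal std-only sketch of that kind of check; the `is_forbidden_ip` helper is hypothetical and the crate's actual rules (allowlists, resource limits) are richer:

```rust
use std::net::IpAddr;

/// Reject addresses that point at internal infrastructure. Hypothetical
/// helper illustrating the shape of an SSRF guard, not the security
/// module's real implementation.
fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private() || v4.is_loopback() || v4.is_link_local() || v4.is_unspecified()
        }
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}

fn main() {
    for addr in ["192.168.1.10", "10.0.0.5", "127.0.0.1", "93.184.216.34"] {
        let ip: IpAddr = addr.parse().unwrap();
        println!("{addr}: forbidden = {}", is_forbidden_ip(ip));
    }
}
```

Note that a real guard must apply this check to every address DNS returns (and on every redirect), since an attacker-controlled hostname can resolve to a private IP.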

Re-exports§

pub use types::Document;
pub use types::Assets;
pub use types::Provenance;
pub use types::Error;
pub use types::Config;
pub use types::error::Result;
pub use orchestrator::Orchestrator;
pub use fetch::CircuitBreaker;
pub use fetch::CircuitBreakerConfig;
pub use observe::HealthChecker;
pub use observe::HealthResponse;
pub use observe::HealthStatus;
pub use observe::HealthMetrics;
pub use observe::PrometheusExporter;
pub use observe::MetricsCollector;
pub use observe::MetricsSnapshot;
pub use observe::GracefulShutdown;
pub use observe::ShutdownResult;

Modules§

crawl
Crawl - Frontier, normalization, deduplication
fetch
HTTP Fetcher - Robust web page fetching
observe
Observe - Logs, metrics, traces, health checks, graceful shutdown
orchestrator
Orchestrator - Main scraper orchestration with full component integration
parse
Parse - Content extraction (text, links, images, videos, metadata)
politeness
Politeness - Robots.txt and throttling
render
Render - JS rendering decision and headless browser integration
security
Security - Validation and protection
sitemap
Sitemap - Sitemap parsing
storage
Storage - Storage of snapshots and normalized documents
types
Core data types for the scraper
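The crawl module's frontier deduplication can be pictured as a seen-set keyed on normalized URLs. A toy std-only sketch; the `without_fragment` helper is hypothetical, and the crate applies full RFC 3986 normalization rather than only stripping the fragment as done here:

```rust
use std::collections::HashSet;

/// Fragments never affect the fetched resource, so two URLs differing
/// only in `#fragment` are duplicates. Hypothetical, minimal stand-in
/// for the crawl module's URL normalization.
fn without_fragment(url: &str) -> &str {
    url.split('#').next().unwrap_or(url)
}

fn main() {
    let mut seen: HashSet<String> = HashSet::new();
    let urls = [
        "https://example.com/page",
        "https://example.com/page#section-2",
        "https://example.com/other",
    ];
    for u in urls {
        // `insert` returns true only the first time a key is seen.
        if seen.insert(without_fragment(u).to_string()) {
            println!("enqueue {u}");
        } else {
            println!("skip {u} (duplicate)");
        }
    }
}
```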