Crate halldyll_core

§Halldyll Core

High-performance async web scraping engine designed for AI data collection.

§Features

  • Async HTTP Fetching: Connection pooling, compression, retries with exponential backoff
  • Crawl Management: URL normalization (RFC 3986), frontier scheduling, deduplication
  • Politeness: robots.txt (RFC 9309), adaptive rate limiting per domain
  • Content Extraction: Text, links, images, videos, structured data (JSON-LD, OpenGraph)
  • Security: SSRF protection, domain allowlists, resource limits
  • Storage: WARC (ISO 28500), snapshots with content hashing
  • Observability: Structured logging, metrics, distributed tracing
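The retry strategy named above (exponential backoff) can be sketched with the standard library alone. This is an illustrative model, not the fetch module's actual policy; the `backoff_delay` helper and its parameters are hypothetical:

```rust
use std::time::Duration;

/// Delay before the `attempt`-th retry (0-based): `base * 2^attempt`,
/// capped at `max`. Hypothetical helper; production backoff usually also
/// adds random jitter to avoid synchronized retry storms.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let exp = base.saturating_mul(2u32.saturating_pow(attempt));
    exp.min(max)
}

fn main() {
    let base = Duration::from_millis(250);
    let max = Duration::from_secs(10);
    for attempt in 0..6 {
        // 250ms, 500ms, 1s, 2s, 4s, 8s — then capped at 10s.
        println!("retry {attempt}: {:?}", backoff_delay(base, max, attempt));
    }
}
```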

§Example

use halldyll_core::{Orchestrator, Config};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build an orchestrator from the default configuration.
    let config = Config::default();
    let orchestrator = Orchestrator::new(config)?;

    // Fetch, parse, and extract a single page.
    let url = Url::parse("https://example.com")?;
    let result = orchestrator.scrape(&url).await?;

    println!("Title: {:?}", result.document.title);
    println!("Text length: {}", result.document.main_text.len());
    Ok(())
}
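SSRF protection (listed under Security) typically means refusing to fetch URLs whose hosts resolve to loopback, private, or link-local addresses. A minimal std-only sketch of that kind of check; the `is_forbidden_ip` helper is hypothetical and the crate's actual rules (allowlists, resource limits) are richer:

```rust
use std::net::IpAddr;

/// Reject addresses that point at internal infrastructure. Hypothetical
/// helper illustrating the shape of an SSRF guard, not the security
/// module's real implementation.
fn is_forbidden_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private() || v4.is_loopback() || v4.is_link_local() || v4.is_unspecified()
        }
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}

fn main() {
    for addr in ["192.168.1.10", "10.0.0.5", "127.0.0.1", "93.184.216.34"] {
        let ip: IpAddr = addr.parse().unwrap();
        println!("{addr}: forbidden = {}", is_forbidden_ip(ip));
    }
}
```

Note that a real guard must apply this check to every address DNS returns (and on every redirect), since an attacker-controlled hostname can resolve to a private IP.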

Re-exports§

pub use types::Document;
pub use types::Assets;
pub use types::Provenance;
pub use types::Error;
pub use types::Config;
pub use types::error::Result;
pub use orchestrator::Orchestrator;
pub use fetch::CircuitBreaker;
pub use fetch::CircuitBreakerConfig;
pub use observe::HealthChecker;
pub use observe::HealthResponse;
pub use observe::HealthStatus;
pub use observe::HealthMetrics;
pub use observe::PrometheusExporter;
pub use observe::MetricsCollector;
pub use observe::MetricsSnapshot;
pub use observe::GracefulShutdown;
pub use observe::ShutdownResult;

Modules§

crawl
Crawl - Frontier, normalization, deduplication
fetch
HTTP Fetcher - Robust web page fetching
observe
Observe - Logs, metrics, traces, health checks, graceful shutdown
orchestrator
Orchestrator - Main scraper orchestration with full component integration
parse
Parse - Content extraction (text, links, images, videos, metadata)
politeness
Politeness - Robots.txt and throttling
render
Render - JS rendering decision and headless browser integration
security
Security - Validation and protection
sitemap
Sitemap - Sitemap parsing
storage
Storage - Storage of snapshots and normalized documents
types
Core data types for the scraper
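The crawl module's frontier deduplication can be pictured as a seen-set keyed on normalized URLs. A toy std-only sketch; the `without_fragment` helper is hypothetical, and the crate applies full RFC 3986 normalization rather than only stripping the fragment as done here:

```rust
use std::collections::HashSet;

/// Fragments never affect the fetched resource, so two URLs differing
/// only in `#fragment` are duplicates. Hypothetical, minimal stand-in
/// for the crawl module's URL normalization.
fn without_fragment(url: &str) -> &str {
    url.split('#').next().unwrap_or(url)
}

fn main() {
    let mut seen: HashSet<String> = HashSet::new();
    let urls = [
        "https://example.com/page",
        "https://example.com/page#section-2",
        "https://example.com/other",
    ];
    for u in urls {
        // `insert` returns true only the first time a key is seen.
        if seen.insert(without_fragment(u).to_string()) {
            println!("enqueue {u}");
        } else {
            println!("skip {u} (duplicate)");
        }
    }
}
```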