§Halldyll Core
High-performance async web scraping engine designed for AI data collection.
§Features
- Async HTTP Fetching: Connection pooling, compression, retries with exponential backoff (see the sketch after this list)
- Crawl Management: URL normalization (RFC 3986), frontier scheduling, deduplication
- Politeness: robots.txt (RFC 9309), adaptive rate limiting per domain
- Content Extraction: Text, links, images, videos, structured data (JSON-LD, OpenGraph)
- Security: SSRF protection, domain allowlists, resource limits
- Storage: WARC (ISO 28500), snapshots with content hashing
- Observability: Structured logging, metrics, distributed tracing
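The retry behaviour listed above follows an exponential-backoff schedule. The sketch below shows the general shape of such a schedule in plain Rust; the 100 ms base delay, 30 s cap, and doubling factor are illustrative assumptions, not the crate's actual defaults.

use std::time::Duration;

/// Delay before retry `attempt` (0-based): double a base delay each attempt,
/// capped at `max`. The base and cap here are illustrative, not crate defaults.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}

fn main() {
    for attempt in 0..6 {
        let delay = backoff_delay(attempt, Duration::from_millis(100), Duration::from_secs(30));
        println!("attempt {attempt}: wait {delay:?}");
    }
}

In practice a small random jitter is usually added to each delay so that many clients retrying against the same host do not synchronize their requests.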
§Example
use halldyll_core::{Orchestrator, Config};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build an orchestrator from the default configuration.
    let config = Config::default();
    let orchestrator = Orchestrator::new(config)?;

    // Fetch a single page and extract its content.
    let url = Url::parse("https://example.com")?;
    let result = orchestrator.scrape(&url).await?;

    println!("Title: {:?}", result.document.title);
    println!("Text length: {}", result.document.main_text.len());
    Ok(())
}

§Re-exports
- pub use types::Document;
- pub use types::Assets;
- pub use types::Provenance;
- pub use types::Error;
- pub use types::Config;
- pub use types::error::Result;
- pub use orchestrator::Orchestrator;
- pub use fetch::CircuitBreaker;
- pub use fetch::CircuitBreakerConfig;
- pub use observe::HealthChecker;
- pub use observe::HealthResponse;
- pub use observe::HealthStatus;
- pub use observe::HealthMetrics;
- pub use observe::PrometheusExporter;
- pub use observe::MetricsCollector;
- pub use observe::MetricsSnapshot;
- pub use observe::GracefulShutdown;
- pub use observe::ShutdownResult;
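fetch::CircuitBreaker and fetch::CircuitBreakerConfig are re-exported above, but their methods are not documented on this page, so the following is only a conceptual sketch of the circuit-breaker pattern they name (consecutive failures open the breaker, a cooldown later admits a trial request); it is not the crate's implementation or API.

use std::time::{Duration, Instant};

// Conceptual circuit breaker, not the crate's API: `threshold` consecutive
// failures open the breaker; after `cooldown` a trial request is admitted.
enum State { Closed, Open { until: Instant } }

struct Breaker {
    state: State,
    failures: u32,
    threshold: u32,
    cooldown: Duration,
}

impl Breaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Breaker { state: State::Closed, failures: 0, threshold, cooldown }
    }

    /// Should the next request be attempted?
    fn allow(&mut self) -> bool {
        match self.state {
            State::Closed => true,
            State::Open { until } if Instant::now() >= until => {
                // Cooldown elapsed: close the breaker and let a trial request through.
                self.state = State::Closed;
                self.failures = 0;
                true
            }
            State::Open { .. } => false,
        }
    }

    /// Report the outcome of a request.
    fn record(&mut self, success: bool) {
        if success {
            self.failures = 0;
        } else {
            self.failures += 1;
            if self.failures >= self.threshold {
                self.state = State::Open { until: Instant::now() + self.cooldown };
            }
        }
    }
}

A fetcher built on this pattern would consult allow() before each request and call record() with the outcome; the actual fields and defaults of CircuitBreakerConfig are not shown here.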
§Modules
- crawl: Frontier scheduling, URL normalization, deduplication (see the normalization sketch after this list)
- fetch: Robust HTTP web page fetching
- observe: Logs, metrics, traces, health checks, graceful shutdown
- orchestrator: Main scraper orchestration with full component integration
- parse: Content extraction (text, links, images, videos, metadata)
- politeness: robots.txt handling and throttling
- render: JS rendering decision and headless browser integration
- security: Validation and protection
- sitemap: Sitemap parsing
- storage: Storage of snapshots and normalized documents
- types: Core data types for the scraper
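The crawl module above lists RFC 3986 URL normalization as part of deduplication. The sketch below shows what such normalization typically involves, using the url crate that already appears in the example: parsing lowercases the scheme and host and drops default ports, and the function additionally strips the fragment and sorts query parameters. Whether crawl's normalizer performs exactly these steps is an assumption.

use url::Url;

/// Illustrative URL normalization for deduplication; not necessarily what
/// `crawl` does. Parsing already lowercases scheme/host and removes default
/// ports; this also strips the fragment and sorts query parameters.
fn normalize(input: &str) -> Result<Url, url::ParseError> {
    let mut url = Url::parse(input)?;
    url.set_fragment(None);

    let mut pairs: Vec<(String, String)> = url
        .query_pairs()
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();
    if pairs.is_empty() {
        url.set_query(None);
    } else {
        pairs.sort();
        url.query_pairs_mut().clear().extend_pairs(pairs);
    }
    Ok(url)
}

fn main() -> Result<(), url::ParseError> {
    let normalized = normalize("HTTPS://Example.COM:443/a?b=2&a=1#frag")?;
    println!("{normalized}"); // https://example.com/a?a=1&b=2
    Ok(())
}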