//! # Halldyll Core
//!
//! High-performance async web scraping engine designed for AI data collection.
//!
//! ## Features
//!
//! - **Async HTTP Fetching**: Connection pooling, compression, retries with exponential backoff
//! - **Crawl Management**: URL normalization (RFC 3986), frontier scheduling, deduplication
//! - **Politeness**: robots.txt (RFC 9309), adaptive rate limiting per domain
//! - **Content Extraction**: Text, links, images, videos, structured data (JSON-LD, OpenGraph)
//! - **Security**: SSRF protection, domain allowlists, resource limits
//! - **Storage**: WARC (ISO 28500), snapshots with content hashing
//! - **Observability**: Structured logging, metrics, distributed tracing
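//!
//! The retry policy above backs off exponentially between attempts. A minimal
//! sketch of such a schedule (the base delay and cap here are illustrative,
//! not the crate's actual defaults):
//!
//! ```rust
//! use std::time::Duration;
//!
//! /// Illustrative backoff: `base * 2^attempt`, capped at `max`.
//! fn backoff_delay(attempt: u32) -> Duration {
//!     let base = Duration::from_millis(500);
//!     let max = Duration::from_secs(30);
//!     base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
//! }
//!
//! assert_eq!(backoff_delay(0), Duration::from_millis(500));
//! assert_eq!(backoff_delay(2), Duration::from_secs(2));
//! assert_eq!(backoff_delay(10), Duration::from_secs(30)); // capped
//! ```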
//!
//! ## Example
//!
//! ```rust,no_run
//! use halldyll_core::{Orchestrator, Config};
//! use url::Url;
//!
//! #[tokio::main]
//! async fn main() -> Result<(), Box<dyn std::error::Error>> {
//!     let config = Config::default();
//!     let orchestrator = Orchestrator::new(config)?;
//!
//!     let url = Url::parse("https://example.com")?;
//!     let result = orchestrator.scrape(&url).await?;
//!
//!     println!("Title: {:?}", result.document.title);
//!     println!("Text length: {}", result.document.main_text.len());
//!     Ok(())
//! }
//! ```
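//!
//! ## Content hashing
//!
//! Snapshot deduplication keys on a hash of the fetched content. The hash the
//! crate uses internally is not shown in this file; the std-only sketch below
//! (with a hypothetical `content_fingerprint` helper) illustrates the idea:
//!
//! ```rust
//! use std::collections::hash_map::DefaultHasher;
//! use std::hash::{Hash, Hasher};
//!
//! /// Hypothetical helper: fingerprint a response body so identical content
//! /// fetched from different URLs can be deduplicated.
//! fn content_fingerprint(body: &[u8]) -> u64 {
//!     let mut hasher = DefaultHasher::new();
//!     body.hash(&mut hasher);
//!     hasher.finish()
//! }
//!
//! let page = content_fingerprint(b"<p>page</p>");
//! assert_eq!(page, content_fingerprint(b"<p>page</p>"));
//! assert_ne!(page, content_fingerprint(b"<p>other</p>"));
//! ```
//!
//! A persisted snapshot store would want a stable cryptographic hash
//! (e.g. SHA-256) rather than `DefaultHasher`, whose output is not guaranteed
//! to stay the same across Rust releases.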
// Re-exports for convenience.
// NOTE: the module paths below are assumptions; the original `pub use` paths
// were truncated in the source. The doc example above relies on `Config`,
// `Result`, and `Orchestrator` being available at the crate root, so adjust
// these paths to the crate's actual module layout.
pub use crate::config::Config;
pub use crate::error::Result;
pub use crate::orchestrator::Orchestrator;