# Halldyll
High-performance async web scraper written in Rust with Python bindings, designed for AI agents and cloud deployments.
## Features
- **Blazing Fast** - Async Tokio runtime, connection pooling, 3-6x faster than Python alternatives
- **Memory Safe** - Zero `unsafe` code, guaranteed by Rust's ownership model
- **Polite Crawling** - RFC 9309 robots.txt compliance, adaptive rate limiting per domain
- **Smart Extraction** - Main text, JSON-LD, OpenGraph, media assets, outbound links
- **Content Dedup** - URL normalization and SimHash-based content deduplication
- **JS Rendering** - Optional Chromium pool for JavaScript-heavy pages
- **Cloud Native** - Kubernetes health probes, Prometheus metrics, graceful shutdown
- **Resilient** - Circuit breakers per domain, exponential backoff with jitter
- **Python Bindings** - Native PyO3 bindings with typed exceptions
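The SimHash-based content deduplication mentioned in the features can be illustrated with a minimal, library-free sketch. This is not Halldyll's actual implementation, just the core idea: near-duplicate texts produce fingerprints that differ in only a few bits.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint: similar texts get nearby hashes."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Identical content always collides; unrelated content lands far apart
a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumps over the lazy dog")
print(hamming(a, b))  # 0
```

In practice a scraper would treat two pages as duplicates when their fingerprints are within a small Hamming-distance threshold (e.g. 3 bits out of 64).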
## Architecture
```text
halldyll/
├── halldyll-core     Orchestration, HTTP client, rate limiting, storage (63 tests)
├── halldyll-parser   HTML parsing, text/link/metadata extraction (220 tests)
├── halldyll-media    Image, video, audio, document extraction (118 tests)
├── halldyll-robots   robots.txt parsing and caching (45 tests)
└── halldyll-python   Python bindings via PyO3 (8 tests)
```

Total: 452 tests passing ✅
## Quick Start

### Rust

Add to your `Cargo.toml`:

```toml
[dependencies]
halldyll-core = "0.1"
tokio = { version = "1", features = ["full"] }
```

A minimal example (the type and method names below are illustrative; check the crate docs for the exact API):

```rust
use halldyll_core::{Config, Scraper}; // assumed item names
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let scraper = Scraper::new(Config::default());
    let page = scraper.scrape(Url::parse("https://example.com")?).await?;
    println!("{}", page.text);
    Ok(())
}
```
### Python

A sketch of the Python API (attribute and function names assumed from the project layout):

```python
import halldyll

# Simple one-liner
result = halldyll.scrape("https://example.com")

# With configuration
config = halldyll.Config.cloud()  # Production-ready settings
result = halldyll.scrape("https://example.com", config=config)
```
## Configuration Presets

| Preset | Use Case | Settings |
|---|---|---|
| `Config::default()` | General use | 2 concurrent/domain, 100ms delay, robots.txt on |
| `Config::cloud()` | Production/AI agents | 1 concurrent/domain, 1s delay, 30s timeout, metrics on |
| `Config::polite()` | Sensitive targets | 1 concurrent/domain, 3s delay, strict limits |
| `Config::fast()` | Dev/testing only | 10 concurrent/domain, no robots.txt ⚠️ |
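The robots.txt behaviour these presets toggle follows RFC 9309. For experimenting with rules locally, Python's standard library offers a rough stand-in (this is not Halldyll's parser):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyBot/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/page"))  # False
print(rp.crawl_delay("MyBot/1.0"))  # 3
```

A compliant crawler both honours the allow/disallow rules and spaces requests to a host by at least the crawl delay, which is what the per-domain rate limiter automates.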
### Custom Configuration (Rust)

```rust
use halldyll_core::Config; // import path assumed

let mut config = Config::default();

// HTTP settings
config.fetch.user_agent = "MyBot/1.0".to_string();
config.fetch.total_timeout_ms = 30_000;
config.fetch.max_retries = 3;

// Politeness
config.politeness.respect_robots_txt = true;
config.politeness.default_delay_ms = 1000;
config.politeness.max_concurrent_per_domain = 2;

// Extraction
config.parse.extract_json_ld = true;
config.parse.extract_images = true;
config.parse.segment_text = true;
config.parse.chunk_size = 1000;

// Security
config.security.block_private_ips = true;
config.security.max_response_size = 10 * 1024 * 1024; // 10 MB
```
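What `segment_text` with `chunk_size = 1000` does can be pictured with a simplified sketch (not the actual segmenter): greedily pack whole words into chunks of at most N characters, so no chunk splits a word.

```python
def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily pack whole words into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than chunk_size becomes its own chunk
            current = word
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("one two three four five", chunk_size=9))
# ['one two', 'three', 'four five']
```

Character-bounded chunks like these map directly onto embedding-model input limits, which is why the same knob reappears in the RAG batch workflow below.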
### Custom Configuration (Python)

A sketch assuming the Python `Config` mirrors the Rust fields above (keyword names are illustrative):

```python
import halldyll

config = halldyll.Config(
    user_agent="MyBot/1.0",
    default_delay_ms=1000,
    respect_robots_txt=True,
    max_concurrent_per_domain=2,
)
result = halldyll.scrape("https://example.com", config=config)
```
## AI Agent Integration

### With LangChain

A sketch using LangChain's `@tool` decorator (the `halldyll` attribute names are assumptions):

```python
from langchain_core.tools import tool
import halldyll

@tool
def scrape_webpage(url: str) -> str:
    """Scrape a webpage and return its content."""
    try:
        result = halldyll.scrape(url)
        return f"Title: {result.title}\n\n{result.text}"
    except Exception as e:
        return f"Failed to scrape {url}: {e}"

# Use in your agent, e.g. tools=[scrape_webpage]
```
### With CrewAI

A sketch of wiring the same tool into CrewAI (import paths and constructor arguments vary between CrewAI versions; treat the names here as assumptions):

```python
from crewai import Agent, Task, Crew
from crewai.tools import tool
import halldyll

@tool("Scrape webpage")
def scrape_webpage(url: str) -> str:
    """Scrape a webpage and return its main text."""
    return halldyll.scrape(url).text

researcher = Agent(
    role="Web researcher",
    goal="Summarize pages",
    backstory="An agent that reads the web.",
    tools=[scrape_webpage],
)
task = Task(
    description="Summarize https://example.com",
    expected_output="A short summary",
    agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
```
### With Azure Agent Framework

A sketch for the Microsoft Agent Framework (client and method names are assumptions; consult the framework docs for your version):

```python
from agent_framework.azure import AzureOpenAIChatClient  # assumed import path
import halldyll

def scrape_webpage(url: str) -> str:
    """Scrape a webpage and extract its content."""
    result = halldyll.scrape(url)
    return result.text

agent = AzureOpenAIChatClient().create_agent(
    instructions="You are a research assistant.",
    tools=[scrape_webpage],
)
reply = await agent.run("Summarize https://example.com")
```
### Batch Processing for RAG

A sketch of preparing scraped pages for a vector database (the batch helper name is an assumption):

```python
import halldyll

urls = ["https://example.com/a", "https://example.com/b"]
config = halldyll.Config.cloud()
results = halldyll.scrape_many(urls, config=config)  # assumed batch API

# Prepare for vector database
documents = [
    {"id": r.url, "text": r.text, "metadata": {"title": r.title}}
    for r in results
]

# Insert into your vector DB (Pinecone, Weaviate, Qdrant, etc.)
```
## Cloud Deployment

### Kubernetes

Note: a Deployment requires a `selector` that matches the pod template labels; both are included below.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: halldyll-scraper
spec:
  replicas: 3
  selector:
    matchLabels:
      app: halldyll-scraper
  template:
    metadata:
      labels:
        app: halldyll-scraper
    spec:
      containers:
        - name: scraper
          image: your-registry/halldyll:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
```
### Health & Metrics Endpoints (Rust)

A sketch of the observability hooks (type and method names are assumed from the module layout; check the `halldyll_core::observe` docs):

```rust
use halldyll_core::observe::{
    HealthChecker, HealthMetrics, MetricsCollector, PrometheusExporter, ShutdownCoordinator,
};
use std::sync::Arc;

// Health checker
let health = HealthChecker::default_config();

// GET /healthz - Liveness probe
let liveness = health.liveness();
// Returns: {"status": "healthy", "uptime_secs": 3600, ...}

// GET /readyz - Readiness probe
let metrics = HealthMetrics::default();
let readiness = health.readiness(&metrics);

// GET /metrics - Prometheus format
let collector = Arc::new(MetricsCollector::new());
let exporter = PrometheusExporter::new(collector);
let prometheus_output = exporter.export();
// Returns: halldyll_requests_total 1234
//          halldyll_success_rate 0.98
//          ...

// Graceful shutdown
let shutdown = ShutdownCoordinator::new();
// On SIGTERM: shutdown.wait_for_completion().await
```
### Circuit Breaker

A sketch of the per-domain circuit breaker (item names assumed; check the crate docs):

```rust
use halldyll_core::fetch::CircuitBreaker;

// Production config: tolerant, slow recovery
let breaker = CircuitBreaker::new(Default::default());

// Before each request
if !breaker.allow_request("example.com") {
    // Circuit is open for this domain: skip or queue the request
}

// After request
match result {
    Ok(_) => breaker.record_success("example.com"),
    Err(_) => breaker.record_failure("example.com"),
}

// Monitor open circuits
let open = breaker.get_open_circuits();
println!("Open circuits: {open:?}");
```
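Between circuit-breaker trips, failed requests are retried with exponential backoff and jitter (per the Features list). The delay schedule looks roughly like this self-contained sketch of "full jitter" backoff, where each retry waits a random time up to a capped, doubling ceiling:

```python
import random

def backoff_delays(base_ms: int = 100, max_retries: int = 3, cap_ms: int = 30_000) -> list[float]:
    """Full-jitter exponential backoff: delay_i ~ Uniform(0, min(cap, base * 2**i))."""
    return [
        random.uniform(0, min(cap_ms, base_ms * 2 ** attempt))
        for attempt in range(max_retries)
    ]

print(backoff_delays(base_ms=100, max_retries=3))
# e.g. [83.1, 151.7, 102.9] - random, but ceilings are 100, 200, 400 ms
```

The jitter spreads retries from many workers over time, so a briefly-unavailable host is not hammered by a synchronized retry wave.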
## Python Exception Handling

The bindings raise typed exceptions; the class names below are illustrative:

```python
import halldyll

try:
    result = halldyll.scrape("https://example.com")
except halldyll.FetchError:
    pass  # Retry with backoff
except halldyll.RateLimitError:
    pass  # Wait and retry
except halldyll.RobotsDisallowedError:
    pass  # Skip this URL
```
## Extraction Capabilities
| Feature | Description |
|---|---|
| Main Text | Boilerplate removal, clean content extraction |
| Title | Page title with fallbacks (og:title, h1) |
| Description | Meta description, og:description |
| JSON-LD | Structured data (Schema.org) |
| OpenGraph | Social media metadata |
| Images | URLs, dimensions, alt text, lazy-load resolution |
| Videos | YouTube, Vimeo, embedded videos |
| Audio | Podcast feeds, audio embeds |
| Links | Internal/external classification, anchor text |
| Canonical URL | Resolved canonical URL |
| Pagination | Next/prev page detection |
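The internal/external link classification in the table can be sketched with the standard library (a simplification of what the parser does): resolve each `href` against the page URL, then compare hosts.

```python
from urllib.parse import urljoin, urlparse

def classify_link(page_url: str, href: str) -> tuple[str, str]:
    """Resolve href against the page URL and label it internal or external."""
    absolute = urljoin(page_url, href)
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return absolute, "internal" if same_host else "external"

print(classify_link("https://example.com/blog/", "/about"))
# ('https://example.com/about', 'internal')
print(classify_link("https://example.com/blog/", "https://other.org/x"))
# ('https://other.org/x', 'external')
```

A production extractor additionally normalizes URLs (lowercased host, stripped fragments, sorted tracking-free query strings) before classification, so equivalent links deduplicate.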
## Advanced Usage

### Standalone Crates
Each crate can be used independently:
```toml
# Just the parser
[dependencies]
halldyll-parser = "0.1"

# Just robots.txt
[dependencies]
halldyll-robots = "0.1"

# Just media extraction
[dependencies]
halldyll-media = "0.1"
```
```rust
// Use the parser standalone (type and method names assumed)
use halldyll_parser::HtmlParser;

let html = r#"<html><body><h1>Hello</h1><p>World</p></body></html>"#;
let parser = HtmlParser::new();
let text = parser.extract_text(html);
let links = parser.extract_links(html);
```
### Custom User Agent Rotation

A sketch of rotating user agents across URLs (the agent strings and surrounding scrape loop are illustrative):

```rust
let agents = vec!["MyBot/1.0 (variant A)", "MyBot/1.0 (variant B)"];

for (i, url) in urls.iter().enumerate() {
    let mut config = Config::default();
    config.fetch.user_agent = agents[i % agents.len()].to_string();
    // ...scrape `url` with this config
}
```
## Performance
| Metric | Halldyll | Scrapy | Playwright |
|---|---|---|---|
| Speed (pages/min) | ~500 | ~150 | ~50 |
| Memory (10K pages) | ~50 MB | ~300 MB | ~800 MB |
| Startup time | <100ms | ~2s | ~5s |
## Testing

```shell
# Run all tests (452 tests)
cargo test --workspace

# Run with output
cargo test --workspace -- --nocapture

# Run specific crate
cargo test -p halldyll-parser
```
## Project Structure

```text
halldyll-scrapper/
├── Cargo.toml                 # Rust workspace
├── crates/
│   ├── halldyll-core/         # Core scraping engine
│   │   └── src/
│   │       ├── fetch/         # HTTP client, circuit breaker
│   │       ├── observe/       # Metrics, health, shutdown
│   │       ├── storage/       # Dedup, content store
│   │       └── types/         # Config, errors
│   ├── halldyll-parser/       # HTML extraction (220 tests)
│   ├── halldyll-media/        # Media extraction (118 tests)
│   ├── halldyll-robots/       # robots.txt (45 tests)
│   └── halldyll-python/       # PyO3 bindings
├── examples/                  # Usage examples
└── README.md
```
## License
MIT License - see LICENSE file.
## Author
Geryan Roy
- Email: geryan.roy@icloud.com
- GitHub: @Mr-soloDev
## Links
- Repository: github.com/Mr-soloDev/halldyll-Scrapper
- Crates.io: crates.io/crates/halldyll-core
- Documentation: docs.rs/halldyll-core