🕷️ Rust Scraper
Production-ready web scraper with Clean Architecture, TUI selector, and sitemap support.
✨ Features
Core
- Async Web Scraping: Multi-threaded with Tokio runtime
- Sitemap Support: Zero-allocation streaming parser
  - Gzip decompression (.xml.gz)
  - Sitemap index recursion (max depth 3)
  - Auto-discovery from robots.txt
- Interactive TUI: Select URLs before downloading
  - Checkbox selection ([✅]/[⬜])
  - Keyboard navigation (↑↓, Space, Enter)
  - Confirmation mode (Y/N)
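The depth-limited sitemap index recursion listed above can be sketched roughly as follows. This is a minimal illustration, not the project's parser: `SitemapDoc`, `collect_urls`, `walk`, and the `fetch` callback are hypothetical names, the callback stands in for download, gzip decoding, and XML parsing, and the sketch assumes the root sitemap counts as depth 0.

```rust
use std::collections::HashSet;

/// Hypothetical parsed sitemap: an index pointing at other sitemaps,
/// or a leaf URL set listing pages.
enum SitemapDoc {
    Index(Vec<String>),
    UrlSet(Vec<String>),
}

/// Depth cap taken from the feature list ("max depth 3").
const MAX_DEPTH: usize = 3;

/// Collects page URLs reachable from `root`, following index entries
/// at most MAX_DEPTH levels below the root (assumed to be depth 0).
fn collect_urls(root: &str, fetch: &dyn Fn(&str) -> Option<SitemapDoc>) -> Vec<String> {
    let mut pages = Vec::new();
    let mut seen = HashSet::new();
    walk(root, 0, fetch, &mut seen, &mut pages);
    pages
}

fn walk(
    url: &str,
    depth: usize,
    fetch: &dyn Fn(&str) -> Option<SitemapDoc>,
    seen: &mut HashSet<String>,
    pages: &mut Vec<String>,
) {
    // Stop at the depth cap, or when this sitemap was already visited (cycle guard).
    if depth > MAX_DEPTH || !seen.insert(url.to_string()) {
        return;
    }
    match fetch(url) {
        Some(SitemapDoc::Index(children)) => {
            for child in &children {
                walk(child, depth + 1, fetch, seen, pages);
            }
        }
        Some(SitemapDoc::UrlSet(urls)) => pages.extend(urls),
        None => {} // unreachable or unparsable sitemap: skip it
    }
}
```

The `seen` set is an addition of this sketch: it keeps a sitemap index that references itself from being walked twice, independently of the depth cap.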
Architecture
- Clean Architecture: Domain → Application → Infrastructure → Adapters
- Error Handling: thiserror for libraries, anyhow for applications
- Dependency Injection: HTTP client, user agents, concurrency config
⚡ Performance
- True Streaming: Constant ~8KB RAM, no OOM
- Zero-Allocation Parsing: quick-xml for sitemaps
- LazyLock Cache: Syntax highlighting (2-10ms → ~0.01ms)
- Bounded Concurrency: Configurable parallel downloads
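The LazyLock cache mentioned above follows a standard-library pattern that can be sketched like this. The keyword set is a hypothetical stand-in for an expensive resource such as syntax-highlighting data, and `INIT_CALLS` exists only to make the one-time initialization observable; neither name comes from the project.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how often the initializer runs (demonstration only).
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical expensive resource, built on first access and then reused.
static KEYWORDS: LazyLock<HashSet<&'static str>> = LazyLock::new(|| {
    INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    HashSet::from(["fn", "let", "match", "struct"])
});

/// The first call pays for building the set; later calls only do a lookup.
fn is_keyword(word: &str) -> bool {
    KEYWORDS.contains(word)
}
```

`std::sync::LazyLock` is stable since Rust 1.80, which lines up with the MSRV stated in this README.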
Security
- SSRF Prevention: URL host comparison (not string contains)
- Windows Safe: Reserved names blocked (CON → CON_safe)
- WAF Bypass Prevention: Chrome 131+ UAs with TTL caching
- RFC 3986 URLs: url::Url::parse() validation
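Two of the checks above can be illustrated with the standard library alone. This is a sketch under assumptions: `same_host` and `sanitize_filename` are hypothetical names, the host strings are assumed to be already extracted by a URL parser, the host policy shown is exact match only, and the reserved list covers the commonly documented Windows device names.

```rust
/// Commonly documented Windows device names, reserved regardless of case.
const RESERVED: [&str; 22] = [
    "CON", "PRN", "AUX", "NUL",
    "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
    "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
];

/// Compares two already-parsed hosts for equality, ignoring ASCII case.
/// A substring test would wrongly accept "example.com.evil.net".
fn same_host(candidate: &str, allowed: &str) -> bool {
    candidate.eq_ignore_ascii_case(allowed)
}

/// Appends `_safe` to a reserved stem, following the `CON -> CON_safe` example.
fn sanitize_filename(name: &str) -> String {
    // Conservative choice: treat everything before the first dot as the stem.
    let (stem, rest) = match name.find('.') {
        Some(i) => name.split_at(i),
        None => (name, ""),
    };
    if RESERVED.iter().any(|r| r.eq_ignore_ascii_case(stem)) {
        format!("{}_safe{}", stem, rest)
    } else {
        name.to_string()
    }
}
```

For example, `sanitize_filename("con.txt")` yields `con_safe.txt`, while `console.txt` passes through unchanged because only the whole stem is compared.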
📦 Installation
From Source
The binary will be available at target/release/rust_scraper.
From Cargo (coming soon)
Usage
Basic (Headless Mode)
# Scrape all URLs from a website
# With sitemap (auto-discovers from robots.txt)
# Explicit sitemap URL
Interactive Mode (TUI)
# Select URLs interactively before downloading
# With sitemap
TUI Controls
| Key | Action |
|---|---|
| ↑↓ | Navigate URLs |
| Space | Toggle selection |
| A | Select all |
| D | Deselect all |
| Enter | Confirm download |
| Y/N | Final confirmation |
| q | Quit |
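The selection keys in the table map naturally onto a small state machine. The sketch below is an assumption about how such a selector could be modeled, not the project's TUI code: `Key`, `Selector`, and their methods are hypothetical, keys are assumed to be already decoded from terminal events, and the Enter, Y/N, and q keys are left out for brevity.

```rust
/// Subset of the keys from the controls table, already decoded.
#[derive(Clone, Copy)]
enum Key {
    Up,
    Down,
    Space,
    SelectAll,   // "A"
    DeselectAll, // "D"
}

/// Hypothetical selector state: one checkbox per URL plus a cursor.
struct Selector {
    urls: Vec<String>,
    checked: Vec<bool>,
    cursor: usize,
}

impl Selector {
    fn new(urls: Vec<String>) -> Self {
        let checked = vec![false; urls.len()];
        Selector { urls, checked, cursor: 0 }
    }

    /// Applies one key press; the cursor is clamped to the list bounds.
    fn handle(&mut self, key: Key) {
        match key {
            Key::Up => self.cursor = self.cursor.saturating_sub(1),
            Key::Down => {
                if self.cursor + 1 < self.urls.len() {
                    self.cursor += 1;
                }
            }
            Key::Space => {
                if let Some(flag) = self.checked.get_mut(self.cursor) {
                    *flag = !*flag;
                }
            }
            Key::SelectAll => self.checked.iter_mut().for_each(|c| *c = true),
            Key::DeselectAll => self.checked.iter_mut().for_each(|c| *c = false),
        }
    }

    /// Returns the URLs whose checkbox is currently ticked, in list order.
    fn selected(&self) -> Vec<&str> {
        self.urls
            .iter()
            .zip(&self.checked)
            .filter(|(_, c)| **c)
            .map(|(u, _)| u.as_str())
            .collect()
    }
}
```

Keeping the state separate from rendering makes this kind of logic testable without a terminal.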
Advanced Options
# Full example with all options
RAG Export Pipeline (JSONL Format)
Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.
# Export to JSONL (one JSON object per line)
# Resume interrupted scraping (skip already processed URLs)
# Custom state directory (isolate state per project)
JSONL Schema
Each line is a valid JSON object with the following structure:
State Management
- Location: ~/.cache/rust-scraper/state/<domain>.json
- Tracks: Processed URLs, timestamps, status
- Atomic saves: Write to tmp + rename (crash-safe)
- Resume mode: --resume flag enables state tracking
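The "write to tmp + rename" approach behind atomic saves can be sketched with the standard library as shown here. `save_atomically` and the `.json.tmp` suffix are hypothetical choices for illustration; the project's actual file naming is not stated in this README.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `contents` to `path` via a sibling temp file followed by a rename,
/// so readers see either the old file or the complete new one.
fn save_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Same directory as the target, so the rename stays on one filesystem.
    let tmp = path.with_extension("json.tmp");
    {
        let mut file = File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush data to disk before the file appears under its final name.
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}
```

If the process dies before the rename, the previous state file is left untouched and only a stray temp file remains.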
RAG Integration
JSONL format is compatible with:
- Qdrant: Load via Python SDK
- Weaviate: Batch import
- Pinecone: Upsert from JSONL
- LangChain: JSONLoader for document loading
# Example: Load JSONL with LangChain
Get Help
Documentation
- Usage Guide - Detailed examples and troubleshooting
- Architecture - Clean Architecture details
- API Docs - Rust documentation
🧪 Testing
# Run all tests
# Run with output
# Run specific test
Tests: 216 passing ✅
Architecture
Domain (entities, errors)
↓
Application (services, use cases)
↓
Infrastructure (HTTP, parsers, converters)
↓
Adapters (TUI, CLI, detectors)
Dependency Rule: Dependencies point inward. Domain never imports frameworks.
See docs/ARCHITECTURE.md for detailed architecture documentation.
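To make the dependency rule concrete, here is a minimal sketch of how a domain-level trait can be implemented by an outer layer and injected into a use case. All names (`Page`, `FetchError`, `Fetcher`, `scrape_all`, `InMemoryFetcher`) are hypothetical and chosen for illustration; the project's real types are described in docs/ARCHITECTURE.md.

```rust
// --- Domain: plain types and a trait, no framework imports ---
#[derive(Debug, PartialEq)]
struct Page {
    url: String,
    body: String,
}

#[derive(Debug, PartialEq)]
enum FetchError {
    NotFound(String),
}

/// Port that the inner layers depend on instead of a concrete HTTP client.
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<Page, FetchError>;
}

// --- Application: use case written against the trait only ---
fn scrape_all(fetcher: &dyn Fetcher, urls: &[&str]) -> (Vec<Page>, Vec<FetchError>) {
    let mut pages = Vec::new();
    let mut errors = Vec::new();
    for url in urls {
        match fetcher.fetch(url) {
            Ok(page) => pages.push(page),
            Err(err) => errors.push(err),
        }
    }
    (pages, errors)
}

// --- Infrastructure: concrete implementation, injected from outside ---
struct InMemoryFetcher {
    pages: Vec<(String, String)>, // (url, body) pairs standing in for HTTP
}

impl Fetcher for InMemoryFetcher {
    fn fetch(&self, url: &str) -> Result<Page, FetchError> {
        self.pages
            .iter()
            .find(|(u, _)| u.as_str() == url)
            .map(|(u, b)| Page { url: u.clone(), body: b.clone() })
            .ok_or_else(|| FetchError::NotFound(url.to_string()))
    }
}
```

Because `scrape_all` only sees the trait, swapping the in-memory fetcher for a real HTTP client would not touch the domain or application code.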
🔧 Development
Requirements
- Rust 1.80+
- Cargo
Build
# Debug build
# Release build (optimized)
Lint
# Run Clippy (deny warnings)
# Check formatting
Run
# Run in debug mode
# Run in release mode
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
Acknowledgments
- Built with Clean Architecture principles
- Inspired by ripgrep performance patterns
- Uses rust-skills (179 rules)
Stats
- Lines of Code: 4,000+
- Tests: 216 passing
- Coverage: High (state-based testing)
- MSRV: 1.80.0
🗺️ Roadmap
- v1.1.0: Multi-domain crawling
- v1.2.0: JavaScript rendering (headless browser)
- v2.0.0: Distributed scraping
Made with ❤️ using Rust and Clean Architecture