🕷️ Rust Scraper

Production-ready web scraper with Clean Architecture, TUI selector, and sitemap support.

✨ Features

🚀 Core

Async Web Scraping: Multi-threaded with Tokio runtime
Sitemap Support: Zero-allocation streaming parser
- Gzip decompression (.xml.gz)
- Sitemap index recursion (max depth 3)
- Auto-discovery from robots.txt
TUI Interactivo: Select URLs before downloading
- Checkbox selection ([✅] / [⬜])
- Keyboard navigation (↑↓, Space, Enter)
- Confirmation mode (Y/N)

🏗️ Architecture

Clean Architecture: Domain → Application → Infrastructure → Adapters
Error Handling: thiserror for libraries, anyhow for applications
Dependency Injection: HTTP client, user agents, concurrency config

⚡ Performance

True Streaming: Constant ~8KB RAM, no OOM
Zero-Allocation Parsing: quick-xml for sitemaps
LazyLock Cache: Syntax highlighting (2-10ms → ~0.01ms)
Bounded Concurrency: Configurable parallel downloads

🔒 Security

SSRF Prevention: URL host comparison (not string contains)
Windows Safe: Reserved names blocked (CON → CON_safe)
WAF Bypass Prevention: Chrome 131+ UAs with TTL caching
RFC 3986 URLs: url::Url::parse() validation

📦 Installation

From Source

git clone https://github.com/XaviCode1000/rust-scraper.git
cd rust-scraper
cargo build --release

The binary will be available at target/release/rust_scraper.

From Cargo (coming soon)

cargo install rust_scraper

🚀 Usage

Basic (Headless Mode)

# Scrape all URLs from a website
./target/release/rust_scraper --url https://example.com

# With sitemap (auto-discovers from robots.txt)
./target/release/rust_scraper --url https://example.com --use-sitemap

# Explicit sitemap URL
./target/release/rust_scraper --url https://example.com \
  --use-sitemap \
  --sitemap-url https://example.com/sitemap.xml.gz

Interactive Mode (TUI)

# Select URLs interactively before downloading
./target/release/rust_scraper --url https://example.com --interactive

# With sitemap
./target/release/rust_scraper --url https://example.com \
  --interactive \
  --use-sitemap

TUI Controls

Key	Action
`↑↓`	Navigate URLs
`Space`	Toggle selection
`A`	Select all
`D`	Deselect all
`Enter`	Confirm download
`Y/N`	Final confirmation
`q`	Quit

Advanced Options

# Full example with all options
./target/release/rust_scraper \
  --url https://example.com \
  --output ./output \
  --format markdown \
  --download-images \
  --download-documents \
  --use-sitemap \
  --concurrency 5 \
  --delay-ms 1000 \
  --max-pages 100 \
  --verbose

RAG Export Pipeline (JSONL Format)

Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.

# Export to JSONL (one JSON object per line)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data

# Resume interrupted scraping (skip already processed URLs)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --resume

# Custom state directory (isolate state per project)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --state-dir ./state --resume

JSONL Schema

Each line is a valid JSON object with the following structure:

{
  "id": "uuid-v4",
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "metadata": {
    "domain": "example.com",
    "excerpt": "Meta description or excerpt"
  },
  "timestamp": "2026-03-09T10:00:00Z"
}

State Management

Location: ~/.cache/rust-scraper/state/<domain>.json
Tracks: Processed URLs, timestamps, status
Atomic saves: Write to tmp + rename (crash-safe)
Resume mode: --resume flag enables state tracking

RAG Integration

JSONL format is compatible with:

Qdrant: Load via Python SDK
Weaviate: Batch import
Pinecone: Upsert from JSONL
LangChain: JSONLoader for document loading

# Example: Load JSONL with LangChain
from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./rag_data/export.jsonl',
    jq_schema='.content',
    text_content=False
)
documents = loader.load()

Get Help

./target/release/rust_scraper --help

📖 Documentation

Usage Guide - Detailed examples and troubleshooting
Architecture - Clean Architecture details
API Docs - Rust documentation

🧪 Testing

# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url

Tests: 216 passing ✅

🏗️ Architecture

Domain (entities, errors)
    ↓
Application (services, use cases)
    ↓
Infrastructure (HTTP, parsers, converters)
    ↓
Adapters (TUI, CLI, detectors)

Dependency Rule: Dependencies point inward. Domain never imports frameworks.

See docs/ARCHITECTURE.md for detailed architecture documentation.

🔧 Development

Requirements

Rust 1.80+
Cargo

Build

# Debug build
cargo build

# Release build (optimized)
cargo build --release

Lint

# Run Clippy (deny warnings)
cargo clippy -- -D warnings

# Check formatting
cargo fmt --all -- --check

Run

# Run in debug mode
cargo run -- --url https://example.com

# Run in release mode
cargo run --release -- --url https://example.com

📄 License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

at your option.

🙏 Acknowledgments

Built with Clean Architecture principles
Inspired by ripgrep performance patterns
Uses rust-skills (179 rules)

📊 Stats

Lines of Code: ~4000+
Tests: 216 passing
Coverage: High (state-based testing)
MSRV: 1.80.0

🗺️ Roadmap

v1.1.0: Multi-domain crawling
v1.2.0: JavaScript rendering (headless browser)
v2.0.0: Distributed scraping

Made with ❤️ using Rust and Clean Architecture

rust_scraper 1.0.0

🕷️ Rust Scraper

✨ Features

🚀 Core

🏗️ Architecture

⚡ Performance

🔒 Security

📦 Installation

From Source

From Cargo (coming soon)

🚀 Usage

Basic (Headless Mode)

Interactive Mode (TUI)

TUI Controls

Advanced Options

RAG Export Pipeline (JSONL Format)

JSONL Schema

State Management

RAG Integration

Get Help

📖 Documentation

🧪 Testing

🏗️ Architecture

🔧 Development

Requirements

Build

Lint

Run

📄 License

🙏 Acknowledgments

📊 Stats

🗺️ Roadmap