# 🕷️ Rust Scraper
**Production-ready web scraper with Clean Architecture, TUI selector, and sitemap support.**
[CI](https://github.com/XaviCode1000/rust-scraper/actions) · [Repository](https://github.com/XaviCode1000/rust-scraper) · [License](LICENSE) · [Rust](https://www.rust-lang.org) · [Releases](https://github.com/XaviCode1000/rust-scraper/releases)
## ✨ Features
### 🚀 Core
- **Async Web Scraping**: Multi-threaded with the Tokio runtime
- **Sitemap Support**: Zero-allocation streaming parser
  - Gzip decompression (`.xml.gz`)
  - Sitemap index recursion (max depth 3)
  - Auto-discovery from `robots.txt`
- **Interactive TUI**: Select URLs before downloading
  - Checkbox selection (`[✅]` / `[⬜]`)
  - Keyboard navigation (↑/↓, Space, Enter)
  - Confirmation prompt (Y/N)
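The project's streaming sitemap parser is written in Rust with `quick-xml`; purely as an illustration of the idea (sniff the gzip magic bytes, then parse incrementally instead of building a full DOM), here is a Python sketch. It is not the project's actual code:

```python
import gzip
import io
from xml.etree import ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def iter_sitemap_urls(data: bytes):
    """Stream <loc> entries without holding the whole document in memory."""
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes, i.e. a .xml.gz sitemap
        data = gzip.decompress(data)
    for _event, elem in ET.iterparse(io.BytesIO(data)):
        if elem.tag == f"{NS}loc":
            yield elem.text
            elem.clear()  # free the element as soon as it is consumed

xml = b"""<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""
print(list(iter_sitemap_urls(gzip.compress(xml))))
# ['https://example.com/a', 'https://example.com/b']
```

The same function handles plain and gzipped input, mirroring the `.xml.gz` support described above.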
### 🏗️ Architecture
- **Clean Architecture**: Domain → Application → Infrastructure → Adapters
- **Error Handling**: `thiserror` for libraries, `anyhow` for applications
- **Dependency Injection**: HTTP client, user agents, concurrency config
### ⚡ Performance
- **True Streaming**: Constant ~8 KB RAM, no OOM
- **Zero-Allocation Parsing**: `quick-xml` for sitemaps
- **LazyLock Cache**: Syntax highlighting (2-10 ms → ~0.01 ms)
- **Bounded Concurrency**: Configurable parallel downloads
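The scraper bounds its parallel downloads via the Tokio runtime; the same pattern can be sketched in Python with `asyncio.Semaphore` (an illustrative sketch, not the scraper's code — `fetch` here is a stand-in for a real HTTP request):

```python
import asyncio

async def fetch(url: str) -> str:
    # Placeholder for a real HTTP request.
    await asyncio.sleep(0.01)
    return url

async def crawl(urls, concurrency=5):
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` requests in flight

    async def bounded(u):
        async with sem:
            return await fetch(u)

    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(crawl([f"https://example.com/{i}" for i in range(20)]))
print(len(results))  # 20
```

The semaphore caps concurrency without a fixed worker pool: every URL gets a task, but only `concurrency` of them run at once.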
### 🔒 Security
- **SSRF Prevention**: URL host comparison (not string `contains`)
- **Windows Safe**: Reserved file names sanitized (`CON` → `CON_safe`)
- **WAF Bypass Prevention**: Chrome 131+ user agents with TTL caching
- **RFC 3986 URLs**: `url::Url::parse()` validation
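Why exact host comparison matters: a substring check like `"example.com" in url` accepts attacker-controlled hosts such as `example.com.evil.net`. A minimal Python sketch of the idea (illustrative only; the project's check is implemented in Rust with the `url` crate):

```python
from urllib.parse import urlparse

def same_host(candidate: str, allowed_host: str) -> bool:
    """Compare the parsed hostname exactly; substring checks can be
    bypassed with URLs like https://example.com.evil.net/."""
    return urlparse(candidate).hostname == allowed_host

print(same_host("https://example.com.evil.net/", "example.com"))  # False
print(same_host("https://example.com/page", "example.com"))       # True
```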
## 📦 Installation
### From Source
```bash
git clone https://github.com/XaviCode1000/rust-scraper.git
cd rust-scraper
cargo build --release
```
The binary will be available at `target/release/rust_scraper`.
### From Cargo (coming soon)
```bash
cargo install rust_scraper
```
## 🚀 Usage
### Basic (Headless Mode)
```bash
# Scrape all URLs from a website
./target/release/rust_scraper --url https://example.com
# With sitemap (auto-discovers from robots.txt)
./target/release/rust_scraper --url https://example.com --use-sitemap
# Explicit sitemap URL
./target/release/rust_scraper --url https://example.com \
  --use-sitemap \
  --sitemap-url https://example.com/sitemap.xml.gz
```
### Interactive Mode (TUI)
```bash
# Select URLs interactively before downloading
./target/release/rust_scraper --url https://example.com --interactive
# With sitemap
./target/release/rust_scraper --url https://example.com \
  --interactive \
  --use-sitemap
```
### TUI Controls
| Key | Action |
|-----|--------|
| `↑` / `↓` | Navigate URLs |
| `Space` | Toggle selection |
| `A` | Select all |
| `D` | Deselect all |
| `Enter` | Confirm download |
| `Y` / `N` | Final confirmation |
| `q` | Quit |
### Advanced Options
```bash
# Full example with all options
./target/release/rust_scraper \
  --url https://example.com \
  --output ./output \
  --format markdown \
  --download-images \
  --download-documents \
  --use-sitemap \
  --concurrency 5 \
  --delay-ms 1000 \
  --max-pages 100 \
  --verbose
```
### RAG Export Pipeline (JSONL Format)
Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.
```bash
# Export to JSONL (one JSON object per line)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data
# Resume interrupted scraping (skip already processed URLs)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --resume
# Custom state directory (isolate state per project)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --state-dir ./state --resume
```
#### JSONL Schema
Each line is a valid JSON object with the following structure:
```json
{
  "id": "uuid-v4",
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "metadata": {
    "domain": "example.com",
    "excerpt": "Meta description or excerpt"
  },
  "timestamp": "2026-03-09T10:00:00Z"
}
```
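Because each line is an independent JSON object, the export can be consumed one record at a time. A minimal Python sketch using a sample record matching the schema above (illustrative data, not real scraper output):

```python
import io
import json

# One record per line, shaped like the schema above (sample data).
sample = io.StringIO(
    '{"id": "uuid-v4", "url": "https://example.com/page", '
    '"title": "Page Title", "content": "Extracted content...", '
    '"metadata": {"domain": "example.com", "excerpt": "..."}, '
    '"timestamp": "2026-03-09T10:00:00Z"}\n'
)

# Parse each non-empty line independently; a corrupt line
# would only lose that one record, not the whole file.
records = [json.loads(line) for line in sample if line.strip()]
print(records[0]["metadata"]["domain"])  # example.com
```

In a real pipeline, `sample` would be `open('./rag_data/export.jsonl')`.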
#### State Management
- **Location**: `~/.cache/rust-scraper/state/<domain>.json`
- **Tracks**: Processed URLs, timestamps, status
- **Atomic saves**: Write to tmp + rename (crash-safe)
- **Resume mode**: `--resume` flag enables state tracking
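The write-to-tmp-then-rename pattern above can be sketched in Python (illustrative only, not the project's Rust implementation; the function and file names are hypothetical):

```python
import json
import os
import tempfile

def save_state_atomically(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the
    target. os.replace is atomic on POSIX, so a crash leaves either the
    old state file or the new one -- never a half-written file."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)
    except BaseException:
        os.remove(tmp)
        raise

save_state_atomically("example_state.json", {"processed": ["https://example.com/"]})
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within a single filesystem.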
#### RAG Integration
JSONL format is compatible with:
- **Qdrant**: Load via Python SDK
- **Weaviate**: Batch import
- **Pinecone**: Upsert from JSONL
- **LangChain**: `JSONLoader` for document loading
```python
# Example: Load JSONL with LangChain
from langchain.document_loaders import JSONLoader
loader = JSONLoader(
    file_path='./rag_data/export.jsonl',
    jq_schema='.content',
    text_content=False,
    json_lines=True
)
documents = loader.load()
```
### Get Help
```bash
./target/release/rust_scraper --help
```
## 📚 Documentation
- [**Usage Guide**](docs/USAGE.md) - Detailed examples and troubleshooting
- [**Architecture**](docs/ARCHITECTURE.md) - Clean Architecture details
- [**API Docs**](https://docs.rs/rust_scraper) - Rust documentation
## 🧪 Testing
```bash
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test
cargo test test_validate_and_parse_url
```
**Tests:** 216 passing ✅
## 🏗️ Architecture
```
Domain (entities, errors)
    ↑
Application (services, use cases)
    ↑
Infrastructure (HTTP, parsers, converters)
    ↑
Adapters (TUI, CLI, detectors)
```
**Dependency Rule:** Dependencies point inward. Domain never imports frameworks.
See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed architecture documentation.
## 🔧 Development
### Requirements
- Rust 1.80+
- Cargo
### Build
```bash
# Debug build
cargo build
# Release build (optimized)
cargo build --release
```
### Lint
```bash
# Run Clippy (deny warnings)
cargo clippy -- -D warnings
# Check formatting
cargo fmt --all -- --check
```
### Run
```bash
# Run in debug mode
cargo run -- --url https://example.com
# Run in release mode
cargo run --release -- --url https://example.com
```
## 📄 License
Licensed under either of:
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
- MIT license ([LICENSE-MIT](LICENSE-MIT))
at your option.
## 🙏 Acknowledgments
- Built with [Clean Architecture](https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html) principles
- Inspired by [ripgrep](https://github.com/BurntSushi/ripgrep) performance patterns
- Uses [rust-skills](https://github.com/leonardomso/rust-skills) (179 rules)
## 📊 Stats
- **Lines of Code:** ~4,000
- **Tests:** 216 passing
- **Coverage:** High (state-based testing)
- **MSRV:** 1.80.0
## 🗺️ Roadmap
- [ ] v1.1.0: Multi-domain crawling
- [ ] v1.2.0: JavaScript rendering (headless browser)
- [ ] v2.0.0: Distributed scraping
---
**Made with ❤️ using Rust and Clean Architecture**