# Argus Web Crawler

A production-ready web crawler written in Rust, capable of handling billions of URLs with advanced features like content deduplication, distributed crawling, and JavaScript rendering.
## ⚡ Quick Start

### Installation

#### 📦 Cargo (Recommended)

#### 🍺 Homebrew (macOS)

#### 🐧 Snap (Linux)

#### 🍪 Chocolatey (Windows)

#### 🐳 Docker
### Basic Usage

The examples assume the CLI binary is installed as `argus`:

```bash
# Simple crawl
argus crawl https://example.com

# Distributed crawling with Redis
argus crawl https://example.com --redis-url redis://localhost:6379

# JavaScript rendering (build with js-render feature)
cargo run --release --features js-render -- crawl https://example.com
```
## 🚀 Features

### Core Features

- ✅ Robust Error Handling - Automatic retry with exponential backoff
- ✅ Robots.txt Compliance - Full respect for crawl rules
- ✅ Graceful Shutdown - Clean interruption on SIGTERM/SIGINT
- ✅ Rate Limiting - Configurable delays per domain
- ✅ Content Limits - Size limits for HTML, text, and binary content
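As a sketch of the retry policy described above (the 500 ms base delay and 30 s cap are illustrative defaults, not Argus's actual configuration):

```rust
use std::time::Duration;

// Exponential backoff: delay doubles per attempt, capped at cap_ms.
// Base and cap values here are assumptions for illustration only.
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    let cap_ms: u64 = 30_000;
    let delay_ms = base_ms
        .saturating_mul(2u64.saturating_pow(attempt))
        .min(cap_ms);
    Duration::from_millis(delay_ms)
}

fn main() {
    for attempt in 0..6 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt));
    }
}
```

In practice a random jitter is usually added on top, so that many workers retrying the same host do not synchronize.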
### Advanced Features

- 🔍 Content Deduplication - Simhash-based near-duplicate detection
- 🌐 JavaScript Rendering - Headless Chrome support for SPAs
- 📋 Metadata Extraction - Canonical URLs, hreflang, meta tags
- 🗺️ Sitemap Parsing - Auto-discovery and parsing of sitemaps
- 📦 Multiple Storage Backends - File system or S3-compatible storage
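A minimal illustration of how Simhash flags near-duplicates: each token votes on 64 bit positions, and similar documents end up with hashes at a small Hamming distance. This sketch tokenizes on whitespace and uses the standard-library hasher; the real `argus-dedupe` implementation may use shingles and a different token hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit Simhash over whitespace-separated tokens (illustrative sketch).
fn simhash(text: &str) -> u64 {
    let mut weights = [0i64; 64];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.hash(&mut h);
        let hv = h.finish();
        // Each token votes +1/-1 on every bit position.
        for (bit, w) in weights.iter_mut().enumerate() {
            if hv >> bit & 1 == 1 { *w += 1 } else { *w -= 1 }
        }
    }
    // Bits with a positive total vote are set in the final signature.
    weights
        .iter()
        .enumerate()
        .filter(|(_, &w)| w > 0)
        .fold(0u64, |acc, (bit, _)| acc | 1 << bit)
}

/// Near-duplicates have a small Hamming distance between signatures.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    let a = simhash("the quick brown fox jumps over the lazy dog");
    let b = simhash("the quick brown fox jumps over the lazy cat");
    println!("hamming distance = {}", hamming(a, b));
}
```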
### Scalability Features

- 🧠 Bloom Filter Deduplication - 1B URLs in only 1.2GB RAM
- 🌍 Distributed Crawling - Redis-based coordination
- 📨 Redis Streams - High-throughput job distribution
- ☁️ Object Storage - Unlimited scaling with S3/MinIO
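The "1B URLs in 1.2GB" figure is consistent with standard Bloom-filter sizing: 1.2 GB is 9.6 × 10⁹ bits, i.e. about 9.6 bits per URL, which with the optimal 7 hash functions gives roughly a 1% false-positive rate. A quick check of the arithmetic:

```rust
// Bloom filter sizing check for the figures quoted above.
// n = expected elements, bits = filter size in bits.
fn bloom_params(n: f64, bits: f64) -> (f64, f64) {
    // Optimal number of hash functions: k = (m/n) * ln 2
    let k = (bits / n * std::f64::consts::LN_2).round();
    // Resulting false-positive rate: p = (1 - e^(-k*n/m))^k
    let p = (1.0 - (-k * n / bits).exp()).powf(k);
    (k, p)
}

fn main() {
    let (k, p) = bloom_params(1e9, 1.2e9 * 8.0); // 1B URLs, 1.2 GB of bits
    println!("k = {k}, false-positive rate ≈ {:.2}%", p * 100.0);
}
```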
## 📊 Performance
| Metric | Single Node | Distributed (10 nodes) |
|---|---|---|
| URLs/second | 100-1000 | 1000-10000 |
| Memory (1B URLs) | 1.2GB (Bloom) | 1.2GB per node |
| Storage | Local disk | S3 (unlimited) |
| Network | 1 Gbps | 10 Gbps+ |
## 🏗️ Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Frontier     │     │     Fetcher     │     │     Parser      │
│                 │     │                 │     │                 │
│ • URL Queue     │────▶│ • HTTP Client   │────▶│ • HTML Parser   │
│ • Prioritization│     │ • Retry Logic   │     │ • Link Extract  │
│ • Deduplication │     │ • Rate Limit    │     │ • Metadata      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Deduplication  │     │     Storage     │     │   Robots.txt    │
│                 │     │                 │     │                 │
│ • Bloom Filter  │     │ • File System   │     │ • Parser        │
│ • Simhash       │     │ • S3/MinIO      │     │ • Cache         │
│ • Redis         │     │ • Metadata      │     │ • Rules         │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
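The stage boundaries in the diagram map naturally onto channels between concurrent workers. A toy single-process version (the stage names mirror the diagram; the "fetch" step is faked, since a real fetcher would use an HTTP client):

```rust
use std::sync::mpsc;
use std::thread;

// Toy frontier -> fetcher -> parser pipeline mirroring the diagram above.
// Each stage is connected to the next by an mpsc channel.
fn run_pipeline(seeds: Vec<String>) -> Vec<String> {
    let (to_fetch, fetch_rx) = mpsc::channel::<String>();
    let (to_parse, parse_rx) = mpsc::channel::<String>();

    // Frontier: enqueue seed URLs, then close the channel.
    for url in seeds {
        to_fetch.send(url).unwrap();
    }
    drop(to_fetch);

    // Fetcher: pretend to download each URL.
    let fetcher = thread::spawn(move || {
        for url in fetch_rx {
            to_parse.send(format!("<html>{url}</html>")).unwrap();
        }
    });

    // Parser: strip the fake markup back off.
    let parsed: Vec<String> = parse_rx
        .iter()
        .map(|doc| doc.replace("<html>", "").replace("</html>", ""))
        .collect();
    fetcher.join().unwrap();
    parsed
}

fn main() {
    let pages = run_pipeline(vec!["https://example.com/".to_string()]);
    println!("{pages:?}");
}
```

The same shape scales out: in the distributed mode, the channel between frontier and fetchers becomes a Redis Stream instead of an in-process queue.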
## 📚 Documentation
## 💡 Examples

### Basic Crawling

A minimal entry point, assuming `run_crawl` is exported by the CLI crate and a Tokio runtime (the module path and signature are illustrative):

```rust
use argus_cli::run_crawl; // crate path assumed

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    run_crawl("https://example.com").await?;
    Ok(())
}
```
### Distributed Crawling

Crate paths are inferred from the workspace layout; constructor arguments are illustrative:

```rust
use argus_frontier::StreamFrontier;
use argus_dedupe::HybridSeenSet;
use argus_storage::S3Storage;

// Redis Streams for job distribution
let frontier = StreamFrontier::new("redis://localhost:6379").await?;

// Bloom filter + Redis for deduplication
let seen = HybridSeenSet::new("redis://localhost:6379").await?;

// S3 for unlimited storage
let storage = S3Storage::new("argus-data").await?;
```
### JavaScript Rendering

```bash
# Build with JS support
cargo build --release --features js-render

# Crawl SPA sites (binary name and flag illustrative)
./target/release/argus crawl https://spa.example.com --render-js
```
## 🛠️ Development

### Setup

### Features
- `redis` - Enable Redis support (default)
- `s3` - Enable S3 storage
- `js-render` - Enable JavaScript rendering
- `all-features` - Enable everything
```bash
# Build with all features
cargo build --all-features

# Run tests with all features
cargo test --all-features
```
## 📦 Crates
This is a workspace with the following crates:
- argus-cli - Command-line interface
- argus-common - Common types and utilities
- argus-fetcher - HTTP fetching with retry logic
- argus-parser - HTML and sitemap parsing
- argus-dedupe - Content deduplication with Simhash
- argus-storage - Storage backends
- argus-frontier - URL frontier implementations
- argus-robots - Robots.txt parsing
- argus-worker - Worker implementation
- argus-config - Configuration management
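The workspace wiring for the crates above would look roughly like this in the root `Cargo.toml` (a sketch, not the actual manifest):

```toml
[workspace]
members = [
    "argus-cli",
    "argus-common",
    "argus-fetcher",
    "argus-parser",
    "argus-dedupe",
    "argus-storage",
    "argus-frontier",
    "argus-robots",
    "argus-worker",
    "argus-config",
]
```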
## 🐳 Docker

### Basic Usage

```bash
# Pull image
docker pull yourusername/argus:latest

# Run crawl
docker run --rm -v "$(pwd)/data:/data" yourusername/argus:latest crawl https://example.com
```
### With Docker Compose

```yaml
version: '3.8'

services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  argus:
    image: yourusername/argus:latest
    command: crawl --redis-url redis://redis:6379
    volumes:
      - ./data:/data
    depends_on:
      - redis
```
## 🤝 Contributing
Contributions are welcome! Please read our Contributing Guide.
### Quick Start
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Built with Rust
- Inspired by Scrapy and Nutch
- Icons by Feather Icons
## 🔗 Links

Made with ❤️ by the Argus contributors