# Argus Web Crawler
<div align="center">
[](https://crates.io/crates/argus-cli)
[](https://docs.rs/argus-cli)
[](https://github.com/yourusername/argus/actions)
[](https://opensource.org/licenses/MIT)
A production-ready web crawler written in Rust, capable of handling **billions of URLs** with advanced features like content deduplication, distributed crawling, and JavaScript rendering.
</div>
## β‘ Quick Start
### Installation
#### π¦ Cargo (Recommended)
```bash
cargo install argus-cli
```
#### πΊ Homebrew (macOS)
```bash
brew install argus
```
#### π§ Snap (Linux)
```bash
snap install argus
```
#### πͺ Chocolatey (Windows)
```bash
choco install argus
```
#### π³ Docker
```bash
docker run yourusername/argus crawl --seed-url https://example.com
```
### Basic Usage
```bash
# Simple crawl
argus crawl --seed-url https://example.com --storage-dir ./data
# Distributed crawling with Redis
argus crawl --redis-url redis://localhost:6379 --workers 8
# JavaScript rendering (build with js-render feature)
argus crawl --seed-url https://spa-example.com --js-render
```
## π Features
### Core Features
- β
**Robust Error Handling** - Automatic retry with exponential backoff
- β
**Robots.txt Compliance** - Full respect for crawl rules
- β
**Graceful Shutdown** - Clean interruption on SIGTERM/SIGINT
- β
**Rate Limiting** - Configurable delays per domain
- β
**Content Limits** - Size limits for HTML, text, and binary content
### Advanced Features
- π **Content Deduplication** - Simhash-based near-duplicate detection
- π **JavaScript Rendering** - Headless Chrome support for SPAs
- π **Metadata Extraction** - Canonical URLs, hreflang, meta tags
- πΊοΈ **Sitemap Parsing** - Auto-discovery and parsing of sitemaps
- π¦ **Multiple Storage Backends** - File system or S3-compatible storage
### Scalability Features
- π§ **Bloom Filter Deduplication** - 1B URLs in only 1.2GB RAM
- π **Distributed Crawling** - Redis-based coordination
- π **Redis Streams** - High-throughput job distribution
- βοΈ **Object Storage** - Unlimited scaling with S3/MinIO
## π Performance
| URLs/second | 100-1000 | 1000-10000 |
| Memory (1B URLs) | 1.2GB (Bloom) | 1.2GB per node |
| Storage | Local disk | S3 (unlimited) |
| Network | 1 Gbps | 10 Gbps+ |
## ποΈ Architecture
```
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β Frontier β β Fetcher β β Parser β
β β β β β β
β β’ URL Queue βββββΆβ β’ HTTP Client βββββΆβ β’ HTML Parser β
β β’ Prioritizationβ β β’ Retry Logic β β β’ Link Extract β
β β’ Deduplication β β β’ Rate Limit β β β’ Metadata β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β β β
βΌ βΌ βΌ
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β Deduplication β β Storage β β Robots.txt β
β β β β β β
β β’ Bloom Filter β β β’ File System β β β’ Parser β
β β’ Simhash β β β’ S3/MinIO β β β’ Cache β
β β’ Redis β β β’ Metadata β β β’ Rules β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
```
## π Documentation
- [API Documentation](https://docs.rs/argus-cli)
- [Deployment Guide](DEPLOYMENT_GUIDE.md)
- [Scaling to 1B URLs](docs/SCALING_GUIDE.md)
- [Contributing](CONTRIBUTING.md)
## π‘ Examples
### Basic Crawling
```rust
use argus_cli::run_crawl;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
run_crawl(&[
"crawl",
"--seed-url", "https://example.com",
"--max-depth", "3",
"--storage-dir", "./data"
]).await
}
```
### Distributed Crawling
```rust
use argus_frontier::StreamFrontier;
use argus_dedupe::HybridSeenSet;
use argus_storage::S3Storage;
// Redis Streams for job distribution
let frontier = StreamFrontier::new(
"redis://localhost:6379",
Some("argus:jobs".to_string()),
Some("workers".to_string()),
"worker-1".to_string()
).await?;
// Bloom filter + Redis for deduplication
let seen = HybridSeenSet::new(
"redis://localhost:6379",
None,
1_000_000_000, // 1B URLs
0.01 // 1% false positive rate
).await?;
// S3 for unlimited storage
let storage = S3Storage::new(
"my-crawl-bucket".to_string(),
Some("crawl/".to_string())
).await?;
```
### JavaScript Rendering
```bash
# Build with JS support
cargo build --release --features js-render
# Crawl SPA sites
argus crawl \
--seed-url https://react-app.com \
--js-render \
--wait-for-selector "#content"
```
## π οΈ Development
### Setup
```bash
git clone https://github.com/yourusername/argus.git
cd argus
cargo build
cargo test
```
### Features
- `redis` - Enable Redis support (default)
- `s3` - Enable S3 storage
- `js-render` - Enable JavaScript rendering
- `all-features` - Enable everything
```bash
# Build with all features
cargo build --all-features
# Run tests with all features
cargo test --all-features
```
## π¦ Crates
This is a workspace with the following crates:
- [argus-cli](https://crates.io/crates/argus-cli) - Command-line interface
- [argus-common](https://crates.io/crates/argus-common) - Common types and utilities
- [argus-fetcher](https://crates.io/crates/argus-fetcher) - HTTP fetching with retry logic
- [argus-parser](https://crates.io/crates/argus-parser) - HTML and sitemap parsing
- [argus-dedupe](https://crates.io/crates/argus-dedupe) - Content deduplication with Simhash
- [argus-storage](https://crates.io/crates/argus-storage) - Storage backends
- [argus-frontier](https://crates.io/crates/argus-frontier) - URL frontier implementations
- [argus-robots](https://crates.io/crates/argus-robots) - Robots.txt parsing
- [argus-worker](https://crates.io/crates/argus-worker) - Worker implementation
- [argus-config](https://crates.io/crates/argus-config) - Configuration management
## π³ Docker
### Basic Usage
```bash
# Pull image
docker pull yourusername/argus:latest
# Run crawl
docker run -v $(pwd)/data:/data yourusername/argus \
crawl --seed-url https://example.com --storage-dir /data
```
### With Docker Compose
```yaml
version: '3.8'
services:
redis:
image: redis:7-alpine
ports:
- "6379:6379"
argus:
image: yourusername/argus:latest
command: crawl --redis-url redis://redis:6379
volumes:
- ./data:/data
depends_on:
- redis
```
## π€ Contributing
Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md).
### Quick Start
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## π License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## π Acknowledgments
- Built with [Rust](https://www.rust-lang.org/)
- Inspired by [Scrapy](https://scrapy.org/) and [Nutch](https://nutch.apache.org/)
- Icons by [Feather Icons](https://feathericons.com/)
## π Links
- [Website](https://yourusername.github.io/argus)
- [Documentation](https://docs.rs/argus-cli)
- [Crates.io](https://crates.io/crates/argus-cli)
- [Docker Hub](https://hub.docker.com/r/yourusername/argus)
- [GitHub](https://github.com/yourusername/argus)
---
<div align="center">
**[β Star us on GitHub!](https://github.com/yourusername/argus)**
Made with β€οΈ by the Argus contributors
</div>