argus-crawler 0.1.0

A production-ready web crawler capable of handling billions of URLs
# Argus Web Crawler

<div align="center">

[![Crates.io](https://img.shields.io/crates/v/argus-cli.svg)](https://crates.io/crates/argus-cli)
[![Documentation](https://docs.rs/argus-cli/badge.svg)](https://docs.rs/argus-cli)
[![Build Status](https://github.com/yourusername/argus/workflows/CI/badge.svg)](https://github.com/yourusername/argus/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A production-ready web crawler written in Rust, capable of handling **billions of URLs** with advanced features like content deduplication, distributed crawling, and JavaScript rendering.

</div>

## ⚡ Quick Start

### Installation

#### 📦 Cargo (Recommended)
```bash
cargo install argus-cli
```

#### 🍺 Homebrew (macOS)
```bash
brew install argus
```

#### 🐧 Snap (Linux)
```bash
snap install argus
```

#### 🪟 Chocolatey (Windows)
```bash
choco install argus
```

#### 🐳 Docker
```bash
docker run yourusername/argus crawl --seed-url https://example.com
```

### Basic Usage

```bash
# Simple crawl
argus crawl --seed-url https://example.com --storage-dir ./data

# Distributed crawling with Redis
argus crawl --redis-url redis://localhost:6379 --workers 8

# JavaScript rendering (build with js-render feature)
argus crawl --seed-url https://spa-example.com --js-render
```

## 🚀 Features

### Core Features
- ✅ **Robust Error Handling** - Automatic retry with exponential backoff
- ✅ **Robots.txt Compliance** - Full respect for crawl rules
- ✅ **Graceful Shutdown** - Clean interruption on SIGTERM/SIGINT
- ✅ **Rate Limiting** - Configurable delays per domain
- ✅ **Content Limits** - Size limits for HTML, text, and binary content
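
The retry behaviour listed above can be pictured as a delay schedule that doubles after each failure. The following is a minimal sketch using only the standard library, not the crawler's actual implementation; the names `backoff_delay` and `retry_with_backoff`, and the idea of a fixed cap, are assumptions made for illustration.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// Hypothetical helper for illustration; not part of the Argus API.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating so that large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Calls `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping with exponential backoff between failures.
/// `op` always runs at least once.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base and a 5 s cap, the delays run 100 ms, 200 ms, 400 ms, 800 ms, and so on until they reach the cap. An async crawler would typically swap the blocking sleep for its runtime's timer and add random jitter so that workers do not retry in lockstep.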

### Advanced Features
- 🔄 **Content Deduplication** - Simhash-based near-duplicate detection
- 🌐 **JavaScript Rendering** - Headless Chrome support for SPAs
- 📊 **Metadata Extraction** - Canonical URLs, hreflang, meta tags
- 🗺️ **Sitemap Parsing** - Auto-discovery and parsing of sitemaps
- 📦 **Multiple Storage Backends** - File system or S3-compatible storage
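
To make the Simhash idea concrete, here is a minimal sketch of a 64-bit fingerprint and a Hamming-distance comparison. It is an illustration under simplifying assumptions (unweighted whitespace tokens, the standard library's `DefaultHasher`), not the code in `argus-dedupe`; the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit Simhash over whitespace-separated tokens, each with weight 1.
/// Illustrative only: production systems usually hash shingles or weighted features.
pub fn simhash(text: &str) -> u64 {
    let mut weights = [0i32; 64];
    for token in text.split_whitespace() {
        // DefaultHasher::new() uses fixed keys, so results are repeatable within
        // one Rust release, though the algorithm may change between releases.
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        let h = hasher.finish();
        // Each token votes +1 or -1 on every bit position.
        for (bit, weight) in weights.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *weight += 1;
            } else {
                *weight -= 1;
            }
        }
    }
    // A fingerprint bit is set where the positive votes won.
    weights
        .iter()
        .enumerate()
        .fold(0u64, |acc, (bit, &w)| if w > 0 { acc | (1u64 << bit) } else { acc })
}

/// Number of bit positions in which two fingerprints differ.
pub fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Treats two documents as near-duplicates when their fingerprints
/// differ in at most `max_distance` bits.
pub fn is_near_duplicate(a: &str, b: &str, max_distance: u32) -> bool {
    hamming_distance(simhash(a), simhash(b)) <= max_distance
}
```

Documents that share most of their tokens tend to produce fingerprints only a few bits apart, so a small `max_distance` threshold flags them as near-duplicates without comparing full text. The right threshold depends on the corpus and should be tuned empirically.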

### Scalability Features
- 🧠 **Bloom Filter Deduplication** - Tracks 1B URLs in about 1.2 GB of RAM at a 1% false-positive rate
- 🔀 **Distributed Crawling** - Redis-based coordination
- 🌊 **Redis Streams** - High-throughput job distribution
- ☁️ **Object Storage** - Unlimited scaling with S3/MinIO
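
The Bloom filter memory figure follows from the standard sizing formulas: for `n` items and a target false-positive rate `p`, the optimal bit count is `m = -n * ln(p) / (ln 2)^2` and the optimal number of hash functions is `k = (m / n) * ln 2`. The sketch below evaluates them; the function names are hypothetical, and the 1% rate is taken from the `HybridSeenSet` example in this README.

```rust
/// Optimal number of bits for a Bloom filter holding `n` items
/// at false-positive rate `p`: m = -n * ln(p) / (ln 2)^2.
pub fn bloom_bits(n: u64, p: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64
}

/// Optimal number of hash functions for `bits` bits and `n` items:
/// k = (m / n) * ln 2, rounded to the nearest integer and at least 1.
pub fn bloom_hashes(n: u64, bits: u64) -> u32 {
    ((bits as f64 / n as f64) * std::f64::consts::LN_2)
        .round()
        .max(1.0) as u32
}
```

For `n = 1_000_000_000` and `p = 0.01` this works out to roughly 9.6 bits per URL, or about 1.2 GB in total, with 7 hash functions. Tightening the rate to 0.1% raises the cost to roughly 14.4 bits per URL.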

## 📈 Performance

| Metric | Single Node | Distributed (10 nodes) |
|--------|-------------|------------------------|
| URLs/second | 100-1000 | 1000-10000 |
| Memory (1B URLs) | 1.2GB (Bloom) | 1.2GB per node |
| Storage | Local disk | S3 (unlimited) |
| Network | 1 Gbps | 10 Gbps+ |

## 🏗️ Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontier      │    │    Fetcher      │    │    Parser       │
│                 │    │                 │    │                 │
│ • URL Queue     │───▶│ • HTTP Client   │───▶│ • HTML Parser   │
│ • Prioritization│    │ • Retry Logic   │    │ • Link Extract  │
│ • Deduplication │    │ • Rate Limit    │    │ • Metadata      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Deduplication │    │    Storage      │    │   Robots.txt    │
│                 │    │                 │    │                 │
│ • Bloom Filter  │    │ • File System   │    │ • Parser        │
│ • Simhash       │    │ • S3/MinIO      │    │ • Cache         │
│ • Redis         │    │ • Metadata      │    │ • Rules         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
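
The flow in the diagram (frontier, then fetcher, then parser, with discovered links fed back through deduplication) can be sketched as a small in-memory loop. This is a simplified illustration, not the crawler's real types: `Frontier`, `crawl`, and the `fetch_and_parse` callback are hypothetical stand-ins, and a plain `HashSet` takes the place of the Bloom filter and Redis.

```rust
use std::collections::{HashSet, VecDeque};

/// In-memory frontier: a FIFO queue plus a seen-set so each URL is queued once.
#[derive(Default)]
pub struct Frontier {
    queue: VecDeque<(String, u32)>,
    seen: HashSet<String>,
}

impl Frontier {
    pub fn new() -> Self {
        Self::default()
    }

    /// Queues `url` at `depth` unless it was seen before; returns whether it was added.
    pub fn push(&mut self, url: &str, depth: u32) -> bool {
        if self.seen.insert(url.to_string()) {
            self.queue.push_back((url.to_string(), depth));
            true
        } else {
            false
        }
    }

    /// Takes the next URL to crawl, oldest first.
    pub fn pop(&mut self) -> Option<(String, u32)> {
        self.queue.pop_front()
    }
}

/// Breadth-first crawl loop. `fetch_and_parse` stands in for the Fetcher and
/// Parser stages: given a URL, it returns the links found on that page.
/// Returns the URLs in the order they were visited.
pub fn crawl(
    seed: &str,
    max_depth: u32,
    mut fetch_and_parse: impl FnMut(&str) -> Vec<String>,
) -> Vec<String> {
    let mut frontier = Frontier::new();
    frontier.push(seed, 0);
    let mut visited = Vec::new();
    while let Some((url, depth)) = frontier.pop() {
        let links = fetch_and_parse(&url);
        visited.push(url);
        // Links found at the depth limit are not followed.
        if depth < max_depth {
            for link in links {
                frontier.push(&link, depth + 1);
            }
        }
    }
    visited
}
```

Passing a closure that looks pages up in a fixed map makes the loop easy to test without network access. Robots.txt checks, rate limiting, and storage would hook in around the `fetch_and_parse` call.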

## 📚 Documentation

- [API Documentation](https://docs.rs/argus-cli)
- [Deployment Guide](DEPLOYMENT_GUIDE.md)
- [Scaling to 1B URLs](docs/SCALING_GUIDE.md)
- [Contributing](CONTRIBUTING.md)

## 💡 Examples

### Basic Crawling
```rust
use argus_cli::run_crawl;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    run_crawl(&[
        "crawl",
        "--seed-url", "https://example.com",
        "--max-depth", "3",
        "--storage-dir", "./data"
    ]).await
}
```

### Distributed Crawling
```rust
use argus_frontier::StreamFrontier;
use argus_dedupe::HybridSeenSet;
use argus_storage::S3Storage;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Redis Streams for job distribution
    let frontier = StreamFrontier::new(
        "redis://localhost:6379",
        Some("argus:jobs".to_string()),
        Some("workers".to_string()),
        "worker-1".to_string()
    ).await?;

    // Bloom filter + Redis for deduplication
    let seen = HybridSeenSet::new(
        "redis://localhost:6379",
        None,
        1_000_000_000, // 1B URLs
        0.01 // 1% false positive rate
    ).await?;

    // S3 for unlimited storage
    let storage = S3Storage::new(
        "my-crawl-bucket".to_string(),
        Some("crawl/".to_string())
    ).await?;

    // Hand `frontier`, `seen`, and `storage` to your workers here.
    Ok(())
}
```

### JavaScript Rendering
```bash
# Build with JS support
cargo build --release --features js-render

# Crawl SPA sites
argus crawl \
  --seed-url https://react-app.com \
  --js-render \
  --wait-for-selector "#content"
```

## 🛠️ Development

### Setup
```bash
git clone https://github.com/yourusername/argus.git
cd argus
cargo build
cargo test
```

### Features
- `redis` - Enable Redis support (default)
- `s3` - Enable S3 storage
- `js-render` - Enable JavaScript rendering
- `--all-features` - Cargo flag that enables every feature above

```bash
# Build with all features
cargo build --all-features

# Run tests with all features
cargo test --all-features
```

## 📦 Crates

This is a workspace with the following crates:

- [argus-cli](https://crates.io/crates/argus-cli) - Command-line interface
- [argus-common](https://crates.io/crates/argus-common) - Common types and utilities
- [argus-fetcher](https://crates.io/crates/argus-fetcher) - HTTP fetching with retry logic
- [argus-parser](https://crates.io/crates/argus-parser) - HTML and sitemap parsing
- [argus-dedupe](https://crates.io/crates/argus-dedupe) - Content deduplication with Simhash
- [argus-storage](https://crates.io/crates/argus-storage) - Storage backends
- [argus-frontier](https://crates.io/crates/argus-frontier) - URL frontier implementations
- [argus-robots](https://crates.io/crates/argus-robots) - Robots.txt parsing
- [argus-worker](https://crates.io/crates/argus-worker) - Worker implementation
- [argus-config](https://crates.io/crates/argus-config) - Configuration management

## 🐳 Docker

### Basic Usage
```bash
# Pull image
docker pull yourusername/argus:latest

# Run crawl
docker run -v "$(pwd)/data:/data" yourusername/argus \
  crawl --seed-url https://example.com --storage-dir /data
```

### With Docker Compose
```yaml
version: '3.8'
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  
  argus:
    image: yourusername/argus:latest
    command: crawl --redis-url redis://redis:6379
    volumes:
      - ./data:/data
    depends_on:
      - redis
```

## 🤝 Contributing

Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md).

### Quick Start
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Built with [Rust](https://www.rust-lang.org/)
- Inspired by [Scrapy](https://scrapy.org/) and [Nutch](https://nutch.apache.org/)
- Icons by [Feather Icons](https://feathericons.com/)

## 🔗 Links

- [Website](https://yourusername.github.io/argus)
- [Documentation](https://docs.rs/argus-cli)
- [Crates.io](https://crates.io/crates/argus-cli)
- [Docker Hub](https://hub.docker.com/r/yourusername/argus)
- [GitHub](https://github.com/yourusername/argus)

---

<div align="center">

**[⭐ Star us on GitHub!](https://github.com/yourusername/argus)**

Made with ❤️ by the Argus contributors

</div>