argus-common 0.1.0

Common types and utilities for the Argus web crawler
Argus Web Crawler

A production-ready web crawler written in Rust, capable of handling billions of URLs with advanced features like content deduplication, distributed crawling, and JavaScript rendering.

⚑ Quick Start

Installation

πŸ“¦ Cargo (Recommended)

cargo install argus-cli

🍺 Homebrew (macOS)

brew install argus

🐧 Snap (Linux)

snap install argus

πŸͺŸ Chocolatey (Windows)

choco install argus

🐳 Docker

docker run yourusername/argus crawl --seed-url https://example.com

Basic Usage

# Simple crawl
argus crawl --seed-url https://example.com --storage-dir ./data

# Distributed crawling with Redis
argus crawl --redis-url redis://localhost:6379 --workers 8

# JavaScript rendering (build with js-render feature)
argus crawl --seed-url https://spa-example.com --js-render

πŸš€ Features

Core Features

  • βœ… Robust Error Handling - Automatic retry with exponential backoff
  • βœ… Robots.txt Compliance - Full respect for crawl rules
  • βœ… Graceful Shutdown - Clean interruption on SIGTERM/SIGINT
  • βœ… Rate Limiting - Configurable delays per domain
  • βœ… Content Limits - Size limits for HTML, text, and binary content
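
The retry behavior above follows the usual exponential-backoff pattern: double the delay after each failed attempt, up to a cap. A minimal sketch of that schedule (the constants here are illustrative, not Argus's actual defaults):

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based): base * 2^attempt, capped at `max`.
/// The base and cap values are illustrative only.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    base.checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(max)
        .min(max)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    for attempt in 0..6 {
        // 500ms, 1s, 2s, 4s, 8s, 16s ... then capped at 30s
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt, base, max));
    }
}
```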

Advanced Features

  • πŸ”„ Content Deduplication - Simhash-based near-duplicate detection
  • 🌐 JavaScript Rendering - Headless Chrome support for SPAs
  • πŸ“Š Metadata Extraction - Canonical URLs, hreflang, meta tags
  • πŸ—ΊοΈ Sitemap Parsing - Auto-discovery and parsing of sitemaps
  • πŸ“¦ Multiple Storage Backends - File system or S3-compatible storage

Scalability Features

  • 🧠 Bloom Filter Deduplication - 1B URLs in only 1.2GB RAM
  • πŸ”€ Distributed Crawling - Redis-based coordination
  • 🌊 Redis Streams - High-throughput job distribution
  • ☁️ Object Storage - Unlimited scaling with S3/MinIO
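
The 1.2GB figure follows from the standard Bloom filter sizing formula m = -n·ln(p)/(ln 2)², with optimal hash count k = (m/n)·ln 2. This sketch evaluates it for 1B URLs at a 1% false-positive rate:

```rust
/// Bits required for a Bloom filter holding `n` items at false-positive
/// rate `p`: m = -n * ln(p) / (ln 2)^2; optimal hash count k = (m/n) * ln 2.
fn bloom_size(n: f64, p: f64) -> (f64, f64) {
    let m = -n * p.ln() / (2f64.ln().powi(2));
    let k = (m / n) * 2f64.ln();
    (m, k)
}

fn main() {
    let (bits, hashes) = bloom_size(1e9, 0.01);
    // ~1.20 GB and 7 hash functions for 1B URLs at a 1% FP rate
    println!("{:.2} GB, {} hash functions", bits / 8.0 / 1e9, hashes.ceil());
}
```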

πŸ“ˆ Performance

| Metric           | Single Node   | Distributed (10 nodes) |
|------------------|---------------|------------------------|
| URLs/second      | 100–1,000     | 1,000–10,000           |
| Memory (1B URLs) | 1.2GB (Bloom) | 1.2GB per node         |
| Storage          | Local disk    | S3 (unlimited)         |
| Network          | 1 Gbps        | 10 Gbps+               |

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Frontier      β”‚    β”‚    Fetcher      β”‚    β”‚    Parser       β”‚
β”‚                 β”‚    β”‚                 β”‚    β”‚                 β”‚
β”‚ β€’ URL Queue     │───▢│ β€’ HTTP Client   │───▢│ β€’ HTML Parser   β”‚
β”‚ β€’ Prioritizationβ”‚    β”‚ β€’ Retry Logic   β”‚    β”‚ β€’ Link Extract  β”‚
β”‚ β€’ Deduplication β”‚    β”‚ β€’ Rate Limit    β”‚    β”‚ β€’ Metadata      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β–Ό                       β–Ό                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Deduplication β”‚    β”‚    Storage      β”‚    β”‚   Robots.txt    β”‚
β”‚                 β”‚    β”‚                 β”‚    β”‚                 β”‚
β”‚ β€’ Bloom Filter  β”‚    β”‚ β€’ File System   β”‚    β”‚ β€’ Parser        β”‚
β”‚ β€’ Simhash       β”‚    β”‚ β€’ S3/MinIO      β”‚    β”‚ β€’ Cache         β”‚
β”‚ β€’ Redis         β”‚    β”‚ β€’ Metadata      β”‚    β”‚ β€’ Rules         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
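
The stages above hand work to one another through queues. A toy sketch of that pipeline shape using std threads and channels (illustrative only; the real crates wire the stages together as async tasks with far richer logic):

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the Fetcher stage: pretend to download a URL.
fn fake_fetch(url: &str) -> String {
    format!("<html>body of {url}</html>")
}

/// Stand-in for the Parser stage: pretend to process a page.
fn fake_parse(page: &str) -> usize {
    page.len()
}

fn main() {
    // frontier -> fetcher -> parser, connected by channels
    let (frontier_tx, fetch_rx) = mpsc::channel::<String>();
    let (fetch_tx, parse_rx) = mpsc::channel::<String>();

    let fetcher = thread::spawn(move || {
        for url in fetch_rx {
            fetch_tx.send(fake_fetch(&url)).unwrap();
        }
    });
    let parser = thread::spawn(move || {
        for page in parse_rx {
            println!("parsed {} bytes", fake_parse(&page));
        }
    });

    // Frontier: seed the queue, then close it so the stages drain and exit.
    frontier_tx.send("https://example.com".to_string()).unwrap();
    drop(frontier_tx);

    fetcher.join().unwrap();
    parser.join().unwrap();
}
```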

πŸ’‘ Examples

Basic Crawling

use argus_cli::run_crawl;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    run_crawl(&[
        "crawl",
        "--seed-url", "https://example.com",
        "--max-depth", "3",
        "--storage-dir", "./data"
    ]).await
}

Distributed Crawling

use argus_frontier::StreamFrontier;
use argus_dedupe::HybridSeenSet;
use argus_storage::S3Storage;

// Redis Streams for job distribution
let frontier = StreamFrontier::new(
    "redis://localhost:6379",
    Some("argus:jobs".to_string()),
    Some("workers".to_string()),
    "worker-1".to_string()
).await?;

// Bloom filter + Redis for deduplication
let seen = HybridSeenSet::new(
    "redis://localhost:6379",
    None,
    1_000_000_000, // 1B URLs
    0.01 // 1% false positive rate
).await?;

// S3 for unlimited storage
let storage = S3Storage::new(
    "my-crawl-bucket".to_string(),
    Some("crawl/".to_string())
).await?;

JavaScript Rendering

# Build with JS support
cargo build --release --features js-render

# Crawl SPA sites
argus crawl \
  --seed-url https://react-app.com \
  --js-render \
  --wait-for-selector "#content"

πŸ› οΈ Development

Setup

git clone https://github.com/yourusername/argus.git
cd argus
cargo build
cargo test

Features

  • redis - Enable Redis support (default)
  • s3 - Enable S3 storage
  • js-render - Enable JavaScript rendering
  • all-features - Enable everything

# Build with all features
cargo build --all-features

# Run tests with all features
cargo test --all-features

πŸ“¦ Crates

This is a workspace with the following crates:

  • argus-common - Common types and utilities
  • argus-cli - Command-line interface
  • argus-frontier - URL frontier and job distribution
  • argus-dedupe - URL and content deduplication
  • argus-storage - Storage backends

🐳 Docker

Basic Usage

# Pull image
docker pull yourusername/argus:latest

# Run crawl
docker run -v $(pwd)/data:/data yourusername/argus \
  crawl --seed-url https://example.com --storage-dir /data

With Docker Compose

version: '3.8'
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  
  argus:
    image: yourusername/argus:latest
    command: crawl --redis-url redis://redis:6379
    volumes:
      - ./data:/data
    depends_on:
      - redis

🀝 Contributing

Contributions are welcome! Please read our Contributing Guide.

Quick Start

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

πŸ”— Links


⭐ Star us on GitHub!

Made with ❀️ by the Argus contributors