domain_status
domain_status is a Rust-based tool designed for high-performance concurrent checking of URL statuses and redirections. Built with async/await (Tokio), it processes URLs efficiently while capturing comprehensive metadata including TLS certificates, HTML content, DNS information, technology fingerprints, and redirect chains.
Table of Contents
- Quick Start
- Features
- Usage
- Configuration
- Data Captured
- Database Schema
- Output
- Performance
- Technical Details
- Architecture
- Development
- License
🚀 Quick Start
Installation
Option 1: Install via Cargo (Recommended for Rust users)
Requires Rust 1.83 or newer:
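The install command is the same one used for updates below:

```bash
cargo install domain_status
```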
This compiles from source and installs the binary to ~/.cargo/bin/domain_status (or %USERPROFILE%\.cargo\bin\domain_status.exe on Windows). The binary is added to your PATH automatically.
Benefits:
- ✅ No macOS Gatekeeper warnings (compiled locally)
- ✅ Works on any platform Rust supports
- ✅ Easy updates: `cargo install --force domain_status`
- ✅ No manual downloads or extraction needed
Note: This crate requires Rust 1.83 or newer. If installation fails, update your Rust toolchain: rustup update stable.
Option 2: Download Pre-built Binary
Download the latest release from the Releases page:
```bash
# Linux (x86_64)
# macOS (Intel)
# macOS (Apple Silicon)
# (download the archive for your platform from the Releases page and extract the binary)

# macOS: Handle Gatekeeper warning (unsigned binary)
# Option 1: Right-click the binary, select "Open", then click "Open" in the dialog
# Option 2: Remove the quarantine attribute (assumes the binary is in the current directory):
xattr -d com.apple.quarantine ./domain_status

# Windows
# Download domain_status-windows-x86_64.exe.zip and extract
```
Option 3: Build from Source
Requires Rust (stable toolchain):
```bash
# Clone the repository (substitute the project's Git URL)
git clone <repository-url> && cd domain_status
# Build release binary
cargo build --release
```
This creates an executable in ./target/release/domain_status (or domain_status.exe on Windows).
Basic Usage
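A minimal invocation sketch; the positional `<urls-file>` argument is illustrative (check the command-line help for the exact input syntax):

```bash
# Check every URL listed in <urls-file> (one per line)
domain_status <urls-file>
```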
The tool will:
- Process URLs from the input file
- Store results in `./url_checker.db` (SQLite)
- Display progress and statistics
- Handle errors gracefully with automatic retries
🌟 Features
- High-Performance Concurrency: Async/await with configurable concurrency limits (default: 30 concurrent requests, 15 RPS)
- Comprehensive URL Analysis: Captures HTTP status, response times, HTML metadata, TLS certificates, DNS information, technology fingerprints, GeoIP location data, WHOIS registration data, structured data (JSON-LD, Open Graph, Twitter Cards), security warnings, and complete redirect chains
- Technology Fingerprinting: Detects web technologies using community-maintained Wappalyzer rulesets with JavaScript execution for dynamic detection
- GeoIP Lookup: Automatic geographic and network information lookup using MaxMind GeoLite2 databases (auto-downloads if license key provided)
- Enhanced DNS Analysis: Queries NS, TXT, and MX records; automatically extracts SPF and DMARC policies
- Enhanced TLS Analysis: Captures cipher suite and key algorithm in addition to certificate details
- Intelligent Error Handling: Automatic retries with exponential backoff, error rate monitoring with dynamic throttling, and comprehensive processing statistics
- Rate Limiting: Token-bucket rate limiting with adaptive adjustment based on error rates
- Robust Data Storage: SQLite database with WAL mode, UPSERT semantics, and unique constraints for idempotent processing
- Flexible Configuration: Extensive CLI options for logging, timeouts, concurrency, rate limits, database paths, and fingerprint rulesets
- Security Features: URL validation (http/https only), content-type filtering, response size limits, and redirect hop limits
📖 Usage
Command-Line Options
Common Options:
- `--log-level <LEVEL>`: Log level: `error`, `warn`, `info`, `debug`, or `trace` (default: `info`)
- `--log-format <FORMAT>`: Log format: `plain` or `json` (default: `plain`)
- `--db-path <PATH>`: SQLite database file path (default: `./url_checker.db`)
- `--max-concurrency <N>`: Maximum concurrent requests (default: 30)
- `--timeout-seconds <N>`: HTTP client timeout in seconds (default: 10). Note: the per-URL processing timeout is 35 seconds.
- `--rate-limit-rps <N>`: Initial requests per second (adaptive rate limiting is always enabled; default: 15)
- `--show-timing`: Display detailed timing metrics at the end of the run (default: disabled)
- `--status-port <PORT>`: Start an HTTP status server on the specified port (optional, disabled by default)
Advanced Options:
- `--user-agent <STRING>`: HTTP User-Agent header value (default: a Chrome user agent)
- `--fingerprints <URL|PATH>`: Technology fingerprint ruleset source (URL or local path). Default: the HTTP Archive Wappalyzer fork. Rules are cached locally for 7 days.
- `--geoip <PATH|URL>`: GeoIP database path (MaxMind GeoLite2 `.mmdb` file) or download URL. If not provided, the database is auto-downloaded when the `MAXMIND_LICENSE_KEY` environment variable is set.
- `--enable-whois`: Enable WHOIS/RDAP lookup for domain registration information. WHOIS data is cached for 7 days. Default: disabled.
Example:
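For instance, a run combining several of the options above (the `<urls-file>` placeholder is illustrative; all flags are documented in the lists above):

```bash
domain_status \
  --max-concurrency 50 \
  --rate-limit-rps 20 \
  --timeout-seconds 15 \
  --log-format json \
  --db-path ./results.db \
  --enable-whois \
  <urls-file>
```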
Monitoring Long-Running Jobs
For long-running jobs, you can monitor progress via an optional HTTP status server:
```bash
# Start with status server on port 8080 (pass --status-port 8080 along with your usual arguments)

# In another terminal, check progress:
curl http://localhost:8080/status

# Or view Prometheus metrics:
curl http://localhost:8080/metrics
```
The status server provides:
- Real-time progress: Total URLs, completed, failed, percentage complete, processing rate
- Error breakdown: Detailed counts by error type
- Warning/info metrics: Track missing metadata, redirects, bot detection events
- Prometheus compatibility: Metrics endpoint ready for Prometheus scraping
See Status Endpoint and Metrics Endpoint sections below for detailed API documentation.
Environment Variables
- `MAXMIND_LICENSE_KEY`: MaxMind license key for automatic GeoIP database downloads. Get a free key from MaxMind. If not set, GeoIP lookup is disabled and the application continues normally.
- `URL_CHECKER_DB_PATH`: Override the default database path (alternative to `--db-path`)
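For example, to enable automatic GeoIP downloads for a run (the `<urls-file>` placeholder is illustrative):

```bash
export MAXMIND_LICENSE_KEY=<your-license-key>
domain_status <urls-file>
```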
URL Input
- URLs can be provided with or without an `http://` or `https://` prefix
- If no scheme is provided, `https://` is automatically prepended
- Only `http://` and `https://` URLs are accepted; other schemes are rejected
- Invalid URLs are skipped with a warning
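An example input file following these rules (one entry per line; the file name is arbitrary):

```bash
cat > urls.txt <<'EOF'
example.com
http://example.org
https://example.net/page
EOF
```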
⚙️ Configuration
Status Endpoint (/status)
Returns detailed JSON status with real-time progress information:
```bash
# Query a run started with --status-port 8080
curl http://localhost:8080/status
```

Response format: a JSON object containing the progress counters (total, completed, failed, percentage complete, processing rate), the error breakdown, and the warning/info metrics described in the monitoring section above.
Metrics Endpoint (/metrics)
Returns Prometheus-compatible metrics in text format:
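A quick way to inspect the exposition output (metric names are listed below; the sample values are illustrative):

```bash
curl http://localhost:8080/metrics
# Sample lines:
#   domain_status_completed_urls 88
#   domain_status_percentage_complete 88.0
```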
Metrics:
- `domain_status_total_urls` (gauge): Total URLs to process
- `domain_status_completed_urls` (gauge): Successfully processed URLs
- `domain_status_failed_urls` (gauge): Failed URLs
- `domain_status_percentage_complete` (gauge): Completion percentage (0-100)
- `domain_status_rate_per_second` (gauge): Processing rate (URLs/sec)
- `domain_status_errors_total` (counter): Total error count
- `domain_status_warnings_total` (counter): Total warning count
- `domain_status_info_total` (counter): Total info event count
Prometheus Integration:
```yaml
scrape_configs:
  - job_name: 'domain_status'
    static_configs:
      - targets: ['localhost:8080']   # match the host and --status-port of your run
```
📊 Data Captured
The tool captures comprehensive information for each URL. The database uses a normalized star schema with a fact table (url_status) and multiple dimension/junction tables for multi-valued fields.
Data Types:
- HTTP/HTTPS: Status codes, response times, headers (security and general), redirect chains
- TLS/SSL: Certificate details, cipher suite, key algorithm, certificate OIDs, SANs
- DNS: NS, TXT, MX records; SPF and DMARC policies; reverse DNS
- HTML: Title, meta tags, structured data (JSON-LD, Open Graph, Twitter Cards), analytics IDs, social media links
- Technology Detection: CMS, frameworks, analytics tools detected via Wappalyzer rulesets
- GeoIP: Geographic location (country, region, city, coordinates) and network information (ASN)
- WHOIS: Domain registration information (registrar, creation/expiration dates, registrant info)
- Security: Security headers, security warnings, certificate validation
For detailed table descriptions and schema information, see DATABASE.md.
📊 Database Schema
The database uses a star schema design pattern with:
- Fact Table: `url_status` (main URL data)
- Dimension Table: `runs` (run-level metadata)
- Junction Tables: Multi-valued fields (technologies, headers, DNS records, etc.)
- One-to-One Tables: `url_geoip`, `url_whois`
- Failure Tracking: `url_failures` with related tables for error context (redirect chains, request/response headers)
Key Features:
- WAL mode for concurrent reads/writes
- UPSERT semantics: `UNIQUE (final_domain, timestamp)` ensures idempotency
- Comprehensive indexes for fast queries
- Normalized structure for efficient storage and analytics
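For a quick look at stored results, the database can be queried directly with the `sqlite3` CLI. The sketch below uses only the table and column names mentioned in this README; see DATABASE.md for the full schema:

```bash
sqlite3 ./url_checker.db \
  "SELECT final_domain, timestamp FROM url_status ORDER BY timestamp DESC LIMIT 10;"
```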
For complete database schema documentation including entity-relationship diagrams, table descriptions, indexes, constraints, and query examples, see DATABASE.md.
📊 Output
The tool provides detailed logging with progress updates and error summaries:
Plain format (default):
```
✔️ domain_status [INFO] Processed 88 lines in 128.61 seconds (~0.68 lines/sec)
✔️ domain_status [INFO] Run statistics: total=100, successful=88, failed=12
✔️ domain_status [INFO] Error Counts (21 total):
✔️ domain_status [INFO] Bot detection (403 Forbidden): 4
✔️ domain_status [INFO] Process URL timeout: 3
✔️ domain_status [INFO] DNS NS lookup error: 2
...
```
Performance Analysis (`--show-timing`):
Use the `--show-timing` flag to display detailed timing metrics:
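A typical invocation (the `<urls-file>` placeholder is illustrative):

```bash
domain_status --show-timing <urls-file>
```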
Example output:
```
=== Timing Metrics Summary (88 URLs) ===
Average times per URL:
HTTP Request: 1287 ms (40.9%)
DNS Forward: 845 ms (26.8%)
TLS Handshake: 1035 ms (32.9%)
HTML Parsing: 36 ms (1.1%)
Tech Detection: 1788 ms (56.8%)
Total: 3148 ms
```

(Percentages can sum to more than 100% because technology detection runs in parallel with the other stages.)
JSON format (`--log-format json`): emits each log event as a structured JSON object instead of the plain-text lines above.
Note: Performance varies significantly based on rate limiting, network conditions, target server behavior, and error handling. Expect 0.5-2 lines/sec with default settings. Higher rates may trigger bot detection.
🚀 Performance & Scalability
- Concurrent Processing: Default 30 concurrent requests (configurable via `--max-concurrency`)
- Adaptive Rate Limiting: Automatic RPS adjustment based on error rates (always enabled)
  - Starts at the initial RPS (default: 15)
  - Monitors 429 errors and timeouts in a sliding window
  - Automatically reduces RPS by 50% when the error rate exceeds the threshold (default: 20%)
  - Gradually increases RPS by 15% when the error rate is below the threshold
  - Maximum RPS is capped at 2x the initial value
- Resource Efficiency: Shared HTTP clients, DNS resolver, and HTML parser instances
- Database Optimization: SQLite WAL mode for concurrent writes, indexed queries
- Memory Safety: Response body size capped at 2MB, redirect chains limited to 10 hops
- Timeout Protection: Per-URL processing timeout (35 seconds) prevents hung requests
Retry & Error Handling:
- Automatic Retries: Exponential backoff (initial: 500ms, max: 15s, max attempts: 3)
- Error Rate Limiting: Monitors error rate and automatically throttles when threshold exceeded
- Processing Statistics: Comprehensive tracking with errors, warnings, and info metrics
- Graceful Degradation: Invalid URLs skipped, non-HTML responses filtered, oversized responses truncated
🛠️ Technical Details
Dependencies:
- HTTP Client: `reqwest` with the `rustls` TLS backend
- DNS Resolution: `hickory-resolver` (async DNS with system config fallback)
- Domain Extraction: `tldextract` for accurate domain parsing (handles multi-part TLDs correctly)
- HTML Parsing: `scraper` (CSS selector-based extraction)
- TLS/Certificates: `tokio-rustls` and `x509-parser` for certificate analysis
- Technology Detection: Custom implementation using Wappalyzer rulesets with JavaScript execution via `rquickjs`
- WHOIS/RDAP: `whois-service` crate for domain registration lookups
- GeoIP: `maxminddb` for geographic and network information
- Database: `sqlx` with SQLite (WAL mode enabled)
- Async Runtime: Tokio
Security:
- Security Audit: `cargo-audit` runs in CI to detect known vulnerabilities in dependencies (uses the RustSec advisory database)
- Secret Scanning: `gitleaks` scans commits and code for accidentally committed secrets, API keys, tokens, and credentials
- Code Quality: Clippy with `-D warnings` enforces strict linting rules and catches security issues
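The same checks can be run locally before pushing; a minimal sketch, assuming `cargo-audit` and `gitleaks` are installed:

```bash
cargo audit                   # check dependencies against the RustSec advisory database
gitleaks detect --source .    # scan the repository for committed secrets
cargo clippy -- -D warnings   # strict lint pass
```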
🏗️ Architecture
domain_status follows a pipeline architecture:
Input File → URL Validation → Concurrent Processing → Data Extraction → Direct Database Writes → SQLite Database
Core Components:
- Main Orchestrator: Reads URLs, validates, normalizes, manages concurrency
- HTTP Request Handler: Fetches URLs, follows redirects, extracts response data
- Data Extraction: Parses HTML, detects technologies, queries DNS/TLS/GeoIP/WHOIS (parallelized where possible)
- Database Writer: Direct writes to SQLite (WAL mode handles concurrency efficiently)
- Error Handling: Categorizes errors, implements retries with exponential backoff
- Rate Limiting: Token-bucket algorithm with adaptive adjustment
Concurrency Model:
- Async runtime: Tokio
- Concurrency control: Semaphore limits concurrent tasks
- Rate limiting: Token-bucket with adaptive adjustment
- Background tasks: Status server (optional), adaptive rate limiter
- Graceful shutdown: All background tasks are cancellable via `CancellationToken`
- Parallel execution: Technology detection and DNS/TLS fetching run in parallel (independent operations)
Performance Characteristics:
- Non-blocking I/O for all network operations
- Shared resources (HTTP client, DNS resolver, database pool) across tasks
- Bounded concurrency prevents resource exhaustion
- Direct database writes with SQLite WAL mode (efficient concurrent writes)
- Memory efficiency: Response bodies limited to 2MB, HTML text extraction limited to 50KB
🔒 Security & Secret Management
Preventing Credential Leaks:
- Pre-commit hooks (recommended): Install pre-commit hooks to catch secrets before they're committed (see the setup sketch at the end of this section). The hooks automatically scan for secrets before every commit.
- CI scanning: Gitleaks runs in CI to catch secrets in pull requests and scan git history.
- GitHub Secret Scanning: GitHub automatically scans public repositories for known secret patterns (enabled by default).
- Best practices:
  - Never commit `.env` files (already in `.gitignore`)
  - Use environment variables for all secrets
  - Use GitHub Secrets for CI/CD tokens
  - Review gitleaks output if CI fails
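A minimal sketch of the pre-commit setup referenced above, assuming the repository ships a pre-commit configuration with a gitleaks hook:

```bash
# Install pre-commit (if not already installed)
pip install pre-commit      # or use your OS package manager
# Install the git hooks defined in the repository
pre-commit install
```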
License
MIT License - see LICENSE file for details.