crawn 0.1.0 is a command-line tool, not a library.
Fast async web crawler with smart keyword filtering
Features
- Blazing fast – Built with Rust + tokio for async I/O
- Smart filtering – URL-based keyword matching (no content fetching required)
- NDJSON output – One JSON object per line for easy streaming
- BFS crawling – Breadth-first traversal with configurable depth limits
- Rate limiting – Randomized delay between requests (default: 200–500 ms)
- Error recovery – Gracefully handles network errors and broken links
- Rich logging – Colored, timestamped logs with context chains
Installation
Install with cargo:
- Or build from source:
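For example (the crate name on crates.io and the repository URL below are assumptions, not verified):

```shell
# Install from crates.io (assumed crate name)
cargo install crawn

# Or build from source (assumed repository URL)
git clone https://github.com/example/crawn
cd crawn
cargo build --release
```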
Usage
- Basic Crawling:
- With Logging:
- Verbose Mode (Log All Requests):
- Custom Depth Limit:
- Full HTML (--include-content):
- Extracted text only (--include-text):
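The bullets above can be sketched as invocations; apart from --include-text and --include-content (described under Output Format), the flag names here are assumptions, not confirmed options of crawn:

```shell
# Basic crawl of a seed URL
crawn https://example.com

# With logging (assumed --log flag)
crawn https://example.com --log crawl.log

# Verbose mode: log every request (assumed --verbose flag)
crawn https://example.com --verbose

# Custom depth limit (assumed --max-depth flag; default is 4)
crawn https://example.com --max-depth 2

# Full HTML of each page
crawn https://example.com --include-content

# Extracted text only
crawn https://example.com --include-text
```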
Output Format
Results are written as NDJSON (newline-delimited JSON):
- With --include-text:
- With --include-content:
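A sketch of the output; the exact field names are assumptions, not taken from the crate. The first line shows the default record, the second adds a text field via --include-text, and the third adds a content field via --include-content:

```json
{"url":"https://example.com/rust-tutorials/async","depth":1}
{"url":"https://example.com/rust-tutorials/await","depth":1,"text":"Async and await in Rust..."}
{"url":"https://example.com/rust-tutorials/tokio","depth":1,"content":"<!DOCTYPE html>..."}
```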
How It Works
- BFS Crawling:
- Starts at the seed URL (depth 0)
- Discovers links on each page
- Processes links level-by-level (breadth-first)
- Stops at max_depth (default: 4)
- Keyword Filtering:
- Extracts "keywords" from URL paths (sanitized, lowercased)
- Splits by /, -, _ (e.g., /rust-tutorials/async → ["rust", "tutorials", "async"])
- Filters stop words, numbers, short words (<3 chars)
- Matches candidate URLs against base keywords
- Result: Only crawls relevant pages, skips off-topic content
- Rate Limiting:
- Random delay between 200 and 500 ms before each request
- Prevents server overload and IP bans
- Configurable via code (not exposed as CLI flag yet)
- Error Handling:
- Network errors: Logged as warnings, crawling continues
- HTTP 404/500: Skipped, logged as warnings
- Parse failures: Logged, returns empty JSON
- Fatal errors: Printed to stdout with full context chain
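The keyword-filtering step described above can be approximated with standard shell tools (an illustrative sketch, not crawn's actual code; the stop-word list here is made up):

```shell
# Split the URL path on '/', '-', '_', lowercase the tokens, then drop
# non-alphabetic tokens, words shorter than 3 characters, and stop words.
path="/The-Rust-Book/async_101"
printf '%s\n' "$path" \
  | tr '/_-' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | grep -E '^[a-z]{3,}$' \
  | grep -vE '^(the|a|an|and|or|of|to|in)$'
# prints: rust, book, async ("the" is a stop word, "101" is numeric)
```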
Logging
Log Levels:
- INFO (verbose mode only): Request logs
- WARN (always): Recoverable errors (404, network timeouts)
- FATAL (always): Unrecoverable errors (invalid URL, disk full)
Log Format:
2026-01-24 02:37:40.351 [INFO]:
Sent request to URL: https://example.com
2026-01-24 02:37:41.123 [WARN]:
Failed to fetch URL: https://example.com/broken-link
Caused by: HTTP 404 Not Found
Examples
- Crawl Documentation Site:
- Crawl with Logging:
- Limit to 2 Levels Deep:
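As invocations (flag names are assumptions, consistent with the Usage section above):

```shell
# Crawl a documentation site, streaming NDJSON to a file
crawn https://docs.example.com > results.ndjson

# Crawl with logging (assumed --log flag)
crawn https://docs.example.com --log crawl.log

# Limit to 2 levels deep (assumed --max-depth flag)
crawn https://docs.example.com --max-depth 2
```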
Limitations
- Same-domain only (no external links, by design)
- No JavaScript rendering (static HTML only)
- No authentication (public pages only)
Notes
- crawn is licensed under the MIT license.
- For specifics about contributing to crawn, see CONTRIBUTING.md.