A web crawler that fetches HTML content, converts it to Markdown, and processes it with mq queries.
Why mq-crawler?
Make web scraping and content extraction effortless with intelligent Markdown conversion:
- HTML to Markdown: Automatically convert crawled HTML pages to clean, structured Markdown
- Ethical Crawling: Built-in robots.txt compliance to crawl responsibly
- mq Integration: Process crawled content with powerful mq queries for filtering and transformation
- JavaScript Support: Browser-based crawling with WebDriver for dynamic content
- High Performance: Parallel processing with configurable concurrency for faster crawling
- Flexible Output: Save to files or stream to stdout
Features
- Web Crawling: Fetch and process web pages with configurable depth and delay
- HTML to Markdown: Automatic conversion with customizable options
- Robots.txt Compliance: Respects robots.txt rules for ethical crawling
- mq Query Integration: Filter and transform crawled content on-the-fly
- Parallel Processing: Concurrent workers for faster crawling
- Depth Control: Limit crawl depth to control scope
- Rate Limiting: Configurable delays to avoid overloading servers
- Statistics: Track crawling progress and results
- Headless Chrome: Built-in headless Chrome for JavaScript-heavy sites (no external server needed)
- WebDriver Support: Use Selenium WebDriver for browser-based crawling
- Domain Filtering: Restrict crawling to specific domains
Installation
Quick Install (Recommended)
The installer will:
- Download the latest mq-crawl binary for your platform
- Install it to ~/.mq/bin/
- Verify the checksum of the downloaded binary
- Update your shell profile to add mq-crawl to your PATH
After installation, restart your terminal or source your shell profile, then verify:
Homebrew
Cargo
From Source
Usage
Basic Crawling
# Crawl a website and output to stdout
# Save crawled content to directory
# Crawl with custom delay (default: 0.5 seconds)
# Limit crawl depth
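To make the depth and delay settings concrete, here is a minimal Python sketch of a depth-limited, rate-limited breadth-first crawl. It is an illustration only, not mq-crawler's implementation: the `LINKS` graph, the `crawl` function, and the choice to count the start URL as depth 0 are all assumptions made for this example. Only the 0.5 second default delay comes from the text above.

```python
import time
from collections import deque

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/deep"],
    "https://example.com/b": [],
    "https://example.com/a/deep": ["https://example.com/a/deeper"],
}


def crawl(start, max_depth, delay=0.5, sleep=time.sleep):
    """Breadth-first crawl that does not follow links past max_depth.

    Assumption: the start URL is depth 0, its links are depth 1, and so on.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)  # a real crawler would fetch and convert here
        if depth < max_depth:
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        if queue:
            sleep(delay)  # pause between requests to avoid overloading the server
    return visited
```

With `max_depth=1` this visits the start page and its two direct links, then stops. Passing a no-op `sleep` makes the sketch convenient to test.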
Processing with mq Queries
# Extract only headings from crawled pages
# Extract all code blocks
# Extract and transform links
Parallel Crawling
# Crawl with 3 concurrent workers
# High-speed crawling with 10 workers
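As a rough picture of what concurrent workers mean, the following Python sketch fans a list of URLs out to a fixed-size thread pool. It does not reflect how mq-crawler schedules its workers: `fetch` is a hypothetical stand-in that performs no network access, and `crawl_parallel` and its `concurrency` parameter are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for an HTTP fetch; returns a fake Markdown body."""
    return f"# Page at {url}"


def crawl_parallel(urls, concurrency=3):
    """Fetch all URLs using at most `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in input order, even though work runs concurrently.
        return dict(zip(urls, pool.map(fetch, urls)))
```

More workers shorten total crawl time only while the target server and network can keep up, so a higher count usually needs to be balanced against the configured delay.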
Custom Robots.txt
# Use custom robots.txt file
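For readers unfamiliar with how robots.txt rules gate a crawl, here is a small Python sketch using the standard library's `urllib.robotparser`. It illustrates the general mechanism, not mq-crawler's own parser. The rules in `ROBOTS_TXT`, the `example-bot` user agent, and the `is_allowed` helper are assumptions for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as they might appear in a custom robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory rules, no network access
    return parser.can_fetch(user_agent, url)
```

A crawler that respects these rules would skip any URL for which `is_allowed` returns `False` before issuing the request.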
HTML to Markdown Options
# Extract scripts as code blocks
# Generate YAML front matter with metadata
# Use page title as H1 heading
# Combine multiple options
Output Formats
# Output as JSON
# Output as text (default)
Domain Filtering
# Crawl only the start URL's domain (default behavior)
# Also crawl docs.example.com and blog.example.com
# The start URL's domain (example.com) is always included automatically
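The scoping rule described above (start domain always included, extra domains added on request) can be sketched in Python as follows. This is an assumption-laden illustration rather than mq-crawler's actual logic: `allowed_hosts` and `in_scope` are hypothetical helpers, and matching hosts exactly (without treating subdomains as implicitly allowed) is a choice made for this example.

```python
from urllib.parse import urlparse


def allowed_hosts(start_url, extra_domains=()):
    """Build the set of hosts in scope; the start URL's host is always included."""
    hosts = {urlparse(start_url).hostname}
    hosts.update(domain.lower() for domain in extra_domains)
    return hosts


def in_scope(url, hosts):
    """Return True if url's host is one of the allowed hosts (exact match)."""
    return urlparse(url).hostname in hosts
```

For example, starting at example.com with docs.example.com and blog.example.com as extra domains keeps links to those three hosts and drops everything else.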
Browser-Based Crawling (Headless Chrome)
For JavaScript-heavy sites, use the built-in headless Chrome without an external server:
# Use built-in headless Chrome (Chrome or Chromium must be installed)
# Specify a custom Chrome/Chromium executable path
Browser-Based Crawling (WebDriver)
Alternatively, use an external Selenium WebDriver server:
# Start Selenium server first
# docker run -d -p 4444:4444 selenium/standalone-chrome
# Crawl with WebDriver
# Custom timeouts
Command Line Options
<URL> The
Development
Building from Source
Running Tests
Support
License
Licensed under the MIT License.