# docrawl
A documentation-focused web crawler that converts sites to clean Markdown while preserving structure and staying polite.
Demo Video • Crates.io • GitHub
## Installation

```sh
# Install from crates.io
cargo install docrawl

# Or build from a checkout of the source repository
cargo install --path .
```
## Quick Start
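The original Quick Start command was lost in extraction; a minimal sketch using flags from the CLI table below, assuming the target URL is passed as a positional argument (the URL itself is a placeholder):

```
# Crawl a documentation site three levels deep into ./docs
docrawl https://docs.example.com --depth 3 --output docs
```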
## Key Features
- Documentation-optimized extraction - Built-in selectors for Docusaurus, MkDocs, Sphinx, Next.js docs
- Clean Markdown output - Preserves code blocks, tables, and formatting with YAML frontmatter metadata
- Path-mirroring structure - Maintains the original URL hierarchy as folders with `index.md` files
- Polite crawling - Respects robots.txt, rate limits, and sitemap hints
- Security-first - Sanitizes content, detects prompt injections, quarantines suspicious pages
- Self-updating - Built-in update mechanism via `docrawl --update`
## Why docrawl?
Unlike general-purpose crawlers, docrawl is purpose-built for documentation:
| Tool | Purpose | Output | Documentation Support |
|---|---|---|---|
| wget/curl | File downloading | Raw HTML | No extraction |
| httrack | Website mirroring | Full HTML site | No Markdown conversion |
| scrapy | Web scraping framework | Custom formats | Requires coding |
| docrawl | Documentation crawler | Clean Markdown | Auto-detects docs frameworks |
docrawl combines crawling, extraction, and conversion in a single tool optimized for technical documentation.
## Library Usage

Add to your `Cargo.toml`:

```toml
[dependencies]
docrawl = "0.1"
tokio = { version = "1", features = ["full"] }
```
Minimal example:
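The original snippet was reduced to a stray `async` keyword during extraction. A pseudocode-style sketch of what a minimal program might look like, assuming a `docrawl::Crawler` builder type with `depth`, `output`, and `run` methods (all hypothetical names, not the crate's documented API):

```rust
// Hypothetical API sketch: `Crawler` and its methods are assumed names,
// not taken from docrawl's documented interface.
use docrawl::Crawler;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let crawler = Crawler::new("https://docs.example.com")? // start URL (placeholder)
        .depth(3)                                           // mirrors --depth
        .output("docs");                                    // mirrors --output
    crawler.run().await?;                                   // crawl and write Markdown
    Ok(())
}
```

Consult the crate's API documentation on docs.rs for the real entry points.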
## CLI Options

| Option | Description | Default |
|---|---|---|
| `--depth <n>` | Maximum crawl depth | 10 |
| `--all` | Crawl entire site | - |
| `--output <dir>` | Output directory | Current dir |
| `--rate <n>` | Requests per second | 2 |
| `--concurrency <n>` | Parallel workers | 8 |
| `--selector <css>` | Custom content selector | Auto-detect |
| `--fast` | Quick mode (no assets, rate=16) | - |
| `--resume` | Continue previous crawl | - |
| `--silence` | Suppress built-in progress/status output | - |
| `--update` | Update to latest version from crates.io | - |
## Configuration

Optional `docrawl.config.json`:
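The JSON body was lost in extraction; a plausible sketch, assuming the config keys mirror the CLI flags (the key names are assumptions, not the documented schema):

```json
{
  "depth": 10,
  "rate": 2,
  "concurrency": 8,
  "selector": "main.content"
}
```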
## Output Structure

```
output/
└── example.com/
    ├── index.md
    ├── guide/
    │   └── index.md
    ├── assets/
    │   └── images/
    └── manifest.json
```
Each Markdown file includes YAML frontmatter:

```yaml
---
title: Page Title
source_url: https://example.com/page
fetched_at:
---
```
## Performance

docrawl is optimized for speed and efficiency:

- Fast HTML to Markdown conversion using `fast_html2md`
- Concurrent processing with configurable worker pools
- Intelligent rate limiting to respect server resources
- Persistent caching to avoid duplicate work
- Memory-efficient streaming for large sites
## Security
docrawl includes built-in security features:
- Content sanitization removes potentially harmful HTML
- Prompt injection detection identifies and quarantines suspicious content
- URL validation prevents malicious redirects
- File system sandboxing restricts output to specified directories
- Rate limiting prevents overwhelming target servers
## Contributing

Contributions are welcome! Feel free to submit a pull request; for major changes, open an issue first to discuss what you would like to change.
## License
MIT