# docrawl

Docs‑focused crawler that converts documentation sites to clean Markdown, mirrors the site’s path structure, and adds useful metadata — while staying polite and secure.
## Features
- Markdown output with YAML frontmatter (`title`, `source_url`, `fetched_at`, optional `security_flags`, `quarantined`)
- Path‑mirroring folder structure with `index.md` per directory
- Robots‑aware crawling (official matcher) and rate limiting
- Optional sitemap seeding from `/sitemap.xml` for broader coverage on `--all`
- Content extraction tuned for common docs frameworks with configurable selectors
- Same‑origin image download and Markdown relinking to local assets
- HTML/URL sanitization and prompt‑injection detection with quarantine
- Run manifest (`manifest.json`) and persistent visited‑URL cache
## Install
Prerequisites: Rust (stable toolchain).

```bash
cargo build --release
```

Binary: `target/release/docrawl`
## Library Usage (Embed in Your Tool)
You can use docrawl as a library and ship a single binary for your own CLI or service.
Add to `Cargo.toml` (using the Git URL until it’s published on crates.io):
```toml
[dependencies]
docrawl = { git = "https://github.com/neur0map/docrawl" }
url = "2"
tokio = { version = "1", features = ["full"] }
```
Minimal programmatic crawl:
```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        user_agent: format!("mytool/{}", env!("CARGO_PKG_VERSION")),
        max_depth: Some(2),
        rate_limit_per_sec: 8,
        follow_sitemaps: false,
        concurrency: 8,
        timeout: None,
        resume: false,
        config: docrawl::Config { host_only: true, skip_assets: true, ..Default::default() },
    };
    let stats = docrawl::crawl(cfg).await?;
    eprintln!("pages={} assets={}", stats.pages, stats.assets);
    Ok(())
}
```
Notes:

- `docrawl::crawl` uses the default on‑disk layout (same as the CLI) and returns a simple `Stats` value.
- You own the UX: wire your own flags to `CrawlConfig` and `Config` to keep one binary (see the sketch below).
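
As a rough illustration of that wiring, here is a minimal sketch that maps a few hypothetical command‑line arguments onto the fields shown in the example above. The flag names (`--depth`, `--out`, `--host-only`) and the hand‑rolled parsing are assumptions for illustration, not part of docrawl’s API.

```rust
use std::path::PathBuf;

// Sketch only: build a CrawlConfig from a few hypothetical flags.
// In a real tool you would likely use a proper argument parser, but the
// mapping onto CrawlConfig/Config stays the same.
fn config_from_args(
    start_url: &str,
    args: &[String],
) -> Result<docrawl::CrawlConfig, Box<dyn std::error::Error>> {
    let mut max_depth = Some(10);            // mirror the CLI's default depth
    let mut output_dir = PathBuf::from("."); // output root defaults to CWD
    let mut host_only = false;

    let mut it = args.iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--depth" => max_depth = Some(it.next().ok_or("--depth needs a value")?.parse()?),
            "--out" => output_dir = PathBuf::from(it.next().ok_or("--out needs a value")?),
            "--host-only" => host_only = true,
            _ => {} // ignore anything we don't recognize in this sketch
        }
    }

    Ok(docrawl::CrawlConfig {
        base_url: url::Url::parse(start_url)?,
        output_dir,
        user_agent: format!("mytool/{}", env!("CARGO_PKG_VERSION")),
        max_depth,
        rate_limit_per_sec: 2,
        follow_sitemaps: false,
        concurrency: 8,
        timeout: None,
        resume: false,
        config: docrawl::Config { host_only, ..Default::default() },
    })
}
```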
## Quick Start
docrawl "https://example.com/docs" # default depth=10
docrawl "https://example.com" --all # full same‑origin site
docrawl "https://example.com" --depth 2 # start + links + their links
docrawl "https://example.com" -o ./export # choose output root
docrawl "https://example.com/docs" --fast # quick smoke test (no assets, higher rate)
## CLI
- `url` (positional): Absolute starting URL. Only same‑origin links are followed by default.
- `--all`: Unlimited depth. Crawls the whole same‑origin site (still honors robots).
- `--depth <n>`: Max link depth from the start page. Default is `10` when `--all` isn’t set.
- `-o, --output <path>`: Output root; a host‑named folder is created inside.
- `--concurrency <n>`: Number of parallel fetch workers (bounded). Default: `8`.
- `--rate <n>`: Global requests per second. Default: `2`.
- `--timeout-minutes <n>`: Graceful shutdown after N minutes. The crawler stops scheduling new work, drains active tasks, writes the manifest, and exits.
- `--resume`: Resume from the previous run using the persisted frontier (see Cache). Useful after a timeout or manual stop.
- `--host-only`: Restrict scope to the exact origin (scheme+host+port).
- `--external-assets`: Allow downloading images from other domains.
- `--allow-svg`: Permit saving SVG images.
- `--no-assets`: Skip image downloads (fastest).
- `--max-pages <n>`: Stop after writing this many pages.
- `--selector <CSS>`: Preferred CSS selector for content (repeatable).
- `--exclude <REGEX>`: Exclude URLs matching the regex (repeatable).
- `--fast`: Preset for quick crawls: raises `--rate`/`--concurrency` to at least 16 and implies `--no-assets`.
Depth is link‑hop depth (not URL path depth). Hub‑style homepages can expose many pages even at small depths.
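
For example, a tightly scoped crawl of a docs subtree might combine several of these flags; the URL, selector, and exclude patterns below are placeholders:

```bash
docrawl "https://example.com/docs" \
  --depth 3 \
  --host-only \
  --max-pages 200 \
  --selector ".theme-doc-markdown" \
  --exclude '\.pdf$' --exclude '/changelog/' \
  --no-assets
```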
## Configuration (optional)
Place `docrawl.config.json` in the working directory or in the output root. All fields are optional.
```json
{
  "host_only": true,
  "external_assets": false,
  "allow_svg": false,
  "max_pages": 500,
  "selectors": [".theme-doc-markdown", ".article-content"],
  "exclude_patterns": ["\\.pdf$", "/private/"]
}
```
- `host_only`: Limit scope to the exact origin (scheme+host+port). Default: same‑domain allowed.
- `external_assets`: Allow images from outside the site’s origin/domain. Default: `false`.
- `allow_svg`: Permit saving SVG images. Default: `false`.
- `max_pages`: Stop after writing this many pages.
- `selectors`: Preferred CSS selectors for the main content; tried before built‑ins.
- `exclude_patterns`: Regex patterns; if a canonical URL matches any, it’s skipped.
## Behavior
- Output root defaults to the CWD; inside it a folder named after the host is created (e.g., `./out/example.com/`).
- Depth 0: only the start page. `--all` overrides `--depth`.
- Scope: same‑origin by default; exact origin if `host_only` is true.
- Robots: evaluated with the `robotstxt` matcher (agent‑specific allow/deny).
- Sitemaps: when `--all` is set, seeds from `/sitemap.xml` to front‑load coverage (non‑recursive for nested indexes).
- Extraction: tries common docs containers (Docusaurus, Sphinx, MkDocs) and falls back to `<main>`, `<article>`, `<body>`; you can supply `selectors` to override.
- Images: same‑origin images are downloaded into the site folder and Markdown links are rewritten to local relative paths. External or unsupported types are skipped unless enabled in config.
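
To make the path mirroring concrete, a page URL would typically map onto disk roughly like this (the exact layout can vary by site):

```text
https://example.com/docs/guide/   ->   <output_root>/example.com/docs/guide/index.md
```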
## Security
- Sanitization (pre‑conversion): removes active/embedded elements (`script`, `iframe`, `object`, `form`, `link`, `meta`, etc.) and normalizes and neutralizes risky URLs.
- Markdown scan (post‑conversion): detects likely prompt‑injection phrases and risky patterns; flags are recorded in frontmatter under `security_flags`.
- Quarantine: if any flags are detected, the page file is still created but the body is replaced with a short notice explaining why it was skipped; frontmatter includes `quarantined: true`.
- Asset safety: only safe image types (`png`, `jpeg`/`jpg`, `gif`, `webp`, `bmp`) are downloaded by default. SVG and `data:` images are disabled unless allowed via config.
Example frontmatter with flags:
```yaml
---
title: Example
source_url: https://example.com/page
fetched_at: 2025-01-01T00:00:00Z
quarantined: true
security_flags:
  - llm_ignore_previous
  - javascript_link
---
```
## Output
```text
<output_root>/example.com/
├── index.md
├── guide/
│   └── index.md
├── assets/          # images saved under site paths
└── manifest.json
```
- Markdown files include YAML frontmatter. Images are referenced using relative paths.
- `manifest.json` records the id, url, relative path, title, quarantined status, `security_flags`, and timestamps for each saved page.
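
As an illustration only, a manifest entry might look roughly like the following; the exact field names and overall shape are assumptions based on the list above, so check a real `manifest.json` from your own run:

```json
{
  "id": 1,
  "url": "https://example.com/guide/",
  "path": "guide/index.md",
  "title": "Guide",
  "quarantined": false,
  "security_flags": [],
  "fetched_at": "2025-01-01T00:00:00Z"
}
```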
## Cache
- Location: `<output_root>/.docrawl_cache` (sled key–value store).
- Purpose: persist the visited set across runs so re‑runs don’t re‑fetch unchanged URLs.
- If the cache cannot be opened (e.g., read‑only filesystem), crawling proceeds with in‑memory dedupe only; Markdown output still works.
- Frontier persistence for resume: the crawler stores pending URLs in the cache. Use `--resume` to continue from where you left off (e.g., after `--timeout-minutes` or Ctrl‑C). If you don’t use `--resume`, the persisted frontier is cleared at the start and new seeds are used.
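
A typical interrupt‑and‑resume sequence might look like this (the URL and output path are placeholders):

```bash
# First pass: stop gracefully after 30 minutes; pending URLs stay in the cache
docrawl "https://example.com/docs" --all --timeout-minutes 30 -o ./export

# Later: continue from the persisted frontier instead of starting over
docrawl "https://example.com/docs" --all --resume -o ./export
```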
Proxy & macOS Notes
- The HTTP client honors `HTTP_PROXY`/`HTTPS_PROXY` if present. If neither is set, system proxy detection is disabled to avoid rare macOS System Configuration panics.
- If you need a proxy, export it before running:

```bash
export HTTPS_PROXY=http://user:pass@host:port
export HTTP_PROXY=http://user:pass@host:port
```
## Development and Testing
- Build: `cargo build` or `cargo build --release`.
- Try a small depth first: `docrawl "https://example.com/docs" --depth 2`.
- Verify robots compliance: disallowed paths should be skipped.
- Tune extraction with `selectors` if needed for your site.
- Confirm quarantined pages render a notice and include `security_flags`.
## Roadmap
- Conditional GETs (ETag/Last‑Modified)
- Recursive sitemap index traversal
- CLI flags mirroring config fields
- Site‑specific selector presets