# docrawl

Docs‑focused crawler that converts documentation sites to clean Markdown, mirrors the site’s path structure, and adds useful metadata — while staying polite and secure.
## Features
- Markdown output with YAML frontmatter (`title`, `source_url`, `fetched_at`, optional `security_flags`, `quarantined`)
- Path‑mirroring folder structure with `index.md` per directory
- Robots‑aware crawling (official matcher) and rate limiting
- Optional sitemap seeding from `/sitemap.xml` for broader coverage on `--all`
- Content extraction tuned for common docs frameworks with configurable selectors
- Same‑origin image download and Markdown relinking to local assets
- HTML/URL sanitization and prompt‑injection detection with quarantine
- Run manifest (`manifest.json`) and persistent visited‑URL cache
## Install
Prerequisites: Rust (stable toolchain).

```bash
cargo build --release
```

Binary: `target/release/docrawl`
## Library Usage (Embed in Your Tool)
You can use docrawl as a library and ship a single binary for your own CLI or service.
Add to `Cargo.toml` (using the Git URL until it’s published on crates.io):
```toml
[dependencies]
docrawl = { git = "https://github.com/neur0map/docrawl" }
url = "2"
tokio = { version = "1", features = ["full"] }
```
Minimal programmatic crawl:
```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        user_agent: format!("mytool/{}", env!("CARGO_PKG_VERSION")),
        max_depth: Some(2),
        rate_limit_per_sec: 8,
        follow_sitemaps: false,
        concurrency: 8,
        timeout: None,
        resume: false,
        config: docrawl::Config { host_only: true, skip_assets: true, ..Default::default() },
    };
    let stats = docrawl::crawl(cfg).await?;
    eprintln!("pages={} assets={}", stats.pages, stats.assets);
    Ok(())
}
```
Notes:

- `docrawl::crawl` uses the default on‑disk layout (same as the CLI) and returns a simple `Stats` value.
- You own the UX: wire your own flags to `CrawlConfig` and `Config` to keep one binary (see the sketch below).
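
As a rough illustration of that wiring, here is a minimal sketch that maps a few hypothetical command‑line arguments onto the fields shown in the example above. The flag names (`--depth`, `--out`, `--host-only`) and the hand‑rolled parsing are assumptions for illustration, not part of docrawl’s API.

```rust
use std::path::PathBuf;

// Sketch only: build a CrawlConfig from a few hypothetical flags.
// In a real tool you would likely use a proper argument parser, but the
// mapping onto CrawlConfig/Config stays the same.
fn config_from_args(
    start_url: &str,
    args: &[String],
) -> Result<docrawl::CrawlConfig, Box<dyn std::error::Error>> {
    let mut max_depth = Some(10);            // mirror the CLI's default depth
    let mut output_dir = PathBuf::from("."); // output root defaults to CWD
    let mut host_only = false;

    let mut it = args.iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--depth" => max_depth = Some(it.next().ok_or("--depth needs a value")?.parse()?),
            "--out" => output_dir = PathBuf::from(it.next().ok_or("--out needs a value")?),
            "--host-only" => host_only = true,
            _ => {} // ignore anything we don't recognize in this sketch
        }
    }

    Ok(docrawl::CrawlConfig {
        base_url: url::Url::parse(start_url)?,
        output_dir,
        user_agent: format!("mytool/{}", env!("CARGO_PKG_VERSION")),
        max_depth,
        rate_limit_per_sec: 2,
        follow_sitemaps: false,
        concurrency: 8,
        timeout: None,
        resume: false,
        config: docrawl::Config { host_only, ..Default::default() },
    })
}
```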
## Quick Start
docrawl "https://example.com/docs" # default depth=10
docrawl "https://example.com" --all # full same‑origin site
docrawl "https://example.com" --depth 2 # start + links + their links
docrawl "https://example.com" -o ./export # choose output root
docrawl "https://example.com/docs" --fast # quick smoke test (no assets, higher rate)
## CLI
- `url` (positional): Absolute starting URL. Only same‑origin links are followed by default.
- `--all`: Unlimited depth. Crawls the whole same‑origin site (still honors robots).
- `--depth <n>`: Max link depth from the start page. Default is `10` when `--all` isn’t set.
- `-o, --output <path>`: Output root; a host‑named folder is created inside.
- `--concurrency <n>`: Number of parallel fetch workers (bounded). Default: `8`.
- `--rate <n>`: Global requests per second. Default: `2`.
- `--timeout-minutes <n>`: Graceful shutdown after N minutes. The crawler stops scheduling new work, drains active tasks, writes the manifest, and exits.
- `--resume`: Resume from the previous run using the persisted frontier (see Cache). Useful after a timeout or manual stop.
- `--host-only`: Restrict scope to the exact origin (scheme+host+port).
- `--external-assets`: Allow downloading images from other domains.
- `--allow-svg`: Permit saving SVG images.
- `--no-assets`: Skip image downloads (fastest).
- `--max-pages <n>`: Stop after writing this many pages.
- `--selector <CSS>`: Preferred CSS selector for content (repeatable).
- `--exclude <REGEX>`: Exclude URLs matching the regex (repeatable).
- `--fast`: Preset for quick crawls: raises `--rate`/`--concurrency` to at least 16 and implies `--no-assets`.
Depth is link‑hop depth (not URL path depth). Hub‑style homepages can expose many pages even at small depths.
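
For example, a tightly scoped crawl of a docs subtree might combine several of these flags; the URL, selector, and exclude patterns below are placeholders:

```bash
docrawl "https://example.com/docs" \
  --depth 3 \
  --host-only \
  --max-pages 200 \
  --selector ".theme-doc-markdown" \
  --exclude '\.pdf$' --exclude '/changelog/' \
  --no-assets
```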
## Configuration (optional)
Place `docrawl.config.json` in the working directory or in the output root. All fields are optional.
```json
{
  "host_only": true,
  "external_assets": false,
  "allow_svg": false,
  "max_pages": 500,
  "selectors": [".theme-doc-markdown", ".article-content"],
  "exclude_patterns": ["\\.pdf$", "/private/"]
}
```
- `host_only`: Limit scope to the exact origin (scheme+host+port). Default: same‑domain allowed.
- `external_assets`: Allow images from outside the site’s origin/domain. Default: `false`.
- `allow_svg`: Permit saving SVG images. Default: `false`.
- `max_pages`: Stop after writing this many pages.
- `selectors`: Preferred CSS selectors for the main content; tried before built‑ins.
- `exclude_patterns`: Regex patterns; if a canonical URL matches any, it’s skipped.
## Behavior
- Output root defaults to the CWD; inside it a folder named after the host is created (e.g., `./out/example.com/`).
- Depth 0: only the start page. `--all` overrides `--depth`.
- Scope: same‑origin by default; exact origin if `host_only` is true.
- Robots: evaluated with the `robotstxt` matcher (agent‑specific allow/deny).
- Sitemaps: when `--all` is set, seeds from `/sitemap.xml` to front‑load coverage (non‑recursive for nested indexes).
- Extraction: tries common docs containers (Docusaurus, Sphinx, MkDocs) and falls back to `<main>`, `<article>`, `<body>`; you can supply `selectors` to override.
- Images: same‑origin images are downloaded into the site folder and Markdown links are rewritten to local relative paths. External or unsupported types are skipped unless enabled in config.
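
To make the path mirroring concrete, a page URL would typically map onto disk roughly like this (the exact layout can vary by site):

```text
https://example.com/docs/guide/   ->   <output_root>/example.com/docs/guide/index.md
```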
## Security
- Sanitization (pre‑conversion): removes active/embedded elements (`script`, `iframe`, `object`, `form`, `link`, `meta`, etc.) and normalizes and neutralizes risky URLs.
- Markdown scan (post‑conversion): detects likely prompt‑injection phrases and risky patterns; flags are recorded in frontmatter under `security_flags`.
- Quarantine: if any flags are detected, the page file is still created but the body is replaced with a short notice explaining why it was skipped; frontmatter includes `quarantined: true`.
- Asset safety: only safe image types (`png`, `jpeg`/`jpg`, `gif`, `webp`, `bmp`) are downloaded by default. SVG and `data:` images are disabled unless allowed via config.
Example frontmatter with flags:
```yaml
---
title: Example
source_url: https://example.com/page
fetched_at: 2025-01-01T00:00:00Z
quarantined: true
security_flags:
  - llm_ignore_previous
  - javascript_link
---
```
## Output
```text
<output_root>/example.com/
├── index.md
├── guide/
│   └── index.md
├── assets/          # images saved under site paths
└── manifest.json
```
- Markdown files include YAML frontmatter. Images are referenced using relative paths.
- `manifest.json` records the id, url, relative path, title, quarantined status, `security_flags`, and timestamps for each saved page.
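
As an illustration only, a manifest entry might look roughly like the following; the exact field names and overall shape are assumptions based on the list above, so check a real `manifest.json` from your own run:

```json
{
  "id": 1,
  "url": "https://example.com/guide/",
  "path": "guide/index.md",
  "title": "Guide",
  "quarantined": false,
  "security_flags": [],
  "fetched_at": "2025-01-01T00:00:00Z"
}
```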
## Cache
- Location: `<output_root>/.docrawl_cache` (sled key–value store).
- Purpose: persist the visited set across runs so re‑runs don’t re‑fetch unchanged URLs.
- If the cache cannot be opened (e.g., read‑only filesystem), crawling proceeds with in‑memory dedupe only; Markdown output still works.
- Frontier persistence for resume: the crawler stores pending URLs in the cache. Use `--resume` to continue from where you left off (e.g., after `--timeout-minutes` or Ctrl‑C). If you don’t use `--resume`, the persisted frontier is cleared at the start and new seeds are used.
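
A typical interrupt‑and‑resume sequence might look like this (the URL and output path are placeholders):

```bash
# First pass: stop gracefully after 30 minutes; pending URLs stay in the cache
docrawl "https://example.com/docs" --all --timeout-minutes 30 -o ./export

# Later: continue from the persisted frontier instead of starting over
docrawl "https://example.com/docs" --all --resume -o ./export
```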
Proxy & macOS Notes
- The HTTP client honors `HTTP_PROXY`/`HTTPS_PROXY` if present. If neither is set, system proxy detection is disabled to avoid rare macOS System Configuration panics.
- If you need a proxy, export it before running:

```bash
export HTTPS_PROXY=http://user:pass@host:port
export HTTP_PROXY=http://user:pass@host:port
```
## Development and Testing
- Build: `cargo build` or `cargo build --release`.
- Try a small depth first: `docrawl "https://example.com/docs" --depth 2`.
- Verify robots compliance: disallowed paths should be skipped.
- Tune extraction with `selectors` if needed for your site.
- Confirm quarantined pages render a notice and include `security_flags`.
## Roadmap
- Conditional GETs (ETag/Last‑Modified)
- Recursive sitemap index traversal
- CLI flags mirroring config fields
- Site‑specific selector presets