# docrawl Code Audit
## Overview
**docrawl** is a documentation-focused web crawler written in Rust that converts websites to clean Markdown with YAML frontmatter. It ships as both a library and a CLI. It auto-detects doc frameworks (Docusaurus, MkDocs, Sphinx), mirrors URL paths to a local filesystem structure, respects `robots.txt`, includes prompt-injection detection for LLM safety, and supports resumable crawls via a persistent `sled` cache.
**Architecture:** Async worker pool (Tokio + semaphores), rate-limited (Governor), with an `mpsc` channel feeding discovered URLs back to the main loop. Modular: `crawler.rs` (engine), `extract.rs` (HTML parsing), `security.rs` (sanitization), `sink.rs` (storage trait), `cache.rs` (persistent dedup), `robots.rs`, `sitemap.rs`, `manifest.rs`, `save.rs`, `util.rs`.
---
## Dead Code & Unused Code
### 1. `clean_with_soup()` — completely ineffective (HIGH)
**Location:** `src/extract.rs:96-104`
```rust
fn clean_with_soup(html: &str) -> String {
    let soup = Soup::new(html);
    for tag in ["nav", "header", "footer", "aside"].iter() {
        let _maybe = soup.tag(*tag).find(); // result thrown away
    }
    html.to_string() // always returns unchanged input
}
```
**Intended goal:** Strip non-content elements (`<nav>`, `<header>`, `<footer>`, `<aside>`) from the extracted HTML *before* markdown conversion. The idea was to use the `soup` crate to find these tags and remove them from the DOM so the final markdown wouldn't contain navigation bars, site headers, footers, or sidebars. This would produce cleaner output when the CSS-selector-based extraction (`extract_main_html`) grabbed a container that still included layout chrome.
**Why it's dead:** The `soup` crate's `find()` returns an `Option<Handle>` but does not support in-place removal or mutation of the parsed DOM. The author likely discovered this limitation, assigned the result to `_maybe` to silence the compiler, and added the `html.to_string()` fallback — but never circled back to implement it with a different approach (e.g., `lol_html` streaming removal, which is already used elsewhere in the project).
**Cascade:** This also makes the entire `soup` crate (`Cargo.toml:36`, `extract.rs:3`) dead weight.
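A minimal sketch of what the intended stripping could look like with `lol_html` (already a dependency elsewhere in the project, per the suggestion above); the function name `strip_chrome` is illustrative:
```rust
use lol_html::{element, rewrite_str, RewriteStrSettings};

// Remove layout chrome before markdown conversion; fall back to the
// original HTML if rewriting fails.
fn strip_chrome(html: &str) -> String {
    rewrite_str(
        html,
        RewriteStrSettings {
            element_content_handlers: vec![element!("nav, header, footer, aside", |el| {
                el.remove(); // drops the element and everything inside it
                Ok(())
            })],
            ..RewriteStrSettings::default()
        },
    )
    .unwrap_or_else(|_| html.to_string())
}
```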
---
### 2. `load_config()` — never called (MEDIUM)
**Location:** `src/config.rs:25-40`
```rust
pub fn load_config(working_dir: &Path, output_root: &Path) -> Config {
    let candidates = [
        working_dir.join("docrawl.config.json"),
        output_root.join("docrawl.config.json"),
    ];
    for path in candidates.iter() {
        if path.exists() {
            if let Ok(txt) = fs::read_to_string(path) {
                if let Ok(cfg) = serde_json::from_str::<Config>(&txt) {
                    return cfg;
                }
            }
        }
    }
    Config::default()
}
```
**Intended goal:** Allow users to place a `docrawl.config.json` in their working directory or output root to configure crawls without CLI flags. This would enable persistent per-project configuration — you'd drop a JSON file in your project and `docrawl` would pick up selectors, exclude patterns, host-only mode, etc. automatically. Two candidate paths are checked: the current working directory (for project-local config) and the output root (for config that lives alongside the crawl output).
**Why it's dead:** The CLI (`main.rs:66-87`) builds `Config` entirely from parsed CLI arguments and never calls `load_config()`. The function was likely written as part of an early design where file-based config was the primary interface, but once the CLI args were fleshed out, it became redundant and was never wired in. It's still exported as `pub`, suggesting it was intended to be part of the library's public API for programmatic users, but no code path invokes it.
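If it were wired in, the natural call site would be `main.rs`: load the file first, then let explicit CLI flags override individual fields. A rough sketch only; the `cli` struct and `max_depth` field are illustrative names, not the project's actual ones:
```rust
// Hypothetical wiring: file-based config as the baseline, CLI flags win.
let mut config = load_config(&std::env::current_dir()?, &output_root);
if let Some(depth) = cli.max_depth {
    config.max_depth = depth; // explicit flag overrides the JSON value
}
```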
---
### 3. `self_update` crate — never used (LOW)
**Location:** `Cargo.toml:68`
```toml
self_update = "0.41"
```
**Intended goal:** Use the `self_update` crate's structured self-update mechanism to check GitHub releases (or crates.io), download the latest binary, and replace the running executable — all without requiring `cargo` to be installed. The `self_update` crate provides a builder API for this (`self_update::backends::github::Update::configure()...`) that handles platform detection, archive extraction, and binary replacement.
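For reference, the builder-based flow the crate provides looks roughly like this (adapted from the `self_update` README; the repo owner is a placeholder since the actual repository isn't shown here):
```rust
use self_update::cargo_crate_version;

fn self_update_binary() -> Result<(), Box<dyn std::error::Error>> {
    let status = self_update::backends::github::Update::configure()
        .repo_owner("OWNER") // placeholder
        .repo_name("docrawl")
        .bin_name("docrawl")
        .show_download_progress(true)
        .current_version(cargo_crate_version!())
        .build()?
        .update()?;
    println!("updated to {}", status.version());
    Ok(())
}
```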
**Why it's dead:** The `--update` flag implementation in `main.rs:27-51` shells out to `cargo install docrawl --force` instead:
```rust
let output = Command::new("cargo")
    .args(&["install", "docrawl", "--force"])
    .output();
```
This is simpler to implement but requires the user to have `cargo` installed, which is precisely the dependency `self_update` exists to remove. The crate was likely added with the intention of implementing proper binary self-update, but the quick `cargo install` approach was used as a stopgap and `self_update` was never integrated. It remains as a compiled (but unused) dependency, adding to build time and binary size.
---
## Improvements
### Architecture / Correctness
#### 1. Massive closure in `crawl()` — extract the worker logic (HIGH)
**Location:** `src/crawler.rs:140-474`
The `enqueue` closure is ~330 lines, clones ~16 `Arc`s, and contains the entire page processing pipeline inline (fetch, parse, sanitize, download assets, save, enqueue links). This makes the code hard to read, test, and maintain. Extract the inner async block into a standalone `async fn process_url(...)` that takes a shared-state struct.
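One possible shape, as a sketch only; the field names and types are assumptions about what the closure currently captures, and the point is that the per-URL pipeline becomes a named, unit-testable function:
```rust
use std::sync::Arc;
use std::sync::atomic::AtomicUsize;
use tokio::sync::Notify;
use url::Url;

// Shared state the closure currently captures via ~16 Arc clones.
struct CrawlCtx {
    client: reqwest::Client,
    pending: Arc<AtomicUsize>,
    notify: Arc<Notify>,
    // ...cache, rate limiter, visited set, sink, config, etc.
}

// The fetch -> parse -> sanitize -> save -> enqueue pipeline moves here.
async fn process_url(ctx: Arc<CrawlCtx>, url: Url, depth: usize) {
    // body extracted from the current closure
}
```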
#### 2. Repeated "decrement pending + notify" pattern (MEDIUM)
**Location:** `src/crawler.rs` — 8 occurrences
The same decrement-and-notify block appears **8 times** across early-return paths:
```rust
pending_ctr.fetch_sub(1, Ordering::SeqCst);
if pending_ctr.load(Ordering::SeqCst) == 0 {
    notify.notify_waiters();
}
return;
```
Locations: ~211-216, ~222-228, ~232-237, ~243-248, ~276-280, ~285-289, ~295-300, ~325-330.
Extract it into a helper function, or use a drop-guard struct that decrements and notifies on drop, so the cleanup cannot be missed on any early return (sketched below).
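A sketch of the drop-guard variant: constructing one guard at the top of the worker covers every early return, and using the value returned by `fetch_sub` also closes the small window between the decrement and the separate `load` in the current pattern.
```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::Notify;

struct PendingGuard {
    pending: Arc<AtomicUsize>,
    notify: Arc<Notify>,
}

impl Drop for PendingGuard {
    fn drop(&mut self) {
        // fetch_sub returns the previous value; 1 means this was the last task.
        if self.pending.fetch_sub(1, Ordering::SeqCst) == 1 {
            self.notify.notify_waiters();
        }
    }
}
```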
#### 3. Repeated seed-enqueue code (MEDIUM)
**Location:** `src/crawler.rs:477-514`
The same `for (u, d) in seeds { enqueue(u, d); }` block appears **4 times** in the resume logic across different branches. Restructure the branching to compute the seed list first, then enqueue once.
#### 4. `is_same_host()` ignores port (LOW)
**Location:** `src/util.rs:27-29`
Compares `domain()` and `scheme()` but not port. A site on `:8080` would be treated as same-host as `:443`. The `host_only` mode in `within_scope()` correctly checks port via `port_or_known_default()`, but the default mode doesn't.
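A sketch of the port-aware comparison, mirroring what `within_scope()` already does; whether to compare `domain()` or `host_str()` is left as in the existing code:
```rust
use url::Url;

fn is_same_host(a: &Url, b: &Url) -> bool {
    a.scheme() == b.scheme()
        && a.host_str() == b.host_str()
        && a.port_or_known_default() == b.port_or_known_default()
}
```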
#### 5. Double SVG check in `fetch_asset_checked()` (LOW)
**Location:** `src/crawler.rs:664-676`
SVG is checked twice with slightly different logic. The first block uses `eq_ignore_ascii_case` while the second uses `==`. These could be unified into a single check.
---
### Performance
#### 6. Regex recompilation on every call (MEDIUM)
**Location:** `src/security.rs:65-94`
`sanitize_markdown()` compiles 7 regexes with `Regex::new().unwrap()` on every invocation. These should be `std::sync::LazyLock` statics (or `lazy_static!`). The `re_fence` regex on line 103 is also a duplicate of the pattern on line 79, compiled a second time.
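The usual shape with `std::sync::LazyLock` (stable since Rust 1.80); the pattern below is a placeholder, not one of the project's actual regexes:
```rust
use std::sync::LazyLock;
use regex::Regex;

// Compiled once on first use, shared by every sanitize_markdown() call.
static RE_INJECTION: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"(?i)ignore (all )?previous instructions").expect("valid regex")
});

fn flags_injection(md: &str) -> bool {
    RE_INJECTION.is_match(md)
}
```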
#### 7. Regex recompilation in `rewrite_md_images()` (MEDIUM)
**Location:** `src/crawler.rs:691-705`
A new `Regex` is compiled for every image URL on every page. For pages with many images this is wasteful. Consider pre-building a single multi-pattern regex or using simple string replacement.
#### 8. `vec![]` where a static array suffices (LOW)
**Location:** `src/security.rs:65`
The patterns array is heap-allocated every call. Use `[...]` instead of `vec![...]`.
#### 9. `Ordering::SeqCst` everywhere (LOW)
**Location:** `src/crawler.rs` throughout
All atomics use `SeqCst`, the most expensive ordering. Most of these could use `Relaxed` or `AcqRel` since there's no cross-variable ordering dependency being relied upon.
---
### Robustness
#### 10. `escape_yaml()` is incomplete (MEDIUM)
**Location:** `src/save.rs:35-41`
Only quotes strings containing whitespace. YAML also requires quoting for strings containing `:`, `#`, `[`, `]`, `{`, `}`, `&`, `*`, `!`, `?`, `|`, `>`, `'`, `"`, `%`, `@`, and backticks, as well as strings YAML would otherwise parse as booleans, nulls, or numbers (`true`, `null`, `1.0`). A page titled `true` or `key: value` would produce broken YAML frontmatter.
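A more defensive version might look like the sketch below, which errs on the side of quoting (always legal in YAML); serializing the whole frontmatter with `serde_yaml` would be the more robust route.
```rust
fn escape_yaml(s: &str) -> String {
    let needs_quotes = s.is_empty()
        || s.chars()
            .any(|c| c.is_whitespace() || ":#[]{}&*!?|>'\"%@`,".contains(c))
        || matches!(s, "true" | "false" | "yes" | "no" | "null" | "~")
        || s.parse::<f64>().is_ok();
    if needs_quotes {
        // Double-quoted YAML scalar: escape backslashes and quotes.
        format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
    } else {
        s.to_string()
    }
}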
#### 11. No deduplication of extracted links (LOW)
**Location:** `src/extract.rs:106-131`
`extract_links()` returns all `<a>` href values without deduplication. A page with 50 links to the same URL will enqueue it 50 times (the visited set catches it, but the channel + task spawn overhead is wasted).
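An order-preserving dedup before the links hit the channel is cheap; the sketch assumes the links come back as `String`s:
```rust
use std::collections::HashSet;

fn dedup_links(links: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    // insert() returns false for duplicates, so they are filtered out in order.
    links.into_iter().filter(|l| seen.insert(l.clone())).collect()
}
```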
#### 12. No body size limit on fetch (MEDIUM)
**Location:** `src/crawler.rs:321`
`resp.text().await` reads the entire response body with no size limit. A malicious or accidental multi-GB HTML page would exhaust memory. Consider checking the `content-length` header or streaming with a cap.
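A sketch of a capped fetch using reqwest's streaming API; `MAX_BODY_BYTES` is an illustrative limit, not an existing constant:
```rust
const MAX_BODY_BYTES: u64 = 10 * 1024 * 1024; // 10 MiB, illustrative

async fn fetch_capped(mut resp: reqwest::Response) -> Result<String, Box<dyn std::error::Error>> {
    // Reject early if the server declares an oversized body.
    if resp.content_length().unwrap_or(0) > MAX_BODY_BYTES {
        return Err("response body exceeds size limit".into());
    }
    // Stream chunks and enforce the cap even for chunked/undeclared lengths.
    let mut buf: Vec<u8> = Vec::new();
    while let Some(chunk) = resp.chunk().await? {
        if (buf.len() + chunk.len()) as u64 > MAX_BODY_BYTES {
            return Err("response body exceeds size limit".into());
        }
        buf.extend_from_slice(&chunk);
    }
    Ok(String::from_utf8_lossy(&buf).into_owned())
}
```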
#### 13. `unwrap()` on semaphore acquire (LOW)
**Location:** `src/crawler.rs:200`
`semaphore.acquire().await.unwrap()` will panic if the semaphore is closed. While unlikely in normal operation, graceful handling would be more robust.
#### 14. `NonZeroU32::unwrap()` on user input (LOW)
**Location:** `src/crawler.rs:54`
If `rate_limit_per_sec` is 0, this panics. The CLI defaults prevent this, but the library API doesn't validate — a programmatic caller could trigger it.
---
### Code Quality
#### 15. Duplicate `security_flags` check in save (LOW)
**Location:** `src/save.rs:21-29`
`!security_flags.is_empty()` is checked twice in succession (once for writing `quarantined: true`, once for the flags list). Combine into a single `if` block.
#### 16. `html2md` alias mismatch (LOW)
**Location:** `src/crawler.rs:350` vs `Cargo.toml:35`
Code calls `html2md::rewrite_html` but the dependency is named `fast_html2md`. The crate re-exports under the `html2md` name, but this is confusing for readers who grep for the dependency.
---
## Summary
| # | Type | Issue | Severity | Location |
|---|------|-------|----------|----------|
| 1 | Dead code | `clean_with_soup()` does nothing | HIGH | `extract.rs:96-104` |
| 2 | Dead code | `load_config()` never called | MEDIUM | `config.rs:25-40` |
| 3 | Dead dep | `soup` crate only used by dead fn | MEDIUM | `Cargo.toml:36` |
| 4 | Dead dep | `self_update` crate never used | LOW | `Cargo.toml:68` |
| 5 | Improve | Extract 330-line closure into fn | HIGH | `crawler.rs:140-474` |
| 6 | Improve | Compile regexes once (lazy static) | MEDIUM | `security.rs:65-94` |
| 7 | Improve | Incomplete YAML escaping | MEDIUM | `save.rs:35-41` |
| 8 | Improve | No body size limit on fetch | MEDIUM | `crawler.rs:321` |
| 9 | Improve | 8x repeated pending+notify block | MEDIUM | `crawler.rs` (8 sites) |
| 10 | Improve | 4x repeated seed-enqueue block | MEDIUM | `crawler.rs:477-514` |
| 11 | Improve | Regex recompilation per call | MEDIUM | `security.rs`, `crawler.rs` |
| 12 | Improve | `is_same_host` ignores port | LOW | `util.rs:27-29` |
| 13 | Improve | No link dedup before enqueue | LOW | `extract.rs:106-131` |
| 14 | Improve | `SeqCst` on all atomics | LOW | `crawler.rs` throughout |
| 15 | Improve | `NonZeroU32::unwrap` on 0 input | LOW | `crawler.rs:54` |
| 16 | Improve | Double SVG check | LOW | `crawler.rs:664-676` |
| 17 | Improve | Duplicate security_flags check | LOW | `save.rs:21-29` |
| 18 | Improve | html2md alias confusion | LOW | `crawler.rs:350` |