docrawl 0.1.5

Docs-focused crawler library and CLI: crawl documentation sites, extract main content, convert to Markdown, mirror paths, and save with frontmatter.
# Changelog

## 0.1.5

### Added

- **Config file loading**: `docrawl.config.json` is now loaded from the output directory or current working directory; CLI arguments take precedence over file values
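
As a rough illustration of the precedence described above, here is a minimal Python sketch. It is not docrawl's implementation: the lookup order (output directory first, then working directory), the helper names `find_config` and `merge_settings`, and the `concurrency` / `rate` keys are assumptions made for the example.

```python
import json
from pathlib import Path

CONFIG_NAME = "docrawl.config.json"  # file name taken from the changelog entry


def find_config(output_dir, cwd):
    """Return the first existing config path, or None.

    Assumption: the output directory is checked before the working directory.
    """
    for directory in (output_dir, cwd):
        candidate = Path(directory) / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


def load_file_values(path):
    """Read the JSON config file; a missing file contributes no values."""
    if path is None:
        return {}
    return json.loads(Path(path).read_text(encoding="utf-8"))


def merge_settings(defaults, file_values, cli_values):
    """Layer settings: defaults < config file < CLI arguments.

    CLI values of None are treated as "not given on the command line".
    """
    merged = dict(defaults)
    merged.update(file_values)
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged
```

With this layering, a value given on the command line always wins, and the file only fills in settings the user did not pass explicitly.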

### Fixed

- **`--fast` clobbering explicit flags**: `--fast --external-assets` no longer silently disables `external_assets`; same for `--allow-svg`
- **Incorrect CLI help defaults**: the help text for `--concurrency` said 8 (actual default: 16) and for `--rate` said 2 (actual default: 10); both now show the actual values
- **Invalid `--exclude` patterns silently ignored**: bad regex patterns now emit a warning instead of being dropped silently
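
The first and third fixes can be sketched in Python as follows. This is an illustration of the general techniques (explicit flags overriding a preset, and warning on bad patterns), not docrawl's code: the contents of `FAST_PRESET`, the "off" fallback, and the function names are assumptions.

```python
import argparse
import re
import warnings

# Assumed contents of the fast preset, for illustration only.
FAST_PRESET = {"external_assets": False, "allow_svg": False}


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="docrawl-sketch")
    parser.add_argument("--fast", action="store_true")
    # default=None lets us tell "flag not given" apart from an explicit choice.
    parser.add_argument("--external-assets", action="store_true", default=None)
    parser.add_argument("--allow-svg", action="store_true", default=None)
    parser.add_argument("--exclude", action="append", default=[])
    return parser.parse_args(argv)


def resolve_flags(args):
    """Explicit flags win; the preset only fills in flags the user left unset."""
    resolved = {}
    for name, fast_value in FAST_PRESET.items():
        explicit = getattr(args, name)
        if explicit is not None:
            resolved[name] = explicit
        elif args.fast:
            resolved[name] = fast_value
        else:
            resolved[name] = False  # hypothetical fallback: off when not requested
    return resolved


def compile_excludes(patterns):
    """Compile exclude patterns, warning about invalid ones instead of hiding them."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            warnings.warn(f"ignoring invalid --exclude pattern {pattern!r}: {exc}")
    return compiled
```

The key design point is that the preset is applied after parsing and only to unset values, so the order of flags on the command line does not matter.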

## 0.1.4

### Improved

- **Progress UI**: replaced confusing `Pending` / `Pages` counters with `Saved`, `Skipped`, `Queue`, and `pg/s` rate; final summary now shows skip totals and omits assets when zero (fast mode)
- **www redirect handling**: `kali.org` and `www.kali.org` are now treated as the same host, so sites that redirect from bare domain to `www` are crawled correctly
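
One common way to treat a bare domain and its `www` variant as the same host is to compare a normalized host key. The Python sketch below illustrates that idea only; `host_key` and `same_site` are hypothetical names, and docrawl's actual comparison may differ.

```python
from urllib.parse import urlsplit


def host_key(url):
    """Lowercased hostname with a single leading "www." removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_site(url_a, url_b):
    """True when both URLs point at the same host, ignoring a "www." prefix."""
    return host_key(url_a) == host_key(url_b)
```

Under this rule `kali.org` and `www.kali.org` match, while other subdomains are still treated as different hosts.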

### Fixed

- **Stale cache blocking fresh crawls**: non-resume runs now clear the visited and meta cache trees, so previously crawled URLs are no longer skipped permanently
- **Code blocks corrupted by regex transforms**: fenced code blocks (backtick and tilde) are now shielded before security scanning and markdown cleanup, preventing false-positive quarantine of pages with code examples and preserving `# comments`, `` ```system `` fences, and intentional whitespace inside code
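
The shielding idea can be sketched in Python as a stash-and-restore pass around the regex transforms. This is a simplified illustration, not docrawl's implementation: the placeholder format, the function names, and the whitespace-collapsing transform are assumptions, and the pattern only handles closing fences of the same length as the opening fence.

```python
import re

# A fence line of 3+ backticks or tildes, any info string, then everything
# up to a closing line made of the same fence.
FENCE_RE = re.compile(
    r"^(`{3,}|~{3,})[^\n]*\n.*?^\1[ \t]*$",
    re.MULTILINE | re.DOTALL,
)
PLACEHOLDER_RE = re.compile(r"\x00CODEBLOCK(\d+)\x00")


def shield_code_blocks(markdown):
    """Replace each fenced block with a placeholder; return text and stash."""
    blocks = []

    def stash(match):
        blocks.append(match.group(0))
        return f"\x00CODEBLOCK{len(blocks) - 1}\x00"

    return FENCE_RE.sub(stash, markdown), blocks


def restore_code_blocks(markdown, blocks):
    """Put the stashed blocks back exactly as they were."""
    return PLACEHOLDER_RE.sub(lambda m: blocks[int(m.group(1))], markdown)


def clean_markdown(markdown):
    """Run a whitespace-collapsing transform on prose only."""
    shielded, blocks = shield_code_blocks(markdown)
    cleaned = re.sub(r"[ \t]+", " ", shielded)  # would damage code if unshielded
    return restore_code_blocks(cleaned, blocks)
```

Because the transforms never see the contents of a fenced block, comments, fence info strings, and intentional whitespace inside code pass through unchanged.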

## 0.1.3

- Self-updating feature (`--update`)
- README cleanup