crw-cli
Standalone CLI tool for scraping URLs to markdown, JSON, HTML, or plain text — no server needed.
Overview
crw-cli is a single-binary web scraper that fetches any URL and outputs clean content to stdout. Part of the CRW project — same extraction engine as the server, but with zero setup.
- 6 output formats: markdown, JSON, HTML, raw HTML, plain text, links
- Main content extraction: automatically strips nav, footer, ads, scripts
- CSS selector & XPath: extract specific elements before conversion
- Stealth mode: User-Agent rotation and browser-like headers
- JS rendering: optional CDP-based rendering for SPAs (via the --js flag)
- Proxy support: per-request HTTP, HTTPS, or SOCKS5 proxy
- File output: write directly to a file with -o
Installation
Installing crw-cli provides the crw binary.
Usage
Basic scraping
Select the output format with -f / --format:
- markdown (default)
- json: includes all metadata
- text: plain text
- html: cleaned HTML
- rawhtml: raw HTML, no cleaning
- links: all links found on the page
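The formats can be tried with invocations along these lines. This is a minimal sketch built only from the flags listed under "All options"; `https://example.com` is a placeholder URL and the `run` helper is a hypothetical wrapper (not part of crw) that prints each command unless `DRY_RUN=0` is set.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

run crw "$URL"              # markdown (default)
run crw -f json "$URL"      # JSON, includes all metadata
run crw -f text "$URL"      # plain text
run crw -f html "$URL"      # cleaned HTML
run crw -f rawhtml "$URL"   # raw HTML, no cleaning
run crw -f links "$URL"     # all links
```

Run the script with `DRY_RUN=0` to execute the commands instead of printing them; output goes to stdout, so it can be piped into other tools.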
CSS selector extraction
Extract only specific elements with --css <SELECTOR>, such as just the article content or the main heading.
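A short sketch of selector-based extraction, assuming the target page has an `<article>` element and an `<h1>`. The URL and the `run` dry-run helper are placeholders for illustration, not part of crw.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Extract just the article content (assumes the page uses <article>).
run crw --css 'article' "$URL"

# Extract the main heading.
run crw --css 'h1' "$URL"
```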
XPath extraction
Extract elements with --xpath <EXPR>, such as all paragraph text or a specific element by ID.
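The same idea with XPath might look like the following. The expressions are standard XPath 1.0, but the element ID `main` and the URL are assumptions, and which XPath features crw supports is not stated here. The `run` helper is a hypothetical dry-run wrapper.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Extract all paragraph elements.
run crw --xpath '//p' "$URL"

# Extract a specific element by ID ("main" is a hypothetical ID).
run crw --xpath '//*[@id="main"]' "$URL"
```

Single quotes around the expression keep the shell from interpreting `*`, `[` and the inner double quotes.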
Save to file
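Use -o / --output to write the result to a file instead of stdout. A minimal sketch follows; the file names `page.md` and `page.json` and the URL are placeholders, and `run` is a hypothetical dry-run wrapper.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Write markdown (the default format) to page.md.
run crw -o page.md "$URL"

# Combine with -f to save another format.
run crw -f json -o page.json "$URL"
```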
Full page content (no main content extraction)
By default, crw strips boilerplate (nav, footer, ads). Use --raw to disable main content extraction and output the full page.
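For comparison, a sketch of the default and full-page invocations side by side. The URL is a placeholder and `run` is a hypothetical dry-run wrapper.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

run crw "$URL"         # main content only (default)
run crw --raw "$URL"   # full page, boilerplate included
```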
Stealth mode
Use --stealth to rotate the User-Agent and inject browser-like headers, which reduces the chance of bot detection.
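A minimal stealth-mode invocation could look like this; the URL is a placeholder and `run` is a hypothetical dry-run wrapper.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Enable UA rotation and browser-like headers for this request.
run crw --stealth "$URL"
```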
Proxy
Route requests through an HTTP, HTTPS, or SOCKS5 proxy with --proxy <URL>.
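Two illustrative proxy invocations follow. The SOCKS5 URL is the example from the option help; `proxy.example.com:8080` and the target URL are placeholders, and `run` is a hypothetical dry-run wrapper.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# HTTP proxy (hypothetical host and port).
run crw --proxy http://proxy.example.com:8080 "$URL"

# SOCKS5 proxy with credentials, as shown in the option help.
run crw --proxy socks5://user:pass@host:1080 "$URL"
```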
JavaScript rendering
For SPAs that require JavaScript, use --js. The CLI auto-detects a locally
installed LightPanda or Chrome and spawns it; you can also point at an
already running CDP WebSocket endpoint by setting CRW_CDP_URL, for example
CRW_CDP_URL=ws://127.0.0.1:9222.
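Both modes can be sketched as follows. The target URL is a placeholder, the endpoint assumes a browser is already listening on port 9222, and `run` is a hypothetical dry-run wrapper.

```shell
URL="https://example.com"   # placeholder URL (assumption)

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Auto-detect: spawns LightPanda/Chrome if installed.
run crw --js "$URL"

# Or point at a running CDP endpoint via the environment.
export CRW_CDP_URL="ws://127.0.0.1:9222"
run crw --js "$URL"
```

Exporting the variable applies it to every later crw call in the same shell; prefix a single command with `CRW_CDP_URL=... crw --js <URL>` to scope it to one invocation.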
Note: CRW_CDP_URL is only honored by crw (the CLI). In server/MCP mode use
[renderer.lightpanda.ws_url] / [renderer.chrome.ws_url] (or the matching
CRW_RENDERER__*__WS_URL env vars).
All options
    Usage: crw [OPTIONS] <URL>

    Arguments:
      <URL>  URL to scrape (http or https)

    Options:
      -f, --format <FORMAT>  Output format [default: markdown]
                             [values: markdown, json, html, rawhtml, text, links]
      -o, --output <FILE>    Write output to file instead of stdout
          --raw              Disable main content extraction (full page)
          --js               Force JS rendering (auto-detects LightPanda/Chrome, or CRW_CDP_URL)
          --css <SELECTOR>   Extract only elements matching this CSS selector
          --xpath <EXPR>     Extract only elements matching this XPath expression
          --proxy <URL>      HTTP, HTTPS, or SOCKS5 proxy URL (e.g. socks5://user:pass@host:1080)
          --stealth          Enable stealth mode (UA rotation + browser headers)
      -h, --help             Print help
Part of CRW
This crate is part of the CRW workspace — a fast, lightweight, Firecrawl-compatible web scraper built in Rust.
| Crate | Description |
|---|---|
| crw-core | Core types, config, and error handling |
| crw-renderer | HTTP + CDP browser rendering engine |
| crw-extract | HTML → markdown/plaintext extraction |
| crw-crawl | Async BFS crawler with robots.txt & sitemap |
| crw-server | Firecrawl-compatible API server |
| crw-cli | Standalone CLI — crw binary (this crate) |
| crw-mcp | MCP stdio proxy binary |
License
AGPL-3.0 — see LICENSE.