crw-cli 0.0.12

crw — CLI tool for scraping URLs to markdown/JSON without a server
crw-cli-0.0.12 is not a library.

crw-cli

Standalone CLI tool for scraping URLs to markdown, JSON, HTML, or plain text — no server needed.

crates.io license

Overview

crw-cli is a single-binary web scraper that fetches any URL and outputs clean content to stdout. Part of the CRW project — same extraction engine as the server, but with zero setup.

  • 6 output formats — markdown, JSON, HTML, raw HTML, plain text, links
  • Main content extraction — automatically strips nav, footer, ads, scripts
  • CSS selector & XPath — extract specific elements before conversion
  • Stealth mode — User-Agent rotation and browser-like headers
  • JS rendering — optional CDP-based rendering for SPAs (via --js flag)
  • Proxy support — per-request HTTP/HTTPS proxy
  • File output — write directly to a file with -o

Installation

cargo install crw-cli

This installs the crw binary.

Usage

Basic scraping

# Scrape a page to markdown (default)
crw https://example.com

# Output as JSON (includes all metadata)
crw https://example.com --format json

# Output as plain text
crw https://example.com --format text

# Output as HTML (cleaned)
crw https://example.com --format html

# Output raw HTML (no cleaning)
crw https://example.com --format rawhtml

# Extract all links
crw https://example.com --format links

CSS selector extraction

Extract only specific elements:

# Extract just the article content
crw https://blog.example.com --css 'article.post'

# Extract the main heading
crw https://example.com --css 'h1'

XPath extraction

# Extract all paragraph text
crw https://example.com --xpath '//p'

# Extract a specific element by ID
crw https://example.com --xpath '//*[@id="content"]'

Save to file

crw https://example.com -o page.md
crw https://example.com --format json -o page.json

Full page content (no main content extraction)

By default, crw strips boilerplate (nav, footer, ads). Use --raw to get everything:

crw https://example.com --raw

Stealth mode

Rotate User-Agent and inject browser-like headers to reduce bot detection:

crw https://protected-site.com --stealth

Proxy

Route requests through a proxy:

crw https://example.com --proxy http://user:pass@proxy:8080

JavaScript rendering

For SPAs that require JavaScript, use --js with a CDP endpoint:

# Start LightPanda (or any CDP-compatible browser)
lightpanda serve --host 127.0.0.1 --port 9222 &

# Scrape with JS rendering
CRW_CDP_URL=ws://127.0.0.1:9222 crw https://spa-app.com --js

All options

Usage: crw [OPTIONS] <URL>

Arguments:
  <URL>  URL to scrape (http or https)

Options:
  -f, --format <FORMAT>      Output format [default: markdown]
                              [values: markdown, json, html, rawhtml, text, links]
  -o, --output <FILE>        Write output to file instead of stdout
      --raw                  Disable main content extraction (full page)
      --js                   Force JS rendering (requires CRW_CDP_URL env var)
      --css <SELECTOR>       Extract only elements matching this CSS selector
      --xpath <EXPR>         Extract only elements matching this XPath expression
      --proxy <URL>          HTTP/HTTPS proxy URL
      --stealth              Enable stealth mode (UA rotation + browser headers)
  -h, --help                 Print help

Part of CRW

This crate is part of the CRW workspace — a fast, lightweight, Firecrawl-compatible web scraper built in Rust.

Crate Description
crw-core Core types, config, and error handling
crw-renderer HTTP + CDP browser rendering engine
crw-extract HTML → markdown/plaintext extraction
crw-crawl Async BFS crawler with robots.txt & sitemap
crw-server Firecrawl-compatible API server
crw-cli Standalone CLI — crw binary (this crate)
crw-mcp MCP stdio proxy binary

License

AGPL-3.0 — see LICENSE.