web-capture (Rust)
A CLI and microservice to fetch URLs and render them as:
- Markdown: Converted from HTML with image extraction (default)
- HTML: Rendered page content
- PNG screenshot: Full page capture
This is the Rust implementation of web-capture, providing the same API as the JavaScript version.
Installation
From crates.io
From Source
Quick Start
CLI Usage
# Capture a URL as Markdown (default format)
# Output auto-derived to ./data/web-capture/<host>/<path>/document.md
# Capture as Markdown to a specific file
# Write to stdout explicitly
# Capture as HTML
# Take a screenshot
# Create a ZIP archive
# Keep images inline (opt-in)
# Structured search-provider capture (JSON by default)
# Start as API server
# Start server on custom port
API Endpoints (Server Mode)
- Markdown:
GET /markdown?url=<URL>(original links kept, base64 stripped by default) - Markdown (base64 inline):
GET /markdown?url=<URL>&embedImages=true - Markdown (all images stripped):
GET /markdown?url=<URL>&keepOriginalLinks=false - HTML:
GET /html?url=<URL> - PNG screenshot:
GET /image?url=<URL> - Search:
GET /search?q=<QUERY>&provider=<PROVIDER>&format=json|markdown
Search Endpoint
GET /search?q=<QUERY>&provider=<PROVIDER>&format=json|markdown
Captures structured results from a search provider in a normalized,
machine-readable shape. wikipedia (default) uses the CORS-friendly REST API;
the HTML engines (duckduckgo, google, bing, brave) are parsed
server-side. Blocked or CAPTCHA-gated pages are reported through diagnostics.
| Parameter | Required | Description | Default |
|---|---|---|---|
q |
Yes | Search query (query accepted as an alias) |
- |
provider |
No | wikipedia, duckduckgo, google, bing, brave |
wikipedia |
limit |
No | Maximum number of results | 10 |
format |
No | Response format: json or markdown |
json |
The JSON shape matches the JavaScript implementation:
CLI Reference
Server Mode
Start the API server:
| Option | Short | Description | Default |
|---|---|---|---|
--serve |
-s |
Start as HTTP API server | - |
--port |
-p |
Port to listen on | 3000 (or PORT env) |
Capture Mode
Capture a URL directly:
| Option | Short | Description | Default |
|---|---|---|---|
--format |
-f |
Output format: markdown/md, html, image/png |
markdown |
--output |
-o |
Output file path. Use -o - for stdout |
auto-derived from URL |
--capture |
Capture method: browser or api |
browser |
|
--data-dir |
Base directory for auto-derived output paths | ./data/web-capture |
|
--embed-images |
Keep images as inline base64 data URIs | false | |
--no-extract-images |
Alias for --embed-images |
false | |
--keep-original-links |
Keep original remote URLs, strip base64 | false | |
--images-dir |
Subdirectory name for extracted images | images |
|
--archive |
Create archive: zip, 7z, tar.gz, tar |
- | |
--extract-latex |
Extract LaTeX formulas | true | |
--no-extract-latex |
Disable LaTeX extraction | - | |
--extract-metadata |
Extract article metadata | true | |
--no-extract-metadata |
Disable metadata extraction | - | |
--post-process |
Apply post-processing | true | |
--no-post-process |
Disable post-processing | - | |
--detect-code-language |
Detect code block languages | true | |
--no-detect-code-language |
Disable code language detection | - |
Search Mode
Capture structured search-provider results. Output defaults to JSON; pass
--format markdown for a human-readable document.
| Option | Short | Description | Default |
|---|---|---|---|
--provider |
wikipedia, duckduckgo, google, bing, brave |
wikipedia |
|
--limit |
Maximum number of results | 10 |
|
--format |
-f |
Output format: json or markdown |
json |
Examples
# Capture Markdown (default)
# Capture to specific file
# Write to stdout
# HTML format
# Google Docs live editor model
# Google Docs public export endpoint
# Google Docs REST API with OAuth token
# Tune browser-model quiescence for large or slow documents
WEB_CAPTURE_GDOCS_STABILITY_MS=2500 WEB_CAPTURE_GDOCS_MAX_WAIT_MS=60000 \
# Screenshot
# Pipe to another command
|
# Structured search (JSON by default)
# Search DuckDuckGo, limit to 5 results, render as Markdown
Docker
# Build and run
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
PORT |
Server port | 3000 |
API_TOKEN |
API token for authenticated capture | - |
WEB_CAPTURE_DATA_DIR |
Base directory for output | ./data/web-capture |
WEB_CAPTURE_EMBED_IMAGES |
0/1 — keep images inline |
0 |
WEB_CAPTURE_KEEP_ORIGINAL_LINKS |
0/1 — keep original remote URLs |
0 |
WEB_CAPTURE_IMAGES_DIR |
Subdirectory for extracted images | images |
WEB_CAPTURE_EXTRACT_LATEX |
0/1 — extract LaTeX |
1 |
WEB_CAPTURE_EXTRACT_METADATA |
0/1 — extract metadata |
1 |
WEB_CAPTURE_POST_PROCESS |
0/1 — post-processing |
1 |
WEB_CAPTURE_DETECT_CODE_LANGUAGE |
0/1 — detect code langs |
1 |
RUST_LOG |
Log level (e.g. web_capture=debug) |
web_capture=info |
Library Usage
Add to your Cargo.toml:
[]
= "0.2"
Example
use ;
async
Built With
- Axum - Web framework
- browser-commander - Browser automation
- html2md - HTML to Markdown conversion
- scraper - HTML parsing
- Tokio - Async runtime
License
Unlicense — This is free and unencumbered software released into the public domain. You are free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means. See https://unlicense.org for details.