web-capture (Rust)
A CLI and microservice to fetch URLs and render them as:
- Markdown: Converted from HTML with image extraction (default)
- HTML: Rendered page content
- Plain text: Raw text downloads for paste-like URLs such as xpaste.pro
- PNG screenshot: Full page capture
This is the Rust implementation of web-capture, providing the same API as the JavaScript version.
Installation
From crates.io
From Source
Quick Start
CLI Usage
# Capture a URL as Markdown (default format)
# Output auto-derived to ./data/web-capture/<host>/<path>/document.md
# Capture as Markdown to a specific file
# Write to stdout explicitly
# Capture as HTML
# Capture raw paste text
# Capture a GitHub repository as compact text or Markdown
# Take a screenshot
# Create a ZIP archive
# Keep images inline (opt-in)
# Structured search-provider capture (JSON by default)
# Start as API server
# Start server on custom port
API Endpoints (Server Mode)
- Markdown: GET /markdown?url=<URL> (original links kept, base64 stripped by default)
- Markdown (kreuzberg): GET /markdown?url=<URL>&converter=kreuzberg
- Markdown (structured JSON): GET /markdown?url=<URL>&converter=kreuzberg&format=json
- Markdown (base64 inline): GET /markdown?url=<URL>&embedImages=true
- Markdown (all images stripped): GET /markdown?url=<URL>&keepOriginalLinks=false
- HTML: GET /html?url=<URL>
- Text: GET /txt?url=<URL> (xpaste.pro paste URLs normalize to /raw)
- PNG screenshot: GET /image?url=<URL>
- Search: GET /search?q=<QUERY>&provider=<PROVIDER>&format=json|markdown
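The target URL is itself a query parameter value, so a client should percent-encode it before calling any of these endpoints. The following is a minimal client-side sketch using only the standard library. The helper names `encode_query_value` and `capture_request_url` and the base address are hypothetical; only the endpoint paths and parameter names come from the list above.

```rust
/// Percent-encode a string for use as a query parameter value.
/// RFC 3986 unreserved characters pass through; every other byte becomes %XX.
fn encode_query_value(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build a request URL such as `<base>/markdown?url=<encoded>&embedImages=true`.
/// `base` is whatever address the server listens on (an assumption here).
fn capture_request_url(
    base: &str,
    endpoint: &str,
    target: &str,
    extra: &[(&str, &str)],
) -> String {
    let mut url = format!(
        "{}/{}?url={}",
        base.trim_end_matches('/'),
        endpoint,
        encode_query_value(target)
    );
    // Append optional parameters such as converter, format, or embedImages.
    for (key, value) in extra {
        url.push('&');
        url.push_str(key);
        url.push('=');
        url.push_str(&encode_query_value(value));
    }
    url
}
```

For example, `capture_request_url("http://localhost:3000", "markdown", "https://example.com/", &[("converter", "kreuzberg")])` yields a URL for the kreuzberg converter variant.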
For xpaste.pro paste URLs, /markdown captures the visual paste page in visible
order and appends the raw paste text as xpaste-pro-<id>.txt when the final
Markdown stays under 1500 lines. Larger paste pages return a ZIP containing
index.md, xpaste-pro-<id>.md, and xpaste-pro-<id>.txt. Canonical,
localized, and /raw paste URLs are normalized before capture.
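The size rule described above can be sketched as a small decision function. This is an illustration only, not the crate's code: the names `PasteOutput`, `choose_paste_output`, and `zip_entry_names` are hypothetical, and treating a document of exactly 1500 lines as "too large" is an assumption drawn from the word "under".

```rust
/// Line limit taken from the description above.
const MAX_INLINE_LINES: usize = 1500;

#[derive(Debug, PartialEq)]
enum PasteOutput {
    /// One Markdown document with the raw paste text appended.
    InlineMarkdown,
    /// ZIP archive holding the Markdown and raw text as separate files.
    ZipArchive,
}

/// Decide the response shape from the line count of the final Markdown.
fn choose_paste_output(final_markdown: &str) -> PasteOutput {
    if final_markdown.lines().count() < MAX_INLINE_LINES {
        PasteOutput::InlineMarkdown
    } else {
        PasteOutput::ZipArchive
    }
}

/// File names placed in the ZIP for a given paste id.
fn zip_entry_names(paste_id: &str) -> Vec<String> {
    vec![
        "index.md".to_string(),
        format!("xpaste-pro-{}.md", paste_id),
        format!("xpaste-pro-{}.txt", paste_id),
    ]
}
```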
For plain GitHub repository URLs such as https://github.com/owner/repo,
/markdown and /txt return compact repository snapshots with repository
metadata, the root file tree, and README content. GitHub subpages continue
through the regular capture path.
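One way to tell a plain repository URL from a subpage is to count path segments after the host. The sketch below is an assumption about how such a check could look, not the crate's actual routing. `plain_github_repo` is a hypothetical name, and accepting a trailing slash, the http scheme, or a query string are choices made for the example.

```rust
/// Return `(owner, repo)` when `url` is a plain repository URL such as
/// `https://github.com/owner/repo`; return `None` for anything else,
/// including subpages like `/owner/repo/issues/1`.
fn plain_github_repo(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://github.com/")
        .or_else(|| url.strip_prefix("http://github.com/"))?;
    // Drop any query string or fragment before inspecting the path.
    let path = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Empty segments come from trailing or doubled slashes; ignore them.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    match segments.as_slice() {
        [owner, repo] => Some((owner.to_string(), repo.to_string())),
        _ => None,
    }
}
```

A `Some` result would select the compact snapshot path, while `None` would fall through to the regular capture path.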
Search Endpoint
GET /search?q=<QUERY>&provider=<PROVIDER>&format=json|markdown
Captures structured results from a search provider in a normalized,
machine-readable shape. wikipedia (default) uses the CORS-friendly REST API;
the HTML engines (duckduckgo, google, bing, brave) are parsed
server-side. Blocked or CAPTCHA-gated pages are reported through diagnostics.
| Parameter | Required | Description | Default |
|---|---|---|---|
| `q` | Yes | Search query (`query` accepted as an alias) | - |
| `provider` | No | `wikipedia`, `duckduckgo`, `google`, `bing`, `brave` | `wikipedia` |
| `limit` | No | Maximum number of results | `10` |
| `format` | No | Response format: `json` or `markdown` | `json` |
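The parameter table maps naturally onto a struct with defaults. The following sketch parses a raw query string into such a struct using only the standard library. `SearchParams` and `parse_search_params` are hypothetical names, percent-decoding is omitted for brevity, and rejecting unknown providers or formats with an error is an assumption about behavior that the table does not specify.

```rust
#[derive(Debug, PartialEq)]
struct SearchParams {
    query: String,
    provider: String,
    limit: usize,
    format: String,
}

/// Parse a raw query string such as `q=rust&provider=bing&limit=5`.
fn parse_search_params(raw: &str) -> Result<SearchParams, String> {
    // Defaults from the table: wikipedia, 10, json.
    let mut query: Option<String> = None;
    let mut provider = "wikipedia".to_string();
    let mut limit: usize = 10;
    let mut format = "json".to_string();

    for pair in raw.split('&').filter(|p| !p.is_empty()) {
        let (key, value) = pair.split_once('=').unwrap_or((pair, ""));
        match key {
            // `query` is accepted as an alias for `q`.
            "q" | "query" => query = Some(value.to_string()),
            "provider" => match value {
                "wikipedia" | "duckduckgo" | "google" | "bing" | "brave" => {
                    provider = value.to_string()
                }
                other => return Err(format!("unknown provider: {}", other)),
            },
            "limit" => {
                limit = value
                    .parse::<usize>()
                    .map_err(|_| format!("invalid limit: {}", value))?
            }
            "format" => match value {
                "json" | "markdown" => format = value.to_string(),
                other => return Err(format!("unknown format: {}", other)),
            },
            _ => {} // Ignore parameters this sketch does not know about.
        }
    }

    let query = query
        .filter(|q| !q.is_empty())
        .ok_or_else(|| "missing required parameter: q".to_string())?;
    Ok(SearchParams { query, provider, limit, format })
}
```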
The JSON shape matches the JavaScript implementation:
CLI Reference
Server Mode
Start the API server:
| Option | Short | Description | Default |
|---|---|---|---|
| `--serve` | `-s` | Start as HTTP API server | - |
| `--port` | `-p` | Port to listen on | `3000` (or `PORT` env) |
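The port default combines a CLI option, an environment variable, and a built-in fallback. A minimal sketch of one plausible resolution order follows. `resolve_port` is a hypothetical helper, and giving `--port` priority over `PORT` is an assumption; the table only states that the default is 3000 or the `PORT` value.

```rust
const DEFAULT_PORT: u16 = 3000;

/// Pick the listening port: an explicit `--port` value first, then the
/// `PORT` environment value, then 3000. Both inputs are passed in so the
/// function can be tested without touching the process environment.
fn resolve_port(cli_port: Option<u16>, env_port: Option<&str>) -> u16 {
    cli_port
        // An unparsable PORT value is ignored rather than treated as fatal.
        .or_else(|| env_port.and_then(|raw| raw.trim().parse::<u16>().ok()))
        .unwrap_or(DEFAULT_PORT)
}
```

At startup this could be called as `resolve_port(cli_port, std::env::var("PORT").ok().as_deref())`.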
Capture Mode
Capture a URL directly:
| Option | Short | Description | Default |
|---|---|---|---|
| `--format` | `-f` | Output format: `markdown`/`md`, `html`, `txt`/`text`, `image`/`png` | `markdown` |
| `--output` | `-o` | Output file path. Use `-o -` for stdout | auto-derived from URL |
| `--capture` | | Capture method: `browser` or `api` | `browser` |
| `--data-dir` | | Base directory for auto-derived output paths | `./data/web-capture` |
| `--embed-images` | | Keep images as inline base64 data URIs | `false` |
| `--no-extract-images` | | Alias for `--embed-images` | `false` |
| `--keep-original-links` | | Keep original remote URLs, strip base64 | `false` |
| `--images-dir` | | Subdirectory name for extracted images | `images` |
| `--archive` | | Create archive: `zip`, `7z`, `tar.gz`, `tar` | - |
| `--extract-latex` | | Extract LaTeX formulas | `true` |
| `--no-extract-latex` | | Disable LaTeX extraction | - |
| `--extract-metadata` | | Extract article metadata | `true` |
| `--no-extract-metadata` | | Disable metadata extraction | - |
| `--post-process` | | Apply post-processing | `true` |
| `--no-post-process` | | Disable post-processing | - |
| `--detect-code-language` | | Detect code block languages | `true` |
| `--no-detect-code-language` | | Disable code language detection | - |
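The `--format` option accepts several aliases per output type. The sketch below shows one way to normalize them into an enum and map each variant to the server endpoint listed earlier. `CaptureFormat` and its methods are hypothetical names, and matching case-insensitively is an assumption made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum CaptureFormat {
    Markdown,
    Html,
    Text,
    Image,
}

impl CaptureFormat {
    /// Normalize a `--format` value, accepting the documented aliases.
    fn parse(value: &str) -> Option<Self> {
        match value.to_ascii_lowercase().as_str() {
            "markdown" | "md" => Some(CaptureFormat::Markdown),
            "html" => Some(CaptureFormat::Html),
            "txt" | "text" => Some(CaptureFormat::Text),
            "image" | "png" => Some(CaptureFormat::Image),
            _ => None,
        }
    }

    /// Server-mode endpoint path that produces the same output type.
    fn endpoint(self) -> &'static str {
        match self {
            CaptureFormat::Markdown => "/markdown",
            CaptureFormat::Html => "/html",
            CaptureFormat::Text => "/txt",
            CaptureFormat::Image => "/image",
        }
    }
}
```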
Search Mode
Capture structured search-provider results. Output defaults to JSON; pass
--format markdown for a human-readable document.
| Option | Short | Description | Default |
|---|---|---|---|
| `--provider` | | `wikipedia`, `duckduckgo`, `google`, `bing`, `brave` | `wikipedia` |
| `--limit` | | Maximum number of results | `10` |
| `--format` | `-f` | Output format: `json` or `markdown` | `json` |
Examples
# Capture Markdown (default)
# Capture to specific file
# Write to stdout
# HTML format
# Raw paste text
# GitHub repository snapshot
# Google Docs live editor model
# Google Docs public export endpoint
# Google Docs REST API with OAuth token
# Tune browser-model quiescence for large or slow documents
WEB_CAPTURE_GDOCS_STABILITY_MS=2500 WEB_CAPTURE_GDOCS_MAX_WAIT_MS=60000 \
# Screenshot
# Pipe to another command
|
# Structured search (JSON by default)
# Search DuckDuckGo, limit to 5 results, render as Markdown
Docker
# Build and run
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `PORT` | Server port | `3000` |
| `API_TOKEN` | API token for authenticated capture | - |
| `WEB_CAPTURE_DATA_DIR` | Base directory for output | `./data/web-capture` |
| `WEB_CAPTURE_EMBED_IMAGES` | `0`/`1`: keep images inline | `0` |
| `WEB_CAPTURE_KEEP_ORIGINAL_LINKS` | `0`/`1`: keep original remote URLs | `0` |
| `WEB_CAPTURE_IMAGES_DIR` | Subdirectory for extracted images | `images` |
| `WEB_CAPTURE_EXTRACT_LATEX` | `0`/`1`: extract LaTeX | `1` |
| `WEB_CAPTURE_EXTRACT_METADATA` | `0`/`1`: extract metadata | `1` |
| `WEB_CAPTURE_POST_PROCESS` | `0`/`1`: post-processing | `1` |
| `WEB_CAPTURE_DETECT_CODE_LANGUAGE` | `0`/`1`: detect code langs | `1` |
| `RUST_LOG` | Log level (e.g. `web_capture=debug`) | `web_capture=info` |
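Most of these variables are `0`/`1` switches with a documented default. A minimal sketch of reading such a switch is shown below. `parse_flag` and `env_flag` are hypothetical helpers, and falling back to the default for any value other than `0` or `1` is an assumption, since the table does not say how other values are handled.

```rust
/// Interpret a `0`/`1` value, falling back to `default` when the variable
/// is unset or holds anything else.
fn parse_flag(raw: Option<&str>, default: bool) -> bool {
    match raw.map(str::trim) {
        Some("1") => true,
        Some("0") => false,
        _ => default,
    }
}

/// Read a switch from the process environment, for example
/// `env_flag("WEB_CAPTURE_EXTRACT_LATEX", true)`.
fn env_flag(name: &str, default: bool) -> bool {
    parse_flag(std::env::var(name).ok().as_deref(), default)
}
```

Keeping the parsing separate from the environment lookup makes the rule testable without modifying process state.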
Library Usage
Add to your Cargo.toml:
[]
= "0.2"
Example
use ;
async
Testing
Some integration suites hit live servers and are skipped by default. Enable them with environment variables:
# Download the Wikipedia page (markdown + image) via the browser engine
WIKIPEDIA_INTEGRATION=1
# Download a GitHub repository page as compact txt/markdown, original HTML, and screenshot
GITHUB_REPOSITORY_INTEGRATION=1
# Public Google Docs live suite
GDOCS_INTEGRATION=1
Built With
- Axum - Web framework
- browser-commander - Browser automation
- html2md - HTML to Markdown conversion
- scraper - HTML parsing
- Tokio - Async runtime
License
Unlicense — This is free and unencumbered software released into the public domain. You are free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means. See https://unlicense.org for details.