# web-capture (Rust)
A CLI and microservice to fetch URLs and render them as:
- Markdown: Converted from HTML with image extraction (default)
- HTML: Rendered page content
- PNG screenshot: Full page capture
This is the Rust implementation of web-capture, providing the same API as the JavaScript version.
## Installation

### From crates.io

```sh
cargo install web-capture
```

### From Source

```sh
git clone <repository-url>
cd web-capture
cargo build --release
```
## Quick Start

### CLI Usage
```sh
# Capture a URL as Markdown (default format)
# Output auto-derived to ./data/web-capture/<host>/<path>/document.md
web-capture https://example.com

# Capture as Markdown to a specific file
web-capture https://example.com -o page.md

# Write to stdout explicitly
web-capture https://example.com -o -

# Capture as HTML
web-capture https://example.com -f html

# Take a screenshot
web-capture https://example.com -f png

# Create a ZIP archive
web-capture https://example.com --archive zip

# Keep images inline (opt-in)
web-capture https://example.com --embed-images

# Start as API server
web-capture --serve

# Start server on custom port
web-capture --serve --port 8080
```
## API Endpoints (Server Mode)

- Markdown: `GET /markdown?url=<URL>` (original links kept, base64 stripped by default)
- Markdown (base64 inline): `GET /markdown?url=<URL>&embedImages=true`
- Markdown (all images stripped): `GET /markdown?url=<URL>&keepOriginalLinks=false`
- HTML: `GET /html?url=<URL>`
- PNG screenshot: `GET /image?url=<URL>`
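Target URLs contain characters (`:`, `/`, `?`, `&`) that collide with the query string, so the `url` parameter should be percent-encoded before calling these endpoints. A small sketch, using `python3` for the encoding step:

```shell
# Percent-encode a target URL so it can be passed safely as the ?url= parameter.
target='https://example.com/page?a=1&b=2'
encoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$target")
echo "GET /markdown?url=$encoded"
```

With a server running on the default port, the encoded value can then be used directly, e.g. `curl "http://localhost:3000/markdown?url=$encoded"`.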
## CLI Reference

### Server Mode

Start the API server:

| Option | Short | Description | Default |
|---|---|---|---|
| `--serve` | `-s` | Start as HTTP API server | - |
| `--port` | `-p` | Port to listen on | `3000` (or `PORT` env) |
### Capture Mode

Capture a URL directly:

| Option | Short | Description | Default |
|---|---|---|---|
| `--format` | `-f` | Output format: `markdown`/`md`, `html`, `image`/`png` | `markdown` |
| `--output` | `-o` | Output file path. Use `-o -` for stdout | auto-derived from URL |
| `--capture` | | Capture method: `browser` or `api` | `browser` |
| `--data-dir` | | Base directory for auto-derived output paths | `./data/web-capture` |
| `--embed-images` | | Keep images as inline base64 data URIs | `false` |
| `--no-extract-images` | | Alias for `--embed-images` | `false` |
| `--keep-original-links` | | Keep original remote URLs, strip base64 | `false` |
| `--images-dir` | | Subdirectory name for extracted images | `images` |
| `--archive` | | Create archive: `zip`, `7z`, `tar.gz`, `tar` | - |
| `--extract-latex` | | Extract LaTeX formulas | `true` |
| `--no-extract-latex` | | Disable LaTeX extraction | - |
| `--extract-metadata` | | Extract article metadata | `true` |
| `--no-extract-metadata` | | Disable metadata extraction | - |
| `--post-process` | | Apply post-processing | `true` |
| `--no-post-process` | | Disable post-processing | - |
| `--detect-code-language` | | Detect code block languages | `true` |
| `--no-detect-code-language` | | Disable code language detection | - |
### Examples

```sh
# Capture Markdown (default)
web-capture https://example.com

# Capture to specific file
web-capture https://example.com -o article.md

# Write to stdout
web-capture https://example.com -o -

# HTML format
web-capture https://example.com -f html

# Google Docs live editor mode
web-capture "https://docs.google.com/document/d/<doc-id>/edit"

# Google Docs public export endpoint
web-capture "https://docs.google.com/document/d/<doc-id>/export?format=html"

# Google Docs REST API with OAuth token
API_TOKEN=<token> web-capture "https://docs.google.com/document/d/<doc-id>" --capture api

# Screenshot
web-capture https://example.com -f png -o screenshot.png

# Pipe to another command
web-capture https://example.com -o - | less
```
## Docker

```sh
# Build and run (the image tag is illustrative)
docker build -t web-capture .
docker run -p 3000:3000 web-capture
```
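For longer-running deployments, the same port mapping and environment variables can be carried in a Compose file. A minimal sketch; the service name, mounted paths, and chosen values are assumptions, not a shipped configuration:

```yaml
# docker-compose.yml (illustrative)
services:
  web-capture:
    build: .
    ports:
      - "3000:3000"
    environment:
      PORT: "3000"
      WEB_CAPTURE_DATA_DIR: /data/web-capture
      RUST_LOG: web_capture=info
    volumes:
      - ./data:/data
```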
## Configuration

### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `PORT` | Server port | `3000` |
| `API_TOKEN` | API token for authenticated capture | - |
| `WEB_CAPTURE_DATA_DIR` | Base directory for output | `./data/web-capture` |
| `WEB_CAPTURE_EMBED_IMAGES` | `0`/`1` — keep images inline | `0` |
| `WEB_CAPTURE_KEEP_ORIGINAL_LINKS` | `0`/`1` — keep original remote URLs | `0` |
| `WEB_CAPTURE_IMAGES_DIR` | Subdirectory for extracted images | `images` |
| `WEB_CAPTURE_EXTRACT_LATEX` | `0`/`1` — extract LaTeX | `1` |
| `WEB_CAPTURE_EXTRACT_METADATA` | `0`/`1` — extract metadata | `1` |
| `WEB_CAPTURE_POST_PROCESS` | `0`/`1` — post-processing | `1` |
| `WEB_CAPTURE_DETECT_CODE_LANGUAGE` | `0`/`1` — detect code languages | `1` |
| `RUST_LOG` | Log level (e.g. `web_capture=debug`) | `web_capture=info` |
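The `0`/`1` toggles are plain strings. As an illustration of the convention (this is not the crate's actual parsing code), such a flag can be read with a fallback default like so:

```rust
use std::env;

/// Read a 0/1 environment toggle, falling back to `default` when the
/// variable is unset or holds anything other than "0" or "1".
/// Illustrative only; not the crate's actual implementation.
fn env_flag(name: &str, default: bool) -> bool {
    match env::var(name).ok().as_deref() {
        Some("1") => true,
        Some("0") => false,
        _ => default,
    }
}

fn main() {
    // With the variable unset, the documented default applies.
    println!(
        "extract_latex = {}",
        env_flag("WEB_CAPTURE_EXTRACT_LATEX", true)
    );
}
```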
## Library Usage

Add to your `Cargo.toml`:

```toml
[dependencies]
web-capture = "0.2"
```
### Example

The snippet below is a minimal async sketch; the exact exports are assumptions (check the crate docs for the actual API):

```rust
// Hypothetical usage sketch; the crate's actual exports may differ.
use web_capture::capture;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch a page and convert it to Markdown (the default format).
    let markdown = capture("https://example.com").await?;
    println!("{markdown}");
    Ok(())
}
```
## Built With
- Axum - Web framework
- browser-commander - Browser automation
- html2md - HTML to Markdown conversion
- scraper - HTML parsing
- Tokio - Async runtime
## License
Unlicense — This is free and unencumbered software released into the public domain. You are free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means. See https://unlicense.org for details.