twars-url2md
twars-url2md is a command-line tool written in Rust that fetches web pages, strips unneeded HTML content, and converts the result into Markdown. It uses Monolith for content extraction and htmd for the conversion, so the resulting Markdown preserves the document's logical structure.
Table of Contents
- Features
- Installation
- Usage
- Configuration & Retry Mechanism
- Development and Testing
- CI/CD and Release Process
- Contributing
- License
- Author
Features
- Powerful Content Extraction
  - Uses Monolith to fetch and process web content
  - Strips unwanted assets (CSS, JavaScript, images, videos, fonts)
  - Preserves essential HTML structure
  - Detects character encodings correctly
- Smart URL Processing
  - Extracts and validates URLs from plain text, HTML, and Markdown
  - Supports relative URL resolution with base URL
  - Filters out invalid URLs and duplicates
  - Handles special characters and complex URL structures
- Flexible Input Options
  - File input (one URL per line)
  - Standard input (pipe URLs)
  - Command-line arguments
  - Base URL specification for relative links
- Robust Output Management
  - Organized directory hierarchy based on URL structure
  - Smart filename generation (index.md for root/trailing slash)
  - Proper handling of special characters
  - Optional single file or directory-based output
- Advanced Processing Features
  - Parallel URL processing with progress indication
  - Exponential backoff retry mechanism
  - Comprehensive error reporting
  - Cross-platform compatibility
Installation
From Crates.io
Make sure you have Rust 1.70.0 or later (the MSRV) installed, then run:
    cargo install twars-url2md
From Binary Releases
Pre-built binaries are available for:
- Linux (x86_64)
- macOS (Universal binary for Intel and Apple Silicon)
- Windows (x86_64)
Download from the Releases page.
From Source
Clone the repository and install locally from its root directory:
    cargo install --path .
Usage
The tool accepts URLs via a file or standard input and converts each page into a Markdown file. It also supports a base URL for resolving relative links.
Input Options
- --input <FILE>: Read URLs from a specified file (one URL per line)
- --stdin: Read URLs from standard input
- --base_url <URL>: Base URL for resolving relative links
- Note: Do not use both --input and --stdin simultaneously
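As a rough illustration of what happens to one-URL-per-line input, the sketch below keeps absolute http(s) URLs, resolves root-relative paths against a base URL, and drops blanks and duplicates. It is a simplified, standard-library-only approximation, not the tool's actual code: the names collect_urls and origin are hypothetical, and only paths starting with / are resolved, whereas a full resolver would handle every relative form.

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns the `scheme://host[:port]` prefix of `base`.
fn origin(base: &str) -> &str {
    match base.find("://") {
        Some(i) => {
            let start = i + 3;
            // Cut at the first '/' after the authority, if there is one.
            match base[start..].find('/') {
                Some(j) => &base[..start + j],
                None => base,
            }
        }
        None => base.trim_end_matches('/'),
    }
}

/// Hypothetical helper: collects URLs from line-oriented input.
/// Keeps absolute http(s) URLs, resolves root-relative paths against
/// `base_url`, and skips blanks, unresolvable entries, and duplicates
/// (the first occurrence wins, so input order is preserved).
fn collect_urls(input: &str, base_url: Option<&str>) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut urls = Vec::new();
    for line in input.lines().map(str::trim) {
        let candidate = if line.starts_with("http://") || line.starts_with("https://") {
            line.to_string()
        } else if let (Some(base), true) = (base_url, line.starts_with('/')) {
            format!("{}{}", origin(base), line)
        } else {
            continue; // blank line or something that cannot be resolved
        };
        if seen.insert(candidate.clone()) {
            urls.push(candidate);
        }
    }
    urls
}
```

Calling collect_urls(text, Some("https://example.com/blog/")) on input containing /docs/page yields https://example.com/docs/page; with no base URL, the same line is skipped.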
Output Options
- --output <DIR>: Specify output directory for Markdown files
- If no output directory is specified, content is printed to stdout
Output Organization
For URLs like scheme://username:password@host:port/path?query#fragment:
- Username, password, port, query parameters, and fragments are ignored
- Files are organized by host and path components
- URLs ending in / or with no path use index.md
- Other URLs use the last path component with .md extension
Example structure:
output/
├── example.com/
│   ├── index.md
│   └── articles/
│       └── page.md
└── another-site.com/
    └── post/
        └── article.md
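The mapping rules above can be sketched in a few lines of Rust. This is a minimal, hand-rolled illustration using only the standard library; the function name url_to_output_path is hypothetical, and the tool itself may well rely on a dedicated URL parser. Two assumptions are worth flagging: .md is appended to the last path component as-is, and bracketed IPv6 hosts are not handled.

```rust
use std::path::PathBuf;

/// Hypothetical helper: maps a URL to a relative output path.
/// Credentials, port, query, and fragment are ignored; the host and the
/// path components become directories.
fn url_to_output_path(url: &str) -> Option<PathBuf> {
    // Drop the scheme ("https://").
    let (_, rest) = url.split_once("://")?;
    // Drop the fragment first, then the query string.
    let rest = rest.split('#').next()?;
    let rest = rest.split('?').next()?;
    // Separate the authority (user:pass@host:port) from the path.
    let (authority, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    // Drop credentials, then the port (assumption: no IPv6 literals).
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        return None;
    }

    let mut out = PathBuf::from(host);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if path.is_empty() || path.ends_with('/') {
        // Root or trailing slash: every segment is a directory.
        for segment in &segments {
            out.push(segment);
        }
        out.push("index.md");
    } else {
        // Otherwise the last segment names the file.
        let (last, dirs) = segments.split_last()?;
        for dir in dirs {
            out.push(dir);
        }
        out.push(format!("{last}.md"));
    }
    Some(out)
}
```

Under these assumptions, https://user:pw@example.com:8080/articles/page?x=1#top maps to example.com/articles/page.md, and https://example.com/ maps to example.com/index.md, matching the example structure.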
Examples
# Process a single URL and print to stdout
# Process URLs from a file with specific output directory
# Process piped URLs with base URL for relative links
# Show verbose output (enabled by default)
Batch work
# This downloads 260+ links
# This downloads 11k+ links from all the files downloaded previously
Configuration & Retry Mechanism
- Parallel Processing: Uses tokio for concurrent URL processing
- Progress Tracking: Displays progress bar for multiple URLs
- Retry Logic:
  - Up to 2 retries per URL
  - Exponential backoff between attempts
  - Detailed error reporting for failed URLs
- Verbose Mode: Enabled by default for processing information
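To make the retry behaviour concrete, here is a small sketch of "up to N retries with exponential backoff". It is deliberately synchronous and standard-library-only so it stands alone, whereas the tool itself runs on tokio. The names retry_with_backoff and backoff_delay are hypothetical, and the base delay and doubling factor are assumptions, since the actual values are not documented here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: delay before retry number `attempt` (0-based).
/// The delay doubles on every retry: base, 2*base, 4*base, ...
fn backoff_delay(base: Duration, attempt: u32) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt))
}

/// Hypothetical helper: runs `op` once, then retries it up to
/// `max_retries` more times, sleeping with exponential backoff between
/// attempts. Returns the first success or the last error.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Retries exhausted: surface the last error to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, attempt));
                attempt += 1;
            }
        }
    }
}
```

With max_retries set to 2, as in the list above, an operation is attempted at most three times in total. An async version would follow the same shape with an awaited timer in place of the blocking sleep.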
Development and Testing
Running Tests
# Run all tests
cargo test
# Run with all features
cargo test --all-features
# Run specific test
cargo test <test_name>
Code Quality Tools
- Formatting: cargo fmt
- Linting: cargo clippy --all-targets --all-features
- Pre-commit Hooks: Runs formatting, clippy, and basic checks
Dependencies
Key crates used:
- monolith: Web content extraction
- htmd: HTML to Markdown conversion
- tokio: Async runtime
- reqwest: HTTP client
- linkify: URL detection
- clap: CLI argument parsing
- indicatif: Progress bars
- html5ever: HTML parsing
CI/CD and Release Process
GitHub Actions workflow includes:
- Automated testing on pull requests
- Code quality checks (clippy, fmt)
- Release creation for version tags
- Binary builds for multiple platforms
- Automatic crates.io publishing
Contributing
- Fork the repository
- Create a feature branch
- Install pre-commit hooks: pre-commit install
- Make your changes
- Ensure tests pass: cargo test
- Submit a pull request
Please follow:
- Rust coding conventions
- Comprehensive test coverage
- Clear commit messages
- Documentation updates
License
MIT License - see LICENSE for details.
Author
Adam Twardoch (@twardoch)
For bug reports, feature requests, or general questions, please open an issue on the GitHub repository.