twars-url2md
twars-url2md is a fast and robust command-line tool written in Rust that fetches web pages, cleans up their HTML content, and converts them into clean Markdown.
You can drop text that contains URLs onto the app, and it will find every URL and save a Markdown version of each page in a logical folder structure. The output is not perfect, but the tool is fast and robust.
Table of Contents
- Features
- Installation
- Usage
- Configuration & Retry Mechanism
- Development and Testing
- CI/CD and Release Process
- Contributing
- How It Works
- License
- Author
Features
Powerful Web Content Conversion
- Extracts clean web content using Monolith
- Converts web pages to Markdown efficiently
- Handles complex URL and encoding scenarios
Smart URL Handling
- Extracts URLs from various text formats
- Resolves and validates URLs intelligently
- Supports base URL and relative link processing
Flexible Input & Output
- Multiple input methods (file, stdin, CLI)
- Organized Markdown file generation
- Cross-platform compatibility
Advanced Processing
- Parallel URL processing
- Robust error handling
Install CLI app
☛ Download CLI app for Mac, Windows or Linux
Pre-compiled binary builds for macOS (Apple/Intel), Windows (x86_64), and Linux (x86_64) are on the releases page.
Other ways to install
From Crates.io
Make sure you have Rust (MSRV: 1.70.0 or later) installed, then run:
From Source
Clone the repository and install locally:
Usage
The tool accepts URLs via a file or standard input and converts each page into a Markdown file. It also supports a base URL for resolving relative links.
Input Options
- --input <FILE>: Read URLs from a specified file (one URL per line)
- --stdin: Read URLs from standard input
- --base_url <URL>: Base URL for resolving relative links
- Note: Do not use both --input and --stdin simultaneously
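The mutual exclusion between --input and --stdin can be sketched as a small validation step. This is an illustration only: `InputSource` and `select_input` are hypothetical names, not the tool's actual code, and returning `None` when neither option is given is an assumption.

```rust
/// Where URLs are read from (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum InputSource {
    File(String),
    Stdin,
}

/// Picks an input source from the two mutually exclusive options.
/// Returns an error only when both are given; `Ok(None)` means neither
/// was selected, and the caller decides what to do next (an assumption).
fn select_input(input: Option<&str>, stdin: bool) -> Result<Option<InputSource>, String> {
    match (input, stdin) {
        (Some(_), true) => Err("use either --input or --stdin, not both".to_string()),
        (Some(path), false) => Ok(Some(InputSource::File(path.to_string()))),
        (None, true) => Ok(Some(InputSource::Stdin)),
        (None, false) => Ok(None),
    }
}
```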
Output Options
- --output <DIR>: Specify output directory for Markdown files
- If no output directory is specified, content is printed to stdout
Output Organization
The tool organizes the output into a directory structure based on the URLs.
- Organizes files by host and path components
- URLs ending in / or with no path use index.md
- Other URLs use the last path component with .md extension
Example structure:
output/
├── example.com/
│   ├── index.md
│   └── articles/
│       └── page.md
└── another-site.com/
    └── post/
        └── article.md
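As a rough illustration of the naming rules above, here is a minimal, dependency-free sketch that maps a URL to an output path. `url_to_output_path` is a hypothetical helper, not the tool's own function. The hand-rolled parsing ignores ports, credentials, and percent-encoding, and appending .md to a component that already has an extension is an assumption.

```rust
use std::path::PathBuf;

/// Maps an http(s) URL to `<output_dir>/<host>/<path>.md` (or `.../index.md`).
/// Simplified on purpose: a real implementation would use a proper URL parser.
fn url_to_output_path(output_dir: &str, url: &str) -> Option<PathBuf> {
    // Accept only http(s) URLs and strip the scheme.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;

    // Drop the query string and fragment, if any.
    let rest = rest.split(|c: char| c == '?' || c == '#').next().unwrap_or("");

    // Split the host from the path.
    let (host, path) = match rest.find('/') {
        Some(idx) => (&rest[..idx], &rest[idx + 1..]),
        None => (rest, ""),
    };
    if host.is_empty() {
        return None;
    }

    // Skip empty and dot components so the result stays inside output_dir.
    let mut parts: Vec<&str> = path
        .split('/')
        .filter(|p| !p.is_empty() && *p != "." && *p != "..")
        .collect();

    let mut out = PathBuf::from(output_dir);
    out.push(host);

    if path.is_empty() || path.ends_with('/') {
        // Directory-like URL: keep its components and write index.md inside.
        out.extend(parts);
        out.push("index.md");
    } else {
        // File-like URL: the last component becomes the file name.
        let last = parts.pop()?;
        out.extend(parts);
        out.push(format!("{last}.md"));
    }
    Some(out)
}
```

With the example structure above, `https://example.com/articles/page` would land in `output/example.com/articles/page.md` under this sketch.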
Examples
# Process a single URL and print to stdout
# Process URLs from a file with specific output directory
# Process piped URLs with base URL for relative links
# Show verbose output (disabled by default)
Batch work
# This scans the page for links, and downloads all 260+ pages linked from that page, in about 15 seconds
# This downloads 15k+ pages (links from all the files downloaded previously) in 8 minutes
Development and Testing
How It Works
twars-url2md processes web content in several steps. It starts by extracting valid http(s) URLs from stdin, files, or command-line input using the linkify crate, and filters out malformed links.
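To make the extraction step concrete, here is a deliberately naive, standard-library-only stand-in. It is not how linkify works internally and not the tool's code: it splits text on whitespace and a few delimiters, keeps tokens that look like http(s) URLs, and deduplicates them. `extract_urls` is a hypothetical name, and deduplication is an assumption of the sketch.

```rust
/// Extracts http(s) URLs from free-form text.
/// Naive stand-in for real link detection: URLs containing parentheses or
/// quotes will be cut short, which a dedicated link finder would avoid.
fn extract_urls(text: &str) -> Vec<String> {
    let mut urls: Vec<String> = Vec::new();
    let is_delimiter =
        |c: char| c.is_whitespace() || matches!(c, '<' | '>' | '"' | '\'' | '(' | ')');

    for token in text.split(is_delimiter) {
        // Trim trailing punctuation that usually belongs to the sentence, not the URL.
        let candidate =
            token.trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | ':' | '!' | '?'));

        // Require an http(s) scheme followed by a non-empty host.
        let has_host = candidate
            .strip_prefix("https://")
            .or_else(|| candidate.strip_prefix("http://"))
            .map_or(false, |rest| !rest.is_empty() && !rest.starts_with('/'));

        // Keep only the first occurrence of each URL.
        if has_host && !urls.iter().any(|u| u == candidate) {
            urls.push(candidate.to_string());
        }
    }
    urls
}
```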
For each URL, twars-url2md:
- Spawns an asynchronous task with tokio, scaling concurrent tasks to available CPU cores
- Uses monolith to fetch and clean HTML, removing scripts, styles, and media while preserving document structure
- Processes HTML with a custom html5ever parser that maintains document hierarchy and handles character encoding
- Converts content to Markdown via htmd, preserving headings, links, and basic formatting
- Implements an exponential backoff retry mechanism for failed requests
- Creates output directories based on URL domain and path using cross-platform PathBuf
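The exponential backoff retry mentioned in the list can be sketched as follows. This is a generic, blocking illustration under stated assumptions, not the tool's implementation: `retry_with_backoff`, its parameters, and the doubling factor are all hypothetical, and the actual retry count and delays are not given here.

```rust
use std::thread;
use std::time::Duration;

/// Retries `operation` up to `max_attempts` times, doubling the delay after
/// each failure. The closure receives the 1-based attempt number.
/// At least one attempt always runs, even if `max_attempts` is 0.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut operation: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match operation(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Wait, then double the delay for the next round.
                thread::sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

Inside an async task, the blocking `thread::sleep` would normally be replaced by an async sleep such as `tokio::time::sleep(delay).await` so that other tasks keep running while one request waits.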
The tool provides comprehensive error handling, catches potential panics, and generates meaningful error messages. It tracks progress for multiple URLs with indicatif and uses rayon for parallel processing of large HTML documents. It processes files in chunks and uses pre-allocated data structures with estimated capacities to achieve memory efficiency.
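The ideas of chunked processing, pre-allocated buffers, and scaling work to the CPU core count can be shown in a small standard-library sketch. Scoped std threads stand in here for the tokio and rayon machinery named above, so this is an analogy rather than the tool's code; `process_in_chunks` and the per-URL `process` callback are hypothetical.

```rust
use std::thread;

/// Processes `urls` in parallel chunks, one worker thread per chunk,
/// and returns the results in input order.
fn process_in_chunks<F>(urls: &[String], process: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    // Use as many workers as there are CPU cores (falling back to 1).
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Ceiling division so every URL lands in a chunk; `chunks` needs a size >= 1.
    let chunk_size = ((urls.len() + workers - 1) / workers).max(1);

    // Pre-allocate the result vector: one output per input URL.
    let mut results = Vec::with_capacity(urls.len());

    // Share the callback by reference so each worker can call it.
    let process = &process;
    thread::scope(|scope| {
        let handles: Vec<_> = urls
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|u| process(u.as_str()))
                        .collect::<Vec<String>>()
                })
            })
            .collect();
        // Joining in spawn order keeps results aligned with the input order.
        for handle in handles {
            results.extend(handle.join().expect("worker thread panicked"));
        }
    });
    results
}
```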
Running Tests
# Run all tests
# Run with all features
# Run specific test
Code Quality Tools
- Formatting: cargo fmt
- Linting: cargo clippy --all-targets --all-features
- Pre-commit Hooks: Runs formatting, clippy, and basic checks
CI/CD and Release Process
The GitHub Actions workflow includes:
- Automated testing on pull requests
- Code quality checks (clippy, fmt)
- Release creation for version tags
- Binary builds for multiple platforms
- Automatic crates.io publishing
License
MIT License - see LICENSE for details.
Author
Adam Twardoch (@twardoch)
For bug reports, feature requests, or general questions, please open an issue on the GitHub repository.