# rexturl
A command-line tool for parsing and manipulating URLs with predictable output formats.
## Key Features

### Clean UX Design

- **One flag controls format:** `--format {plain,tsv,csv,json,jsonl,custom,sql}`
- **Precise field selection:** `--fields domain,path,url`
- **Custom templates:** `--template '{scheme}://{domain}{path}'`
- **SQL generation:** multi-dialect INSERT statements with proper escaping
- **Consistent output:** same field order across all formats
- **Machine-friendly:** proper headers, null handling, exit codes
### Technical Implementation
- Custom URL parser with optimized component extraction
- Zero-copy parsing with minimal allocations
- Parallel processing for bulk operations
- Multi-part TLD support (co.uk, com.au, etc.)
- Template engine with conditional logic and escaping modes
- SQL generation with dialect-specific type mapping
### Processing Features

- **Field extraction:** scheme, username, host, domain, subdomain, port, path, query, fragment
- **Data processing:** sort, deduplicate, filter
- **Input flexibility:** command-line args or stdin
## Installation

Clone the repository and build from source:

```shell
git clone <repository-url>
cd rexturl
cargo build --release
```
## Quick Start

Extract the domain from a URL:

```shell
rexturl --urls https://www.example.com --fields domain
# Output: example.com
```

TSV format with headers:

```shell
echo https://blog.example.co.uk/posts | rexturl --format tsv --header --fields subdomain,domain,path
# Output:
# subdomain domain path
# blog example.co.uk /posts
```

JSON output for APIs:

```shell
echo https://api.com | rexturl --format json --fields domain
# Output: {"urls":[{"domain":"api.com"}]}
```
## Usage

### Input Methods

- `--urls <URLS>` - specify URLs as command-line arguments
- stdin - pipe URLs from other commands (the default when `--urls` is absent)
- Supports single or multiple URLs
### Options

#### Core Options

| Option | Values | Description |
|---|---|---|
| `--format` | `plain`, `tsv`, `csv`, `json`, `jsonl`, `custom`, `sql` | Output format (default: `plain`) |
| `--fields` | e.g. `domain,path,url` | Comma-separated fields to extract |
| `--urls` | URL strings | Input URLs to process |
| `--header` | - | Include header row for tabular formats |
| `--sort` | - | Sort output by first field |
| `--unique` | - | Remove duplicate entries |
### Available Fields

| Field | Description | Example |
|---|---|---|
| `url` | Original URL string | `https://www.example.com/path` |
| `scheme` | Protocol | `https` |
| `username` | Username portion | `user` |
| `host` / `hostname` | Full hostname | `www.example.com` |
| `subdomain` | Subdomain only | `www` |
| `domain` | Registrable domain | `example.com` |
| `port` | Port number | `8080` |
| `path` | URL path | `/path` |
| `query` | Query parameters | `q=search` |
| `fragment` | Fragment identifier | `section` |
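As a cross-check on these field definitions, the same decomposition can be reproduced with Python's standard `urllib.parse` (this is a sketch of the field semantics only, not rexturl's parser; splitting `host` into `subdomain` and `domain` additionally requires the multi-part TLD logic described later):

```python
from urllib.parse import urlsplit

# A sample URL with every component populated.
parts = urlsplit("https://user@www.example.com:8080/path?q=search#section")

print(parts.scheme)    # scheme   -> https
print(parts.username)  # username -> user
print(parts.hostname)  # host     -> www.example.com
print(parts.port)      # port     -> 8080
print(parts.path)      # path     -> /path
print(parts.query)     # query    -> q=search
print(parts.fragment)  # fragment -> section
```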
### Advanced Options

| Option | Values | Description |
|---|---|---|
| `--pretty` | - | Pretty-print JSON output |
| `--strict` | - | Exit code 2 if any URL fails to parse |
| `--no-newline` | - | Suppress trailing newline |
| `--null-empty` | Custom string | Value for missing fields (default: `\N`) |
| `--color` | `auto`, `never`, `always` | Colored output for plain format |
### Custom Format Options

| Option | Values | Description |
|---|---|---|
| `--template` | Template string | Custom format template (e.g., `'{scheme}://{domain}{path}'`) |
| `--escape` | `none`, `shell`, `csv`, `json`, `sql` | Escaping mode for custom format |
### SQL Output Options

| Option | Values | Description |
|---|---|---|
| `--sql-table` | Table name | SQL table name (default: `urls`) |
| `--sql-create-table` | - | Include CREATE TABLE statement |
| `--sql-dialect` | `postgres`, `mysql`, `sqlite`, `generic` | SQL dialect for type mapping |
### Legacy Field Flags (Still Supported)

These flags automatically add fields; use `--fields` for explicit control:

| Flag | Equivalent | Description |
|---|---|---|
| `--domain` | `--fields domain` | Extract domain |
| `--host` | `--fields subdomain` | Extract subdomain |
| `--scheme` | `--fields scheme` | Extract scheme |
| `--path` | `--fields path` | Extract path |
### Deprecated Options

| Option | Use Instead | Description |
|---|---|---|
| `--json` | `--format json` | JSON output (deprecated) |
| `--all` | `--fields` with specific names | All fields (deprecated) |
| `--custom` | `--format` and `--fields` | Custom format (deprecated) |
## Examples

### Most Common Use Cases

1. Extract domains for analysis:

   ```shell
   cat urls.txt | rexturl --fields domain --sort --unique
   # Clean list of unique domains
   ```

2. Create a spreadsheet-ready CSV:

   ```shell
   rexturl --urls https://api.example.com/v1/users --format csv --header --fields subdomain,domain,path
   # subdomain,domain,path
   # api,example.com,/v1/users
   ```

3. JSON for APIs and scripts:

   ```shell
   echo https://api.com/endpoints | rexturl --format json --fields domain,path
   # {"urls":[{"domain":"api.com","path":"/endpoints"}]}
   ```
### All Format Examples

Plain (default):

```shell
rexturl --urls https://blog.example.com/posts --fields subdomain,domain,path
# blog example.com /posts
```

TSV with header:

```shell
echo https://api.example.com/v1 | rexturl --format tsv --header --fields subdomain,domain,path
# subdomain domain path
# api example.com /v1
```

CSV for spreadsheets:

```shell
rexturl --urls https://www.example.com --format csv --header --fields url,domain
# url,domain
# https://www.example.com,example.com
```

JSON for APIs:

```shell
echo https://example.com/ | rexturl --format json --pretty --fields domain,path
# {
#   "urls": [
#     {
#       "domain": "example.com",
#       "path": "/"
#     }
#   ]
# }
```

JSONL for streaming:

```shell
printf 'https://example.com\nhttps://api.com\nhttps://blog.net\n' | rexturl --format jsonl --fields domain
# {"domain":"example.com"}
# {"domain":"api.com"}
# {"domain":"blog.net"}
```
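One reason JSONL suits streaming is that a consumer can parse each line independently, without buffering the whole document. A minimal Python consumer sketch (the hard-coded sample lines stand in for rexturl's piped output):

```python
import json

# Each JSONL line is one complete JSON object.
stream = [
    '{"domain":"example.com"}',
    '{"domain":"api.com"}',
    '{"domain":"blog.net"}',
]

# Process incrementally, one record at a time.
domains = [json.loads(line)["domain"] for line in stream]
print(domains)  # ['example.com', 'api.com', 'blog.net']
```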
Custom format with templates:

```shell
rexturl --urls https://example.com/v1/users --format custom --template '{scheme}://{domain}{path}'
# https://example.com/v1/users
```

SQL INSERT statements:

```shell
rexturl --urls https://example.com/path --format sql --fields domain,path
# INSERT INTO urls (domain, path) VALUES ('example.com', '/path');
```
### Advanced Examples

Multi-part TLD handling:

```shell
rexturl --urls https://blog.example.co.uk/posts --fields subdomain,domain,path
# blog example.co.uk /posts
```

Handle missing values:

```shell
echo https://example.com | rexturl --fields domain,port --null-empty N/A
# example.com N/A
```

Error handling with strict mode:

```shell
echo not-a-url | rexturl --strict --fields domain
# Error: Failed to parse URL: not-a-url
# Exit code: 2
```

Legacy syntax (still works):

```shell
rexturl --urls https://example.com/ --domain --path
# example.com /
```
## Domain and Subdomain Extraction

rexturl includes intelligent handling for domains and subdomains:

- **Multi-part TLD support:** automatically detects complex TLDs like `co.uk`, `org.uk`, `com.au`, etc.
- **Domain extraction:** the `--domain` flag extracts the registrable domain name
- **Subdomain extraction:** when used alone, `--host` extracts the subdomain portion
- **Smart detection:** handles edge cases with nested subdomains and international domains

Supported multi-part TLDs include:

`co.uk`, `org.uk`, `ac.uk`, `gov.uk`, `me.uk`, `net.uk`, `sch.uk`, `com.au`, `net.au`, `org.au`, `edu.au`, `gov.au`, `co.nz`, `net.nz`, `org.nz`, `govt.nz`, `co.za`, `org.za`, `com.br`, `net.br`, `org.br`, `co.jp`, `com.mx`, `com.ar`, `com.sg`, `com.my`, `co.id`, `com.hk`, `co.th`, `in.th`
Examples:

```shell
# Using custom format for specific extraction
echo https://blog.example.co.uk | rexturl --format custom --template 'Subdomain: {subdomain}, Domain: {domain}'
# Output: Subdomain: blog, Domain: example.co.uk

# Extract all components (tab-separated format)
echo 'https://user@blog.example.co.uk:8080/posts?q=test#frag' | rexturl --format tsv --fields scheme,username,host,port,path,query,fragment,domain
# Output: https user blog.example.co.uk 8080 /posts q=test frag example.co.uk

# Extract components with the --urls flag
rexturl --urls https://blog.example.co.uk --fields domain
# Output: example.co.uk
```
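The registrable-domain logic described above amounts to matching the hostname's suffix against a multi-part TLD list before splitting off the subdomain. A simplified Python sketch under that assumption (`split_host` and the abbreviated TLD set are illustrative; rexturl's own implementation handles the full list and additional edge cases):

```python
# Abbreviated stand-in for the full multi-part TLD list above.
MULTI_PART_TLDS = {"co.uk", "org.uk", "com.au", "co.nz"}

def split_host(host: str) -> tuple[str, str]:
    """Return (subdomain, registrable_domain) for a hostname."""
    labels = host.lower().split(".")
    # A multi-part TLD consumes two labels; a plain TLD consumes one.
    tld_labels = 2 if ".".join(labels[-2:]) in MULTI_PART_TLDS else 1
    domain_labels = tld_labels + 1  # registrable domain = one label + TLD
    subdomain = ".".join(labels[:-domain_labels])
    domain = ".".join(labels[-domain_labels:])
    return subdomain, domain

print(split_host("blog.example.co.uk"))  # ('blog', 'example.co.uk')
print(split_host("www.example.com"))     # ('www', 'example.com')
```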
## Custom Templates

### Template Syntax

Use `--format custom --template` for flexible output formatting.

Basic fields:

- `{field}` - insert the field value, or an empty string if missing
- `{field:default}` - insert the field value, or the default if missing
- `{field?text}` - insert text only if the field has a value
- `{field!text}` - insert text only if the field is missing

Available fields:

- `{scheme}` - URL scheme (http, https, etc.)
- `{username}` - username portion of the URL
- `{host}` - full hostname
- `{hostname}` - alias for host
- `{subdomain}` - subdomain portion (e.g., "www" in www.example.com)
- `{domain}` - domain name (e.g., "example.com")
- `{port}` - port number
- `{path}` - URL path
- `{query}` - query string (without the leading ?)
- `{fragment}` - fragment identifier (without the leading #)

Escaping modes:

- `--escape none` - no escaping (default)
- `--escape shell` - shell-safe quoting
- `--escape csv` - CSV-compatible escaping
- `--escape json` - JSON string escaping
- `--escape sql` - SQL value escaping
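The four placeholder forms can be modeled with a small substitution function. This Python sketch mirrors the documented semantics (it illustrates the syntax only and is not rexturl's template engine):

```python
import re

def render(template: str, fields: dict) -> str:
    def sub(match: re.Match) -> str:
        name, sep, arg = match.group(1), match.group(2), match.group(3)
        value = fields.get(name, "")
        if sep == ":":   # {field:default} -> default when missing
            return value if value else arg
        if sep == "?":   # {field?text} -> text when present
            return arg if value else ""
        if sep == "!":   # {field!text} -> text when missing
            return "" if value else arg
        return value     # plain {field}
    return re.sub(r"\{(\w+)(?:([:?!])([^}]*))?\}", sub, template)

fields = {"scheme": "https", "domain": "example.com", "query": "q=1"}
print(render("{scheme}://{domain}:{port:80}", fields))  # https://example.com:80
print(render("{domain}{query?&found}", fields))         # example.com&found
```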
### Template Examples

```shell
# Basic template
rexturl --urls https://example.com/api --format custom --template 'Host: {host}, Path: {path}'
# Output: Host: example.com, Path: /api

# With defaults
rexturl --urls https://example.com --format custom --template '{scheme}://{domain}:{port:80}'
# Output: https://example.com:80

# Conditional text
echo 'https://example.com/?q=1' | rexturl --format custom --template '{domain}{query?&found}'
# Output: example.com&found

# Shell escaping
rexturl --urls 'https://example.com/path with spaces' --format custom --template '{url}' --escape shell
# Output: 'https://example.com/path with spaces'
```
## SQL Output

Generate SQL INSERT statements from URL data:

```shell
# Basic SQL output
rexturl --urls https://example.com/path --format sql --fields domain,path
# INSERT INTO urls (domain, path) VALUES ('example.com', '/path');

# With CREATE TABLE
rexturl --urls https://example.com --format sql --sql-create-table --fields domain
# CREATE TABLE IF NOT EXISTS urls (
#   id SERIAL PRIMARY KEY,
#   domain VARCHAR(253),
#   created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
# );
# INSERT INTO urls (domain) VALUES ('example.com');

# Custom table and dialect
rexturl --urls https://example.com:3306 --format sql --sql-table my_urls --sql-dialect mysql --fields domain,port
# INSERT INTO my_urls (domain, port) VALUES ('example.com', '3306');
```
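The safety property behind "proper escaping" is that a single quote inside a string literal must be doubled in standard SQL. A minimal Python sketch of assembling such an INSERT (helper names are illustrative; rexturl additionally applies dialect-specific type mapping per `--sql-dialect`):

```python
def sql_escape(value: str) -> str:
    """Escape a string for use inside a single-quoted SQL literal."""
    return value.replace("'", "''")

def make_insert(table: str, row: dict) -> str:
    cols = ", ".join(row)
    vals = ", ".join(f"'{sql_escape(v)}'" for v in row.values())
    return f"INSERT INTO {table} ({cols}) VALUES ({vals});"

stmt = make_insert("urls", {"domain": "example.com", "path": "/o'brien"})
print(stmt)
# INSERT INTO urls (domain, path) VALUES ('example.com', '/o''brien');
```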
## Performance & Architecture

### URL Parser Implementation
- Custom URL parser with optimized component extraction
- Zero-copy parsing with minimal memory allocations
- Parallel processing using Rayon for bulk operations
### Architecture

- **Unified data model:** a single `UrlRecord` struct for all formats
- **Template engine:** flexible custom formatting with conditional logic
- **SQL generation:** multi-dialect support with proper type mapping
- **Predictable output:** same field order across all formats
- **Proper error handling:** exit codes and stderr for failures
- **Streaming support:** memory-efficient for large datasets
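The unified-data-model idea, one record type that every formatter consumes, can be sketched as follows. This is a hypothetical Python mirror with a reduced field set; the actual `UrlRecord` struct lives in the Rust source:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class UrlRecord:
    # Reduced field set for illustration; the real struct carries all fields.
    scheme: str
    domain: str
    path: str

    def to_tsv(self) -> str:
        # Field order comes from the dataclass, so it is identical everywhere.
        return "\t".join(asdict(self).values())

    def to_json(self) -> str:
        return json.dumps(asdict(self), separators=(",", ":"))

rec = UrlRecord("https", "example.com", "/posts")
print(rec.to_tsv())   # https<TAB>example.com<TAB>/posts
print(rec.to_json())  # {"scheme":"https","domain":"example.com","path":"/posts"}
```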
### Benchmarks

```text
fast_url_parsing          time: [823.79 ns 827.53 ns 831.87 ns]
fast_url_component_access time: [69.100 ns 69.309 ns 69.527 ns]
```
### Technical Details

- **Modular design:** separate parsing, formatting, and domain intelligence
- **Multi-part TLD support:** handles complex domains like `example.co.uk`
- **Memory efficient:** less than 1 KB overhead per URL
## Changelog
For a detailed list of changes and version history, see CHANGELOG.md.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes with proper tests
4. Ensure all tests pass (`cargo test`)
5. Run formatting and linting (`cargo fmt && cargo clippy`)
6. Commit your changes (`git commit -m 'Add some amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.