sxurl 0.1.0

Fixed-length, sliceable URL identifier system for efficient database storage and querying
Documentation
# SXURL

[![Crates.io](https://img.shields.io/crates/v/sxurl.svg)](https://crates.io/crates/sxurl)
[![Documentation](https://docs.rs/sxurl/badge.svg)](https://docs.rs/sxurl)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](#license)
[![Build Status](https://github.com/hjpithadia/sxurl/workflows/CI/badge.svg)](https://github.com/hjpithadia/sxurl/actions)

**SXURL** (pronounced "Sixerl") is a fixed-length, sliceable URL identifier system designed for efficient database storage and querying. It converts URLs into deterministic 256-bit identifiers where each URL component occupies a fixed position, enabling fast substring-based filtering and indexing.

## Features

- **๐Ÿ”’ Fixed-length**: All SXURL identifiers are exactly 256 bits (64 hex characters)
- **๐Ÿ“ Sliceable**: Each URL component has a fixed position for substring filtering
- **๐Ÿ”„ Deterministic**: Same input always produces the same output
- **๐Ÿ›ก๏ธ Collision-resistant**: Uses SHA-256 hashing for component fingerprinting
- **๐ŸŒ Standards-compliant**: Supports IDNA, Public Suffix List, and standard URL schemes
- **โšก Zero-copy**: Efficient parsing and encoding with minimal allocations
- **๐Ÿงช Thoroughly tested**: 100+ comprehensive tests covering edge cases

## Quick Start

Add this to your `Cargo.toml`:

```toml
[dependencies]
sxurl = "0.1"
```

### Basic Usage

```rust
use sxurl::{encode_url_to_hex, decode_hex, matches_component};

// Encode a URL to SXURL
let sxurl_hex = encode_url_to_hex("https://docs.rs/serde")?;
println!("SXURL: {}", sxurl_hex); // 64 hex characters

// Decode and inspect components
let decoded = decode_hex(&sxurl_hex)?;
println!("Scheme: {}", decoded.header.scheme);
println!("Has subdomain: {}", decoded.header.subdomain_present);

// Fast component matching for filtering
let is_docs_rs = matches_component(&sxurl_hex, "domain", "docs")?;
assert!(is_docs_rs);

let is_rs_tld = matches_component(&sxurl_hex, "tld", "rs")?;
assert!(is_rs_tld);
```

### URL Parsing Utilities

```rust
use sxurl::{split_url, parse_query, get_anchor, strip_anchor, join_url_path};

// Parse URL into all components at once
let parts = split_url("https://api.github.com/repos?page=1#readme")?;
println!("Domain: {}", parts.domain);        // "github"
println!("Subdomain: {:?}", parts.subdomain); // Some("api")
println!("Anchor: {:?}", parts.anchor);       // Some("readme")

// Work with query parameters
let params = parse_query("https://example.com?foo=bar&page=2")?;
println!("Page: {:?}", params.get("page"));   // Some("2")

// Handle anchors (fragments)
let anchor = get_anchor("https://docs.rs#search")?; // Some("search")
let clean_url = strip_anchor("https://example.com#top")?; // "https://example.com"

// Join URLs properly
let api_url = join_url_path("https://api.example.com/v1", "users")?;
// Result: "https://api.example.com/v1/users"
```

### Database Integration Example

```rust
use sxurl::encode_url_to_hex;

// Store URLs efficiently in your database
let urls = vec![
    "https://docs.rs/serde",
    "https://crates.io/crates/tokio",
    "https://github.com/rust-lang/rust",
];

for url in urls {
    let sxurl = encode_url_to_hex(url)?;
    // Store sxurl as CHAR(64) or BINARY(32) in your database
    println!("INSERT INTO urls (sxurl, original_url) VALUES ('{}', '{}')", sxurl, url);
}

// Query by domain efficiently using substring matching
// SELECT * FROM urls WHERE sxurl LIKE '_______________example_______________'
//                                       ^domain slice position (chars 7-22)
```

## SXURL Format

The 256-bit SXURL has this fixed layout:

| Component    | Hex Range | Bits | Description                |
|--------------|-----------|------|----------------------------|
| **header**   | [0..3)    | 12   | Version, scheme, flags     |
| **tld_hash** | [3..7)    | 16   | Top-level domain hash      |
| **domain**   | [7..22)   | 60   | Domain name hash           |
| **subdomain**| [22..30)  | 32   | Subdomain hash             |
| **port**     | [30..34)  | 16   | Port number                |
| **path**     | [34..49)  | 60   | Path hash                  |
| **params**   | [49..58)  | 36   | Query parameters hash      |
| **fragment** | [58..64)  | 24   | Fragment hash              |

### Header Format (12 bits)

- **Version** (4 bits): Currently always `1`
- **Scheme** (3 bits): `0`=https, `1`=http, `2`=ftp
- **Flags** (5 bits): Component presence indicators
  - Bit 4: Subdomain present
  - Bit 3: Query parameters present
  - Bit 2: Fragment present
  - Bit 1: Non-default port present
  - Bit 0: Reserved (always 0)

## Advanced Usage

### Custom Encoder

```rust
use sxurl::SxurlEncoder;

let encoder = SxurlEncoder::new();

// Encode to bytes
let sxurl_bytes = encoder.encode("https://example.com")?;
assert_eq!(sxurl_bytes.len(), 32); // Always 32 bytes

// Encode to hex string
let sxurl_hex = encoder.encode_to_hex("https://example.com")?;
assert_eq!(sxurl_hex.len(), 64); // Always 64 hex chars
```

### Component Filtering

```rust
use sxurl::{encode_url_to_hex, matches_component};

let urls = vec![
    "https://api.github.com/repos/rust-lang/rust",
    "https://docs.github.com/en/rest",
    "https://github.blog/2023-01-01/announcement",
];

// Find all GitHub URLs
for url in &urls {
    let sxurl = encode_url_to_hex(url)?;
    if matches_component(&sxurl, "domain", "github")? {
        println!("GitHub URL: {}", url);
    }
}

// Find all API endpoints
for url in &urls {
    let sxurl = encode_url_to_hex(url)?;
    if matches_component(&sxurl, "subdomain", "api")? {
        println!("API URL: {}", url);
    }
}
```

### Hash Function Access

```rust
use sxurl::ComponentHasher;

// Access individual hash functions
let tld_hash = ComponentHasher::hash_tld("com")?;
let domain_hash = ComponentHasher::hash_domain("example")?;
let path_hash = ComponentHasher::hash_path("/api/v1/users")?;

println!("TLD hash: 0x{:04x}", tld_hash);
println!("Domain hash: 0x{:015x}", domain_hash);
println!("Path hash: 0x{:015x}", path_hash);
```

## Supported URL Schemes

- **`https`** (scheme code 0) - Default port 443
- **`http`** (scheme code 1) - Default port 80
- **`ftp`** (scheme code 2) - Default port 21

## Use Cases

### Database Indexing
Store URLs as fixed-length identifiers with efficient B-tree indexing:

```sql
CREATE TABLE url_index (
    sxurl CHAR(64) PRIMARY KEY,
    original_url TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Index by domain (characters 7-22)
CREATE INDEX idx_domain ON url_index (SUBSTRING(sxurl, 8, 15));

-- Index by TLD (characters 3-7)
CREATE INDEX idx_tld ON url_index (SUBSTRING(sxurl, 4, 4));
```

### URL Deduplication
Quickly identify duplicate URLs across different formats:

```rust
use sxurl::encode_url_to_hex;
use std::collections::HashSet;

let mut seen_urls = HashSet::new();

let urls = vec![
    "https://Example.Com/Path",
    "https://example.com/path",    // Same after normalization
    "https://example.com/path/",   // Different path
];

for url in urls {
    let sxurl = encode_url_to_hex(url)?;
    if seen_urls.insert(sxurl) {
        println!("New URL: {}", url);
    } else {
        println!("Duplicate URL: {}", url);
    }
}
```

### Fast Domain Filtering
Filter large URL datasets by domain without parsing:

```rust
// Filter millions of URLs by domain using simple string operations
let domain_filter = "1a2b3c4d5e6f7890abc"; // Hash of target domain
let urls_in_domain: Vec<_> = all_sxurls
    .iter()
    .filter(|sxurl| &sxurl[7..22] == domain_filter)
    .collect();
```

## Error Handling

All functions return `Result<T, SxurlError>`. Common error cases:

```rust
use sxurl::{encode_url_to_hex, SxurlError};

match encode_url_to_hex("invalid-url") {
    Ok(sxurl) => println!("Success: {}", sxurl),
    Err(SxurlError::InvalidScheme) => println!("Unsupported URL scheme"),
    Err(SxurlError::HostNotDns) => println!("Host must be a DNS name, not IP"),
    Err(SxurlError::InvalidLength) => println!("Invalid SXURL format"),
    Err(e) => println!("Other error: {}", e),
}
```

## Performance

SXURL is designed for high-performance applications:

- **Encoding**: ~1-5 ฮผs per URL (depending on complexity)
- **Decoding**: ~100-500 ns per SXURL
- **Component matching**: ~50-100 ns per comparison
- **Memory usage**: Fixed 32 bytes per SXURL, minimal temporary allocations

Benchmarks on a modern CPU:
```
encode_simple_url      time: 1.2 ฮผs
encode_complex_url     time: 4.8 ฮผs
decode_sxurl          time: 245 ns
component_match       time: 67 ns
```

## Technical Details

### Hash Function
SXURL uses labeled SHA-256 hashing: `H_n(label, data) = lower_n(SHA256(label || 0x00 || data))`

- Collision resistance: ~2^(n/2) where n is the bit width
- Domain separation: Different labels produce different hashes for same data
- Deterministic: Same input always produces same output

### URL Normalization
- Scheme and host converted to lowercase
- IDNA (Internationalized Domain Names) support
- Public Suffix List (PSL) for proper domain/TLD splitting
- IP addresses rejected (DNS names only)

### Component Handling
- Empty components stored as zero (not hashed)
- Default ports (80, 443, 21) stored explicitly
- Query parameters and fragments preserved as-is
- Path normalization preserves original structure

## Testing

Run the comprehensive test suite:

```bash
cargo test                    # Run all tests
cargo test --doc              # Test documentation examples
cargo test --release          # Test optimized builds
cargo bench                   # Run performance benchmarks
```

The library includes 100+ tests covering:
- Specification compliance
- Edge cases and error conditions
- Round-trip consistency
- Hash collision resistance
- Performance regression testing

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

## Specification

SXURL follows a formal specification available in [SXURL-SPEC.md](SXURL-SPEC.md). The implementation is designed to be compatible with other SXURL implementations in different languages.

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for detailed release history.

## License

Licensed under the Apache License, Version 2.0 ([LICENSE](LICENSE) or http://www.apache.org/licenses/LICENSE-2.0).

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be licensed under the Apache License, Version 2.0, without any additional terms or conditions.

---

**SXURL** - Efficient, deterministic URL identifiers for modern applications.