pub fn normalize_url(url: &str) -> String
Normalizes a URL to its canonical form for deduplication.
Canonical form rules:
- Always use HTTPS (an http:// scheme is rewritten to https://)
- Remove www. prefix
- Remove trailing slash (except for root /)
- Remove index.html/index.php
- Sort query parameters alphabetically
- Remove fragment (#)
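The rules above can be illustrated with a minimal, std-only sketch. This is not the crate's implementation: the name `normalize_url_sketch` is hypothetical, the string splitting stands in for a real URL parser, and it assumes plain `http`/`https` URLs without ports or credentials. Sorting whole `key=value` pairs is likewise a simplification of "sort query parameters alphabetically".

```rust
/// Hypothetical sketch of the canonical form rules (not the crate's actual code).
/// Assumes simple http/https URLs; anything else is returned unchanged.
fn normalize_url_sketch(url: &str) -> String {
    // Split off the scheme. Inputs without an http(s) scheme are treated as
    // unparseable and returned as-is.
    let rest = match url.split_once("://") {
        Some((scheme, rest))
            if scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https") =>
        {
            rest
        }
        _ => return url.to_string(),
    };

    // Rule: remove the fragment (everything from the first '#').
    let rest = rest.split('#').next().unwrap_or("");

    // Separate the query string from host + path.
    let (before_query, query) = match rest.split_once('?') {
        Some((before, q)) => (before, Some(q)),
        None => (rest, None),
    };

    // Separate host and path; a missing path is treated as the root "/".
    let (host, path) = match before_query.find('/') {
        Some(i) => (&before_query[..i], &before_query[i..]),
        None => (before_query, "/"),
    };
    if host.is_empty() {
        return url.to_string();
    }

    // Rule: remove the www. prefix.
    let host = host.strip_prefix("www.").unwrap_or(host);

    // Rule: remove a trailing index.html / index.php path segment.
    let mut path: &str = path;
    for index in ["index.html", "index.php"] {
        if let Some(stripped) = path.strip_suffix(index) {
            if stripped.ends_with('/') {
                path = stripped;
                break;
            }
        }
    }

    // Rule: remove trailing slashes, except for the root path "/".
    while path.len() > 1 && path.ends_with('/') {
        path = &path[..path.len() - 1];
    }

    // Rule: always HTTPS.
    let mut out = format!("https://{}{}", host, path);

    // Rule: sort query parameters alphabetically (here: by the whole pair).
    if let Some(q) = query {
        let mut params: Vec<&str> = q.split('&').filter(|p| !p.is_empty()).collect();
        params.sort();
        if !params.is_empty() {
            out.push('?');
            out.push_str(&params.join("&"));
        }
    }
    out
}
```

The order of the steps matters in this sketch: the fragment is dropped before the query is split off, and the index file is stripped before the trailing slash, so that `/page/index.html` reduces to `/page` rather than `/page/`.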
§Arguments
url - The URL to normalize
§Returns
The normalized canonical URL string. Returns the original string unchanged if parsing fails.
§Examples
use essence::crawler::url_normalization::normalize_url;

assert_eq!(
    normalize_url("http://www.example.com/page/"),
    "https://example.com/page"
);
assert_eq!(
    normalize_url("https://example.com/page/index.html"),
    "https://example.com/page"
);
assert_eq!(
    normalize_url("https://example.com/page?z=1&a=2"),
    "https://example.com/page?a=2&z=1"
);