Function normalize_url 

pub fn normalize_url(url: &str) -> String

Normalizes a URL to its canonical form for deduplication.

Canonical form rules:

  1. Always use HTTPS (prefer the secure scheme)
  2. Remove the www. prefix
  3. Remove the trailing slash (except for the root /)
  4. Remove a trailing index.html or index.php
  5. Sort query parameters alphabetically
  6. Remove the fragment (#)
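
The six rules above can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual implementation: the name `normalize_url_sketch` is hypothetical, the parsing is plain string splitting rather than a real URL parser, and query parameters are sorted as whole `key=value` strings, which is an assumption about what "alphabetically" means here.

```rust
/// Hypothetical std-only sketch of the canonical form rules.
/// Not the real `normalize_url`; real-world URLs need a proper parser.
fn normalize_url_sketch(url: &str) -> String {
    // Treat anything without an http(s) scheme as unparseable and
    // return the input unchanged, mirroring the documented fallback.
    let Some((scheme, rest)) = url.split_once("://") else {
        return url.to_string();
    };
    if !scheme.eq_ignore_ascii_case("http") && !scheme.eq_ignore_ascii_case("https") {
        return url.to_string();
    }

    // Rule 6: drop the fragment.
    let rest = rest.split_once('#').map_or(rest, |(before, _)| before);

    // Split the query string from the host and path.
    let (base, query) = match rest.split_once('?') {
        Some((b, q)) => (b, Some(q)),
        None => (rest, None),
    };

    // Split host from path; a missing path is treated as the root.
    let (host, path) = match base.find('/') {
        Some(i) => (&base[..i], &base[i..]),
        None => (base, "/"),
    };
    if host.is_empty() {
        return url.to_string();
    }

    // Rule 2: remove the www. prefix.
    let host = host.strip_prefix("www.").unwrap_or(host);

    // Rule 4: remove a trailing index.html or index.php path segment.
    let mut path = path;
    for index in ["index.html", "index.php"] {
        if let Some(stripped) = path.strip_suffix(index) {
            if stripped.ends_with('/') {
                path = stripped;
                break;
            }
        }
    }

    // Rule 3: remove the trailing slash, except for the root path.
    let path = if path.len() > 1 {
        path.strip_suffix('/').unwrap_or(path)
    } else {
        path
    };

    // Rule 1: always emit the https scheme.
    let mut out = format!("https://{host}{path}");

    // Rule 5: sort query parameters (here, as whole key=value strings).
    if let Some(q) = query {
        let mut params: Vec<&str> = q.split('&').filter(|p| !p.is_empty()).collect();
        params.sort_unstable();
        if !params.is_empty() {
            out.push('?');
            out.push_str(&params.join("&"));
        }
    }
    out
}
```

Under these assumptions the sketch maps `https://example.com/page#top` to `https://example.com/page` and leaves a string such as `not a url` untouched. It does not handle ports, userinfo, percent-encoding, or host case, all of which a production normalizer would need to consider.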

§Arguments

  • url - The URL to normalize

§Returns

The normalized canonical URL string. Returns the original string if parsing fails.

§Examples

use essence::crawler::url_normalization::normalize_url;

assert_eq!(
    normalize_url("http://www.example.com/page/"),
    "https://example.com/page"
);

assert_eq!(
    normalize_url("https://example.com/page/index.html"),
    "https://example.com/page"
);

assert_eq!(
    normalize_url("https://example.com/page?z=1&a=2"),
    "https://example.com/page?a=2&z=1"
);