URL Normalization and Permutation Generation for Crawl Deduplication
This module normalizes URLs so that the same page is not scraped more than once when its URL appears in different permutations (www/non-www, http/https, trailing slash, etc.).
Expected impact: 5-10% crawl efficiency improvement by reducing duplicate requests.
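To make the idea concrete, here is a minimal sketch of what reducing a URL to a canonical form can look like, using only the standard library. The function name `normalize_sketch` and the specific choices (prefer `https`, drop `www.`, drop the trailing slash and fragment) are assumptions for illustration; they are not taken from this module's `normalize_url`, whose exact rules and signature are not shown here.

```rust
/// Illustrative only: a hypothetical canonicalizer, not this module's `normalize_url`.
/// Assumed rules: https scheme, lowercase host, no `www.`, no fragment, no trailing slash.
fn normalize_sketch(url: &str) -> String {
    let trimmed = url.trim();

    // Treat http and https as the same resource by discarding the scheme.
    // (Simplification: the scheme match is case-sensitive.)
    let rest = trimmed
        .strip_prefix("https://")
        .or_else(|| trimmed.strip_prefix("http://"))
        .unwrap_or(trimmed);

    // Fragments never reach the server, so drop them.
    let rest = rest.split('#').next().unwrap_or(rest);

    // Split the host from the path at the first slash.
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };

    // Host names are case-insensitive; also fold www and non-www together.
    let lower = host.to_ascii_lowercase();
    let bare_host = lower.strip_prefix("www.").unwrap_or(lower.as_str());

    // Fold trailing-slash and no-trailing-slash forms together.
    let path = path.trim_end_matches('/');

    format!("https://{bare_host}{path}")
}
```

With these assumed rules, `normalize_sketch("http://www.Example.com/page/")` and `normalize_sketch("https://example.com/page")` both yield `https://example.com/page`, so a crawler that keys its visited set on the normalized string would fetch the page once.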
Functions
- generate_url_permutations - Generate all URL permutations for deduplication (returns ~16 variations)
- normalize_url - Normalize URL to canonical form for deduplication
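The permutation approach can be sketched as follows. This is a hypothetical illustration, not the module's `generate_url_permutations`: the name `permutations_sketch` and its `(host, path)` parameters are assumptions. It varies only the three axes named in the module description (scheme, `www.` prefix, trailing slash), which gives 8 variants; the documented count of about 16 suggests the real function varies at least one more axis, which is not specified here.

```rust
/// Illustrative only: a hypothetical generator, not this module's
/// `generate_url_permutations`. Varies scheme, `www.` prefix and trailing slash.
fn permutations_sketch(host: &str, path: &str) -> Vec<String> {
    // Start from the bare host and a path without a trailing slash,
    // then re-add each optional piece in every combination.
    let bare = host.strip_prefix("www.").unwrap_or(host);
    let path = path.trim_end_matches('/');

    let mut out = Vec::new();
    for scheme in ["http", "https"] {
        for prefix in ["", "www."] {
            for slash in ["", "/"] {
                out.push(format!("{scheme}://{prefix}{bare}{path}{slash}"));
            }
        }
    }
    out // 2 schemes x 2 prefixes x 2 slash forms = 8 variants
}
```

A crawler could check each variant against its set of already-visited URLs before issuing a request, for example via `permutations_sketch("www.example.com", "/page/")`.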