# urlnorm
A URL normalization library for Rust, designed primarily to normalize URLs for https://progscrape.com.
The normalization algorithm uses the following heuristics:
- The scheme of the URL is dropped, so that `http://example.com` and `https://example.com` are considered equivalent.
- The host is normalized by dropping common prefixes such as `www.` and `m.`.
- The path is normalized by removing duplicate slashes and empty path segments, so that `http://example.com//foo/` and `http://example.com/foo` are considered equivalent.
- The query string parameters are sorted, and any analytics query parameters (e.g. `utm_XYZ` and the like) are removed.
- Fragments are dropped, with the exception of certain fragment patterns that are recognized as significant (`/#/` and `#!`). The sketch below illustrates these heuristics.
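A minimal sketch of these rules in action, assuming the `UrlNormalizer` API shown under Usage below (the `same` closure is a helper for this example, not part of the crate):

```rust
use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
// Helper for this sketch: two URLs are equivalent when their
// normalization strings compare equal.
let same = |a: &str, b: &str| {
    norm.compute_normalization_string(&Url::parse(a).unwrap())
        == norm.compute_normalization_string(&Url::parse(b).unwrap())
};

// Scheme is dropped.
assert!(same("http://example.com", "https://example.com"));
// Common host prefixes (`www.`, `m.`) are dropped.
assert!(same("http://www.example.com", "http://example.com"));
// Duplicate slashes and empty path segments are removed.
assert!(same("http://example.com//foo/", "http://example.com/foo"));
// Analytics parameters are stripped; remaining parameters are sorted.
assert!(same("http://example.com/?utm_source=feed", "http://example.com/"));
// Ordinary fragments are dropped.
assert!(same("http://example.com/page#section", "http://example.com/page"));
```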
## Usage
For long-term storage and clustering of URLs, it is recommended that [UrlNormalizer::compute_normalization_string] be used to compute a string representation of the URL that can be compared with standard string comparison operators.
```rust
use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
assert_eq!(norm.compute_normalization_string(&Url::parse("http://www.google.com").unwrap()), "google.com:");
```
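Because the normalization string can be compared as plain text, it also works well as a map key for clustering. A minimal sketch, assuming `compute_normalization_string` returns an owned `String` (as the comparison above suggests) and using illustrative input URLs:

```rust
use std::collections::HashMap;
use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
// Illustrative inputs: three spellings of the same story URL.
let urls = [
    "http://www.example.com/story",
    "https://example.com/story",
    "https://example.com//story/?utm_source=feed",
];

// Cluster by normalization string.
let mut clusters: HashMap<String, Vec<&str>> = HashMap::new();
for raw in urls {
    let key = norm.compute_normalization_string(&Url::parse(raw).unwrap());
    clusters.entry(key).or_default().push(raw);
}

// All three variants land in the same cluster.
assert_eq!(clusters.len(), 1);
```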
For more advanced use cases, the [Options] struct allows end-users to provide custom regular expressions for normalization.
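What that looks like in code depends on the crate version; the following is a purely hypothetical sketch, where the builder methods `with_extra_query_param_expression` and `compile` are illustrative assumptions rather than the crate's confirmed API (consult the [Options] docs for the real surface):

```rust
use urlnorm::Options;

// Hypothetical sketch: the method names below are illustrative
// assumptions, not the crate's confirmed API.
// Suppose we want to additionally strip a site-specific tracking
// parameter matched by a custom regular expression:
let normalizer = Options::default()
    .with_extra_query_param_expression("^ref_src$") // assumed builder method
    .compile()                                      // assumed finalizer
    .expect("custom patterns should compile");
```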