# urlnorm
A URL normalization library for Rust, designed primarily to normalize URLs for https://progscrape.com.
The normalization algorithm uses the following heuristics:
- The scheme of the URL is dropped, so that `http://example.com` and `https://example.com` are considered equivalent.
- The host is normalized by dropping common prefixes such as `www.` and `m.`.
- The path is normalized by removing duplicate slashes and empty path segments, so that `http://example.com//foo/` and `http://example.com/foo` are considered equivalent.
- The query string parameters are sorted, and any analytics query parameters (e.g. `utm_XYZ` and the like) are removed.
- Fragments are dropped, with the exception of certain fragment patterns that are recognized as significant (`/#/` and `#!`). The sketch below illustrates these heuristics.
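A minimal sketch of these rules in action, assuming the `UrlNormalizer` API shown under Usage below (the `same` closure is a helper for this example, not part of the crate):

```rust
use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
// Helper for this sketch: two URLs are equivalent when their
// normalization strings compare equal.
let same = |a: &str, b: &str| {
    norm.compute_normalization_string(&Url::parse(a).unwrap())
        == norm.compute_normalization_string(&Url::parse(b).unwrap())
};

// Scheme is dropped.
assert!(same("http://example.com", "https://example.com"));
// Common host prefixes (`www.`, `m.`) are dropped.
assert!(same("http://www.example.com", "http://example.com"));
// Duplicate slashes and empty path segments are removed.
assert!(same("http://example.com//foo/", "http://example.com/foo"));
// Analytics parameters are stripped; remaining parameters are sorted.
assert!(same("http://example.com/?utm_source=feed", "http://example.com/"));
// Ordinary fragments are dropped.
assert!(same("http://example.com/page#section", "http://example.com/page"));
```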
## Usage
For long-term storage and clustering of URLs, it is recommended that [UrlNormalizer::compute_normalization_string] be used to compute a string representation of the URL that can be compared with standard string comparison operators.
```rust
use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
assert_eq!(norm.compute_normalization_string(&Url::parse("http://www.google.com").unwrap()), "google.com:");
```
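Because the normalization string can be compared as plain text, it also works well as a map key for clustering. A minimal sketch, assuming `compute_normalization_string` returns an owned `String` (as the comparison above suggests) and using illustrative input URLs:

```rust
use std::collections::HashMap;
use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
// Illustrative inputs: three spellings of the same story URL.
let urls = [
    "http://www.example.com/story",
    "https://example.com/story",
    "https://example.com//story/?utm_source=feed",
];

// Cluster by normalization string.
let mut clusters: HashMap<String, Vec<&str>> = HashMap::new();
for raw in urls {
    let key = norm.compute_normalization_string(&Url::parse(raw).unwrap());
    clusters.entry(key).or_default().push(raw);
}

// All three variants land in the same cluster.
assert_eq!(clusters.len(), 1);
```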
For more advanced use cases, the [Options] struct allows end-users to provide custom regular expressions for normalization.
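What that looks like in code depends on the crate version; the following is a purely hypothetical sketch, where the builder methods `with_extra_query_param_expression` and `compile` are illustrative assumptions rather than the crate's confirmed API (consult the [Options] docs for the real surface):

```rust
use urlnorm::Options;

// Hypothetical sketch: the method names below are illustrative
// assumptions, not the crate's confirmed API.
// Suppose we want to additionally strip a site-specific tracking
// parameter matched by a custom regular expression:
let normalizer = Options::default()
    .with_extra_query_param_expression("^ref_src$") // assumed builder method
    .compile()                                      // assumed finalizer
    .expect("custom patterns should compile");
```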