Crate urlnorm

A URL normalization library for Rust, mainly designed to normalize URLs for clustering and duplicate detection.

The normalization algorithm uses the following heuristics:

  • The scheme of the URL is dropped, so that http://example.com and https://example.com are considered equivalent.
  • The host is normalized by dropping common prefixes such as www. and m..
  • The path is normalized by removing duplicate slashes and empty path segments, so that http://example.com/foo//bar and http://example.com/foo/bar are considered equivalent.
  • The query string parameters are sorted, and analytics query parameters (e.g., utm_XYZ and the like) are removed.
  • Fragments are dropped, except for certain fragment patterns that are recognized as significant (/#/ and #!).
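The heuristics above can be sketched with plain string handling. This is an illustrative, std-only approximation, not the crate's implementation: the real UrlNormalizer is regex-driven and handles many more cases (including the significant-fragment patterns):

```rust
// Illustrative sketch of the documented heuristics using only std string
// handling; the actual crate is more thorough and configurable.
fn normalize(url: &str) -> String {
    // 1. Drop the scheme.
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);

    // 2. Drop the fragment (the real crate preserves #! and /#/ patterns).
    let rest = rest.split('#').next().unwrap_or(rest);

    // Separate the query string from host + path.
    let (host_path, query) = match rest.split_once('?') {
        Some((hp, q)) => (hp, Some(q)),
        None => (rest, None),
    };

    // 3. Strip common host prefixes such as www. and m.
    let (host, path) = match host_path.split_once('/') {
        Some((h, p)) => (h, p),
        None => (host_path, ""),
    };
    let host = host
        .strip_prefix("www.")
        .or_else(|| host.strip_prefix("m."))
        .unwrap_or(host);

    // 4. Remove empty path segments, collapsing duplicate slashes.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();

    // 5. Sort query parameters and drop analytics ones (utm_*).
    let mut params: Vec<&str> = query
        .unwrap_or("")
        .split('&')
        .filter(|p| !p.is_empty() && !p.starts_with("utm_"))
        .collect();
    params.sort();

    let mut out = format!("{}/{}", host, segments.join("/"));
    if !params.is_empty() {
        out.push('?');
        out.push_str(&params.join("&"));
    }
    out
}

fn main() {
    // Scheme, www., duplicate slashes, utm_ parameter, and fragment all vanish.
    println!("{}", normalize("https://www.example.com//a/b?utm_source=x&z=1&a=2#frag"));
}
```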


For long-term storage and clustering of URLs, it is recommended that UrlNormalizer::compute_normalization_string is used to compute a representation of the URL that can be compared with standard string comparison operators.

The normalization strings are not a perfect clustering algorithm for content, but they will tend to cluster URLs pointing to the same data together. For a more accurate clustering algorithm, this library can be paired with a more advanced DUST-aware processing algorithm (for example, see DustBuster from “Do Not Crawl in the DUST: Different URLs with Similar Text”).
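Because the normalization strings are ordinary strings, clustering reduces to grouping by key. The sketch below shows the pattern; norm_key is a trivial, hypothetical stand-in for UrlNormalizer::compute_normalization_string (which applies the full set of heuristics):

```rust
use std::collections::HashMap;

// Trivial stand-in for compute_normalization_string: drops the scheme,
// a leading www., and any trailing slash. Illustration only.
fn norm_key(url: &str) -> String {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    rest.strip_prefix("www.")
        .unwrap_or(rest)
        .trim_end_matches('/')
        .to_string()
}

fn main() {
    let urls = [
        "http://example.com/story",
        "https://www.example.com/story/",
        "https://example.com/other",
    ];

    // Group URLs by their normalization key with standard string comparison.
    let mut clusters: HashMap<String, Vec<&str>> = HashMap::new();
    for u in urls {
        clusters.entry(norm_key(u)).or_default().push(u);
    }

    // The first two URLs share the key "example.com/story".
    println!("{}", clusters["example.com/story"].len());
}
```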

use url::Url;
use urlnorm::UrlNormalizer;

let norm = UrlNormalizer::default();
// Per the heuristics above, the scheme, the www. prefix, and utm_ parameters
// do not affect the normalization string, so these two URLs normalize equally.
let a = Url::parse("http://www.example.com/page?utm_source=x").unwrap();
let b = Url::parse("https://example.com/page").unwrap();
assert_eq!(norm.compute_normalization_string(&a), norm.compute_normalization_string(&b));

For more advanced use cases, the Options struct allows end-users to provide custom regular expressions for normalization.


The normalization string gives an idea of which parts of the URL are considered significant.


Structs

  • Options — Defines how URL normalization will work. This struct offers reasonable defaults, as well as a fluent interface for building normalization.
  • UrlNormalizer — A fully-constructed normalizer instance.