pub struct Normalizer;
Stateless namespace for text normalisation routines.
Normalizer is a unit struct with no fields; every method is an associated function.
It is a struct rather than a module of free functions purely so the
public API has a single, discoverable entry point.
use thing_matcher::Normalizer;
let canonical = Normalizer::normalize_name("José-María");
assert_eq!(canonical, "josemaria");
Implementations§
impl Normalizer
pub fn normalize_name(name: &str) -> String
Normalise a name for comparison.
Steps, in order:
- Decompose to Unicode NFKD form (é → e + combining acute).
- Drop combining marks (diacritics).
- Drop ASCII punctuation (apostrophes, hyphens, full stops, …).
- Lowercase.
- Collapse consecutive whitespace to single ASCII spaces; trim ends.
The result is suitable for direct equality comparison or for feeding into a string-similarity scorer.
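The steps that follow decomposition can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the crate's implementation: `normalize_decomposed_name` is a hypothetical helper, its input is assumed to be NFKD-decomposed already (for example by a Unicode normalisation crate), and "combining marks" are approximated by the single block U+0300..=U+036F.

```rust
/// Hypothetical helper (not part of this crate): applies the steps after
/// NFKD decomposition to input that is ASSUMED to be decomposed already.
fn normalize_decomposed_name(decomposed: &str) -> String {
    let filtered: String = decomposed
        .chars()
        // Drop combining marks. Assumption: only the Combining Diacritical
        // Marks block (U+0300..=U+036F) is treated as a combining mark here.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        // Drop ASCII punctuation (apostrophes, hyphens, full stops, ...).
        .filter(|c| !c.is_ascii_punctuation())
        // Lowercase; one char may lowercase to several chars.
        .flat_map(|c| c.to_lowercase())
        .collect();

    // Collapse runs of whitespace to single ASCII spaces and trim the ends.
    filtered.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Given the decomposed spelling of "José-María", `normalize_decomposed_name("Jose\u{0301}-Mari\u{0301}a")` returns `"josemaria"`. Filtering before lowercasing keeps the sketch in the same order as the list above.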
§Examples
Whitespace is collapsed and trimmed:
use thing_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name(" John Smith "), "john smith");
Apostrophes and hyphens are stripped:
assert_eq!(Normalizer::normalize_name("O'Brien"), "obrien");
assert_eq!(Normalizer::normalize_name("MARY-JANE"), "maryjane");
Diacritics are removed:
assert_eq!(Normalizer::normalize_name("Siân"), "sian");
assert_eq!(Normalizer::normalize_name("café"), "cafe");
// Letters with an integral stroke do not decompose under NFKD, so
// they pass through (lowercased), while the combining acute on `ó`
// and `ź` is stripped:
assert_eq!(Normalizer::normalize_name("Łódź"), "łodz");
pub fn normalize_text(text: &str) -> String
Normalise free-form text (descriptions, etc.) for similarity scoring.
Like Normalizer::normalize_name, but keeps ASCII punctuation, because
punctuation carries information in longer text (sentence boundaries,
abbreviations) that should not be discarded.
Steps, in order:
- Decompose to Unicode NFKD form.
- Drop combining marks (diacritics).
- Lowercase.
- Collapse consecutive whitespace to single ASCII spaces; trim ends.
§Examples
use thing_matcher::Normalizer;
assert_eq!(
Normalizer::normalize_text(" The Eiffel Tower, in Paris. "),
"the eiffel tower, in paris.",
);
assert_eq!(
Normalizer::normalize_text("café au lait"),
"cafe au lait",
);
pub fn normalize_url(url: &str) -> String
Normalise a URL for equality comparison.
This normalisation is good enough for matching, but it is not a full URL canonicalisation:
- Trim surrounding whitespace.
- Lowercase the scheme and host portions (HTTPS://Example.ORG → https://example.org). The path is left case-sensitive.
- Drop a trailing slash from a root path (https://x.org/ → https://x.org). Non-root trailing slashes are kept, because /foo and /foo/ are legitimately different on many servers.
- Drop a #fragment suffix: fragments are not sent in HTTP requests, so they never select a different resource on the server.
No percent-encoding canonicalisation is attempted; callers that need strict canonical URLs should pre-process the input.
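The rules above can be approximated with plain string splitting. The following is a minimal sketch rather than the crate's code: `normalize_url_sketch` is a hypothetical name, "URL-shaped" is assumed to mean "contains `://`", the fragment is dropped before the root-slash check (an ordering assumption), and everything between `://` and the first `/` or `?` is treated as the host.

```rust
/// Hypothetical sketch of the normalisation rules described above.
fn normalize_url_sketch(url: &str) -> String {
    let trimmed = url.trim();

    // Assumption: anything without "://" is an opaque identifier,
    // returned trimmed and lowercased.
    let Some((scheme, rest)) = trimmed.split_once("://") else {
        return trimmed.to_lowercase();
    };

    // Drop the #fragment suffix, if any.
    let rest = rest.split_once('#').map_or(rest, |(before, _)| before);

    // Split the host part from the path/query part.
    let split_at = rest
        .find(|c: char| c == '/' || c == '?')
        .unwrap_or(rest.len());
    let (host, tail) = rest.split_at(split_at);

    // A bare root path loses its slash; any other path is kept verbatim,
    // including its case and any trailing slash.
    let tail = if tail == "/" { "" } else { tail };

    format!("{}://{}{}", scheme.to_lowercase(), host.to_lowercase(), tail)
}
```

For instance, `normalize_url_sketch("HTTPS://Example.ORG/")` returns `"https://example.org"`, while `"http://X.org/Foo"` becomes `"http://x.org/Foo"` with the path case preserved.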
§Examples
use thing_matcher::Normalizer;
assert_eq!(
Normalizer::normalize_url("HTTPS://Example.ORG/"),
"https://example.org",
);
assert_eq!(
Normalizer::normalize_url(" https://EXAMPLE.org/foo "),
"https://example.org/foo",
);
assert_eq!(
Normalizer::normalize_url("https://example.org/foo/#bar"),
"https://example.org/foo/",
);
Strings that are not URL-shaped are returned trimmed and lowercased so they remain comparable as opaque identifiers:
assert_eq!(Normalizer::normalize_url(" URN:ISBN:0451450523 "), "urn:isbn:0451450523");
pub fn phonetic_code(name: &str) -> String
Computes a Soundex-like phonetic code for a mostly-ASCII name. The code is used as a coarse blocking key and as the gate for the phonetic bonus in the matcher.
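A blocking key partitions records so that expensive pairwise comparison only runs within each partition. The sketch below illustrates that idea in isolation; `block_by_key` and `first_letter_key` are hypothetical helpers, not part of this crate, and the first-letter key is only a stand-in for a real phonetic code.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: groups names into blocks that share a key.
/// Only names inside the same block need to be compared pairwise.
fn block_by_key<'a, F>(names: &[&'a str], key_fn: F) -> BTreeMap<String, Vec<&'a str>>
where
    F: Fn(&str) -> String,
{
    let mut blocks: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &name in names {
        blocks.entry(key_fn(name)).or_default().push(name);
    }
    blocks
}

/// Stand-in key for illustration only: the uppercased first character.
fn first_letter_key(name: &str) -> String {
    name.chars()
        .next()
        .map(|c| c.to_uppercase().collect::<String>())
        .unwrap_or_default()
}
```

Calling `block_by_key(&["Stephen", "Steven", "Mary"], first_letter_key)` puts the first two names in block `"S"` and the third in block `"M"`. Because `key_fn` accepts any `Fn(&str) -> String`, a function with the signature of `Normalizer::phonetic_code` could be passed in its place.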
Implementation note: delegates to the soundex crate after first
applying Normalizer::normalize_name. Returns an empty string
when the input is empty or normalises to an empty string.
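For readers unfamiliar with Soundex, here is a minimal sketch of the classic American Soundex rules. It is an illustration only: `soundex_sketch` and `soundex_digit` are hypothetical names, the code does not call the soundex crate, and the crate's exact output format may differ from the four-character form produced here.

```rust
/// Maps a lowercase ASCII letter to its Soundex digit, if it has one.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None, // vowels, y, h, w
    }
}

/// Hypothetical sketch of classic American Soundex: first letter plus
/// three digits, zero-padded. Returns an empty string for input with
/// no ASCII letters.
fn soundex_sketch(name: &str) -> String {
    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase());

    let first = match letters.next() {
        Some(c) => c,
        None => return String::new(),
    };

    let mut code = String::with_capacity(4);
    code.push(first.to_ascii_uppercase());
    // The first letter's own digit suppresses an identical following digit.
    let mut last = soundex_digit(first);

    for c in letters {
        if code.len() == 4 {
            break;
        }
        // 'h' and 'w' are transparent: they do not separate equal digits.
        if c == 'h' || c == 'w' {
            continue;
        }
        let digit = soundex_digit(c);
        if let Some(d) = digit {
            if digit != last {
                code.push(d);
            }
        }
        // A vowel resets `last`, so a repeated digit after it is emitted.
        last = digit;
    }

    while code.len() < 4 {
        code.push('0');
    }
    code
}
```

Under these rules "Stephen" and "Steven" both encode to `S315`, which is why the two spellings compare equal in the example below.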
§Examples
use thing_matcher::Normalizer;
let a = Normalizer::phonetic_code("Stephen");
let b = Normalizer::phonetic_code("Steven");
assert!(!a.is_empty());
assert_eq!(a, b);
Auto Trait Implementations§
impl Freeze for Normalizer
impl RefUnwindSafe for Normalizer
impl Send for Normalizer
impl Sync for Normalizer
impl Unpin for Normalizer
impl UnsafeUnpin for Normalizer
impl UnwindSafe for Normalizer