Struct Normalizer

pub struct Normalizer;

Stateless namespace for text normalisation routines.

Normalizer is a unit struct with no fields; every method is an associated function, called as Normalizer::method(...) without constructing a value. It is a struct rather than a module of free functions purely so that the public API has a single, discoverable entry point.

use thing_matcher::Normalizer;

let canonical = Normalizer::normalize_name("José-María");
assert_eq!(canonical, "josemaria");
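
The namespace pattern described above can be sketched in isolation. The type `TextTools` and its function `shout` are hypothetical names invented for this illustration and are not part of thing_matcher; the point is only that a unit struct occupies no memory and its associated functions are called through the type name.

```rust
/// Hypothetical unit struct used purely as a namespace (not part of the crate).
pub struct TextTools;

impl TextTools {
    /// Associated function: there is no `self` parameter, so callers never
    /// need to construct a `TextTools` value.
    pub fn shout(text: &str) -> String {
        text.to_uppercase()
    }
}
```

Callers write `TextTools::shout("abc")`, which mirrors how `Normalizer::normalize_name` is invoked.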

Implementations§

impl Normalizer

pub fn normalize_name(name: &str) -> String

Normalise a name for comparison.

Steps, in order:

  1. Decompose to Unicode NFKD form (é → e + combining acute).
  2. Drop combining marks (diacritics).
  3. Drop ASCII punctuation (apostrophes, hyphens, full stops, …).
  4. Lowercase.
  5. Collapse consecutive whitespace to single ASCII spaces; trim ends.

The result is suitable for direct equality comparison or for feeding into a string-similarity scorer.
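
Steps 2 to 5 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's implementation: the helper name `normalize_decomposed_name` is hypothetical, the input is assumed to be in NFKD form already (step 1 needs Unicode tables that `std` does not ship), and only the basic Combining Diacritical Marks block (U+0300 to U+036F) is treated as a combining mark.

```rust
/// Hypothetical helper: applies steps 2-5 to text that is ALREADY in NFKD form.
/// Step 1 (NFKD decomposition) is assumed to have been done upstream.
fn normalize_decomposed_name(decomposed: &str) -> String {
    let filtered: String = decomposed
        .chars()
        // Step 2: drop combining marks. Assumption: only the block
        // U+0300..=U+036F is checked, which covers common Latin diacritics.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        // Step 3: drop ASCII punctuation (apostrophes, hyphens, full stops, ...).
        .filter(|c| !c.is_ascii_punctuation())
        // Step 4: lowercase; one char may lowercase to several chars.
        .flat_map(char::to_lowercase)
        .collect();

    // Step 5: split on any Unicode whitespace and rejoin with single ASCII
    // spaces, which also trims both ends.
    filtered.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Under these assumptions, the pre-decomposed input `"Jose\u{301}-Mari\u{301}a"` comes out as `"josemaria"`.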

§Examples

Whitespace is collapsed and trimmed:

use thing_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name("  John  Smith  "), "john smith");

Apostrophes and hyphens are stripped:

assert_eq!(Normalizer::normalize_name("O'Brien"),    "obrien");
assert_eq!(Normalizer::normalize_name("MARY-JANE"),  "maryjane");

Diacritics are removed:

assert_eq!(Normalizer::normalize_name("Siân"),    "sian");
assert_eq!(Normalizer::normalize_name("café"),    "cafe");
// Letters with an integral stroke do not decompose under NFKD, so
// they pass through (lowercased), while the combining acute on `ó`
// and `ź` is stripped:
assert_eq!(Normalizer::normalize_name("Łódź"),    "łodz");

pub fn normalize_text(text: &str) -> String

Normalise free-form text (descriptions, etc.) for similarity scoring.

Like Normalizer::normalize_name, but keeps ASCII punctuation — punctuation carries information in longer text (sentence boundaries, abbreviations) that should not be discarded.

Steps, in order:

  1. Decompose to Unicode NFKD form.
  2. Drop combining marks (diacritics).
  3. Lowercase.
  4. Collapse consecutive whitespace to single ASCII spaces; trim ends.

§Examples

use thing_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_text("  The Eiffel Tower, in Paris.  "),
    "the eiffel tower, in paris.",
);
assert_eq!(
    Normalizer::normalize_text("café au lait"),
    "cafe au lait",
);

pub fn normalize_url(url: &str) -> String

Normalise a URL for equality comparison.

The transformation preserves enough of the URL for matching but is not a full URL canonicalisation:

  1. Trim surrounding whitespace.
  2. Lowercase the scheme and host portions (HTTPS://Example.ORG → https://example.org). The path is left case-sensitive.
  3. Drop a trailing slash from a root path (https://x.org/ → https://x.org). Non-root trailing slashes are kept, because /foo and /foo/ are legitimately different on many servers.
  4. Drop a #fragment suffix. Fragments are not sent over HTTP and do not identify a different resource.

No percent-encoding canonicalisation is attempted; callers that need strict canonical URLs should pre-process the input.
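
The four steps can be sketched with plain string operations. This is a minimal illustration, not the crate's code: the function name `normalize_url_sketch` is hypothetical, "URL-shaped" is assumed to mean "contains `://`", and everything between `://` and the first `/` or `?` is treated as the host (so any userinfo or port would be lowercased along with it).

```rust
/// Hypothetical sketch of the documented steps; not the crate's implementation.
fn normalize_url_sketch(url: &str) -> String {
    // Step 1: trim surrounding whitespace.
    let trimmed = url.trim();

    // Step 4 (applied early for simplicity): keep only the text before '#'.
    let without_fragment = trimmed.split('#').next().unwrap_or(trimmed);

    // Assumption: an input without "://" is not URL-shaped and is returned
    // trimmed and lowercased as an opaque identifier.
    let scheme_end = match without_fragment.find("://") {
        Some(index) => index + "://".len(),
        None => return trimmed.to_lowercase(),
    };
    let (scheme, rest) = without_fragment.split_at(scheme_end);

    // The host runs up to the first '/' or '?', or to the end of the string.
    let host_end = rest
        .find(|c: char| c == '/' || c == '?')
        .unwrap_or(rest.len());
    let (host, path) = rest.split_at(host_end);

    // Step 3: a bare root path "/" is dropped; any other path is kept as-is.
    let path = if path == "/" { "" } else { path };

    // Step 2: lowercase scheme and host only; the path stays case-sensitive.
    format!("{}{}{}", scheme.to_lowercase(), host.to_lowercase(), path)
}
```

Cutting the fragment before inspecting the path means that an input such as `https://x.org/#top` also loses its root slash in this sketch; whether the crate orders the steps the same way is not stated here.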

§Examples
use thing_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_url("HTTPS://Example.ORG/"),
    "https://example.org",
);
assert_eq!(
    Normalizer::normalize_url("  https://EXAMPLE.org/foo  "),
    "https://example.org/foo",
);
assert_eq!(
    Normalizer::normalize_url("https://example.org/foo/#bar"),
    "https://example.org/foo/",
);

Strings that are not URL-shaped are returned trimmed and lowercased so they remain comparable as opaque identifiers:

assert_eq!(Normalizer::normalize_url("  URN:ISBN:0451450523  "), "urn:isbn:0451450523");

pub fn phonetic_code(name: &str) -> String

Soundex-like phonetic code for an ASCII-ish name, used as a coarse blocking key and as the gate for the phonetic bonus in the matcher.

Implementation note: delegates to the soundex crate after first applying Normalizer::normalize_name. Returns an empty string when the input is empty or normalises to an empty string.
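
To show what a Soundex-style code looks like, here is a standalone sketch of classic American Soundex. It is not the soundex crate's code and may differ from it in edge cases; the names `soundex_digit` and `soundex_sketch` are hypothetical, and non-ASCII letters are simply skipped rather than normalised first.

```rust
/// Maps a lowercase ASCII letter to its Soundex digit, if it has one.
fn soundex_digit(letter: char) -> Option<char> {
    match letter {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None, // vowels, 'h', 'w', 'y' carry no digit
    }
}

/// Hypothetical sketch of classic American Soundex (letter + three digits).
fn soundex_sketch(name: &str) -> String {
    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase());

    // An input with no ASCII letters yields an empty code.
    let first = match letters.next() {
        Some(letter) => letter,
        None => return String::new(),
    };

    let mut code = String::new();
    code.push(first.to_ascii_uppercase());
    let mut last = soundex_digit(first);

    for letter in letters {
        if code.len() == 4 {
            break;
        }
        // 'h' and 'w' are transparent: they do not separate equal digits.
        if letter == 'h' || letter == 'w' {
            continue;
        }
        let digit = soundex_digit(letter);
        if let Some(d) = digit {
            // Adjacent letters with the same digit are coded once.
            if digit != last {
                code.push(d);
            }
        }
        // Vowels set `last` to None, so a repeated digit after a vowel counts.
        last = digit;
    }

    // Pad short codes with zeros to the fixed length of four.
    while code.len() < 4 {
        code.push('0');
    }
    code
}
```

In this sketch both "Stephen" and "Steven" produce `S315`, which is the kind of collision a blocking key relies on.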

§Examples
use thing_matcher::Normalizer;
let a = Normalizer::phonetic_code("Stephen");
let b = Normalizer::phonetic_code("Steven");
assert!(!a.is_empty());
assert_eq!(a, b);
