Struct Normalizer

pub struct Normalizer;

Stateless namespace for text normalisation routines.

Normalizer is a unit struct with no fields, and every method is an associated function (none takes self). It is a struct rather than a module of free functions purely so that the public API has a single, discoverable entry point.

use thing_matcher::Normalizer;

let canonical = Normalizer::normalize_name("José-María");
assert_eq!(canonical, "josemaria");
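
The unit-struct-as-namespace pattern described above can be sketched as follows. TextTools and shout are hypothetical names used only for illustration; they are not part of thing_matcher.

```rust
/// Hypothetical unit struct acting as a namespace (not part of thing_matcher).
pub struct TextTools;

impl TextTools {
    /// Associated function: there is no `self` parameter, so callers never
    /// need to construct a `TextTools` value.
    pub fn shout(input: &str) -> String {
        input.to_uppercase()
    }
}
```

A call then reads TextTools::shout("hi"), the same shape as Normalizer::normalize_name(...). A unit struct is zero-sized, so grouping functions this way adds no runtime cost.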

Implementations§

impl Normalizer

pub fn normalize_name(name: &str) -> String

Normalise a name for comparison.

Steps, in order:

1. Decompose to Unicode NFKD form (é → e + combining acute).
2. Drop combining marks (diacritics).
3. Drop ASCII punctuation (apostrophes, hyphens, full stops, …).
4. Lowercase.
5. Collapse consecutive whitespace to single ASCII spaces; trim ends.
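
The steps above can be sketched with the standard library alone, under some loudly stated assumptions. The standard library has no Unicode normalisation, so this sketch skips step 1 and expects input that is already NFKD-decomposed; a real implementation would need a Unicode normalisation library for that step. It also treats only the U+0300..=U+036F block as combining marks. The name normalize_decomposed is hypothetical and this is not the crate's actual code.

```rust
/// Hypothetical sketch of steps 2-5, for input that is ALREADY NFKD-decomposed.
/// Assumption: only the U+0300..=U+036F block counts as combining marks.
fn normalize_decomposed(input: &str) -> String {
    let filtered: String = input
        .chars()
        // Step 2: drop combining marks.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        // Step 3: drop ASCII punctuation.
        .filter(|c| !c.is_ascii_punctuation())
        // Step 4: lowercase (one char may expand to several).
        .flat_map(char::to_lowercase)
        .collect();

    // Step 5: split_whitespace skips leading and trailing whitespace and
    // treats any run of whitespace as one separator.
    filtered.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With a pre-decomposed input such as "Jose\u{301}-Mari\u{301}a", the sketch yields "josemaria", matching the documented behaviour for "José-María".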

The result is suitable for direct equality comparison or for feeding into a string-similarity scorer.

§Examples

Whitespace is collapsed and trimmed:

use thing_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name("  John  Smith  "), "john smith");

Apostrophes and hyphens are stripped:

use thing_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name("O'Brien"), "obrien");
assert_eq!(Normalizer::normalize_name("MARY-JANE"), "maryjane");

Diacritics are removed:

use thing_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name("Siân"), "sian");
assert_eq!(Normalizer::normalize_name("café"), "cafe");
// Letters with an integral stroke do not decompose under NFKD, so
// they pass through (lowercased), while the combining acute on `ó`
// and `ź` is stripped:
assert_eq!(Normalizer::normalize_name("Łódź"), "łodz");

pub fn normalize_text(text: &str) -> String

Normalise free-form text (descriptions, etc.) for similarity scoring.

Like Normalizer::normalize_name, but keeps ASCII punctuation, because punctuation in longer text carries information (sentence boundaries, abbreviations) that should not be discarded.

Steps, in order:

1. Decompose to Unicode NFKD form.
2. Drop combining marks (diacritics).
3. Lowercase.
4. Collapse consecutive whitespace to single ASCII spaces; trim ends.

§Examples

use thing_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_text("  The Eiffel Tower, in Paris.  "),
    "the eiffel tower, in paris.",
);
assert_eq!(
    Normalizer::normalize_text("café au lait"),
    "cafe au lait",
);

pub fn normalize_url(url: &str) -> String

Normalise a URL for equality comparison.

The transformation is conservative enough for matching but is not a full URL canonicalisation:

- Trim surrounding whitespace.
- Lowercase the scheme and host portions (HTTPS://Example.ORG → https://example.org). The path is left case-sensitive.
- Drop a trailing slash from a root path (https://x.org/ → https://x.org). Non-root trailing slashes are kept, because /foo and /foo/ are legitimately different on many servers.
- Drop a #fragment suffix: fragments are not sent to the server in an HTTP request, so they do not select a different server-side resource.

No percent-encoding canonicalisation is attempted; callers that need strict canonical URLs should pre-process the input.
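
The documented behaviour can be approximated with a short standard-library sketch. Everything here is an assumption rather than the crate's implementation: the name normalize_url_sketch is hypothetical, a string counts as "URL-shaped" only if it contains "://", the fragment is dropped before the root-slash check, and the whole authority (including any userinfo or port) is treated as the host.

```rust
/// Hypothetical re-implementation of the documented steps; not the crate's code.
fn normalize_url_sketch(url: &str) -> String {
    // Trim surrounding whitespace.
    let trimmed = url.trim();

    // Assumption: "URL-shaped" means the string contains "://".
    // Anything else is returned trimmed and lowercased.
    let Some((scheme, rest)) = trimmed.split_once("://") else {
        return trimmed.to_lowercase();
    };

    // Drop the fragment: keep only what precedes the first '#'.
    let rest = rest.split('#').next().unwrap_or("");

    // Assumption: the host (authority) ends at the first '/' or '?'.
    let host_end = rest
        .find(|c: char| c == '/' || c == '?')
        .unwrap_or(rest.len());
    let (host, tail) = rest.split_at(host_end);

    // A path that is exactly "/" is dropped; longer paths keep their slash.
    let tail = if tail == "/" { "" } else { tail };

    // Lowercase scheme and host only; path and query stay case-sensitive.
    format!("{}://{}{}", scheme.to_lowercase(), host.to_lowercase(), tail)
}
```

Splitting the host from the rest before lowercasing is what keeps the path case-sensitive, and comparing the remainder against exactly "/" is what preserves non-root trailing slashes.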

§Examples

use thing_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_url("HTTPS://Example.ORG/"),
    "https://example.org",
);
assert_eq!(
    Normalizer::normalize_url("  https://EXAMPLE.org/foo  "),
    "https://example.org/foo",
);
assert_eq!(
    Normalizer::normalize_url("https://example.org/foo/#bar"),
    "https://example.org/foo/",
);

Strings that are not URL-shaped are returned trimmed and lowercased, so they remain comparable as opaque identifiers:

use thing_matcher::Normalizer;
assert_eq!(Normalizer::normalize_url("  URN:ISBN:0451450523  "), "urn:isbn:0451450523");

pub fn phonetic_code(name: &str) -> String

Soundex-like phonetic code for a mostly-ASCII name, used as a coarse blocking key and as the gate for the phonetic bonus in the matcher.

Implementation note: delegates to the soundex crate after first applying Normalizer::normalize_name. Returns an empty string when the input is empty or normalises to an empty string.
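
For readers unfamiliar with Soundex, the following is a minimal sketch of the classic American Soundex algorithm (first letter plus three digits). It is an illustration only: soundex_digit and soundex_sketch are hypothetical names, the sketch does not call normalize_name or the soundex crate, and the exact output format of phonetic_code may differ.

```rust
/// Soundex digit for an ASCII letter, or None for vowels and h/w/y.
fn soundex_digit(c: char) -> Option<char> {
    match c.to_ascii_lowercase() {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None,
    }
}

/// Hypothetical classic Soundex; non-ASCII-alphabetic characters are ignored.
fn soundex_sketch(name: &str) -> String {
    let mut letters = name.chars().filter(|c| c.is_ascii_alphabetic());
    let Some(first) = letters.next() else {
        // Nothing to encode: mirror the documented empty-string result.
        return String::new();
    };

    let mut code = String::from(first.to_ascii_uppercase());
    let mut prev = soundex_digit(first);

    for c in letters {
        let digit = soundex_digit(c);
        if let Some(d) = digit {
            // Adjacent letters with the same digit are encoded once.
            if digit != prev {
                code.push(d);
            }
        }
        // 'h' and 'w' do not separate equal digits; vowels do.
        if !matches!(c.to_ascii_lowercase(), 'h' | 'w') {
            prev = digit;
        }
        if code.len() == 4 {
            break;
        }
    }

    // Pad short codes with zeros to a fixed length of four.
    while code.len() < 4 {
        code.push('0');
    }
    code
}
```

Under this sketch "Stephen" and "Steven" share a code because p and v map to the same digit and the h in "Stephen" is skipped, which is the kind of collision a blocking key relies on.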

§Examples

use thing_matcher::Normalizer;
let a = Normalizer::phonetic_code("Stephen");
let b = Normalizer::phonetic_code("Steven");
assert!(!a.is_empty());
assert_eq!(a, b);

Auto Trait Implementations§

impl Freeze for Normalizer

impl RefUnwindSafe for Normalizer

impl Send for Normalizer

impl Sync for Normalizer

impl Unpin for Normalizer

impl UnsafeUnpin for Normalizer

impl UnwindSafe for Normalizer

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.