Crate thing_matcher

§Thing matcher

A Rust library for matching records that describe schema.org/Thing entities. The crate implements both deterministic and probabilistic matching algorithms.

The library is stateless, free of panics in library code, and reproducible: the same inputs always yield the same result, in either matching mode. Its types are Send + Sync, so an engine can be shared freely across threads.

§What it does

Given two Thing records — typically drawn from different source systems — the MatchingEngine decides whether they refer to the same item. The output is either a hard boolean (deterministic) or a scored MatchResult with a per-field matcher::MatchBreakdown so an auditor or downstream system can inspect the decision.

The data model follows schema.org/Thing — the root type used to describe any kind of item on the web. The crate compares the identity-bearing properties of that vocabulary: name, alternateName, description, disambiguatingDescription, identifier, url, image, sameAs, mainEntityOfPage, additionalType, subjectOf, and owner.

§Crate layout

Module      Purpose
models      Data types: Thing, ThingBuilder, Identifier.
normalizer  Text normalisation: names, free text, URLs, phonetic codes.
scorer      String-similarity and set-similarity primitives.
matcher     Orchestration: MatchingEngine, MatchConfig, MatchResult.
error       Error enum MatchingError and Result alias.
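
The layout above names the normalizer and scorer modules without showing what such primitives look like. As a rough illustration only, the sketch below normalises two names and computes a token-level Jaccard similarity using the standard library. The functions normalize_name and token_jaccard are hypothetical and are not part of this crate's API; the crate's own normalisation and scoring rules may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical normaliser: lowercase, replace punctuation with spaces,
/// and collapse runs of whitespace. Not the crate's actual rules.
fn normalize_name(input: &str) -> String {
    let lowered = input.to_lowercase();
    let cleaned: String = lowered
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical set-similarity primitive: Jaccard index over word tokens.
/// Returns `None` when either side has no tokens to compare.
fn token_jaccard(a: &str, b: &str) -> Option<f64> {
    let na = normalize_name(a);
    let nb = normalize_name(b);
    // BTreeSet keeps iteration order stable, so results are reproducible.
    let ta: BTreeSet<&str> = na.split_whitespace().collect();
    let tb: BTreeSet<&str> = nb.split_whitespace().collect();
    if ta.is_empty() || tb.is_empty() {
        return None;
    }
    let intersection = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    Some(intersection / union)
}

fn main() {
    println!("{:?}", normalize_name("  La Tour-Eiffel! "));
    println!("{:?}", token_jaccard("Eiffel Tower", "Tour Eiffel"));
    println!("{:?}", token_jaccard("", "Tour Eiffel"));
}
```

Returning an Option here mirrors the convention described elsewhere in this page, where a field that cannot be compared yields None instead of a zero score.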

§Quick start — probabilistic match

use thing_matcher::{MatchingEngine, MatchConfig, Thing};

let a = Thing::builder()
    .name("Eiffel Tower")
    .add_alternate_name("La Tour Eiffel")
    .url("https://www.toureiffel.paris/")
    .build();

let b = Thing::builder()
    .name("Tour Eiffel")
    .url("https://www.toureiffel.paris/")
    .build();

let engine = MatchingEngine::new(MatchConfig::default());
let result = engine.match_things(&a, &b);

assert!(result.is_match);

§Inspecting the per-field breakdown

Every probabilistic match returns a per-field score so the decision is auditable end-to-end. Missing or unparseable fields score None rather than zero, so they do not penalise the overall score.

use thing_matcher::{MatchingEngine, Thing};

let p = Thing::builder()
    .name("Big Ben")
    .url("https://en.wikipedia.org/wiki/Big_Ben")
    .build();
let q = p.clone();

let result = MatchingEngine::default_config().match_things(&p, &q);
assert!(result.breakdown.name_score.unwrap() > 0.99);
assert_eq!(result.breakdown.url_score, Some(1.0));
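
To make the "None rather than zero" rule concrete, here is a minimal sketch of one way per-field scores could be combined so that missing fields are skipped. The FieldScore type, the weights, and the combine function are assumptions made for illustration; the crate's actual aggregation is not documented on this page.

```rust
/// Hypothetical per-field result. `None` means the field was missing or
/// unparseable on at least one side, so no comparison was possible.
#[derive(Debug, Clone, Copy)]
struct FieldScore {
    score: Option<f64>,
    weight: f64,
}

/// Weighted mean over the fields that produced a score.
/// Fields scoring `None` contribute neither to the sum nor to the total
/// weight. Returns `None` when no field could be compared at all.
fn combine(fields: &[FieldScore]) -> Option<f64> {
    let mut weighted_sum = 0.0;
    let mut total_weight = 0.0;
    for field in fields {
        if let Some(score) = field.score {
            weighted_sum += score * field.weight;
            total_weight += field.weight;
        }
    }
    if total_weight > 0.0 {
        Some(weighted_sum / total_weight)
    } else {
        None
    }
}

fn main() {
    let fields = [
        FieldScore { score: Some(1.0), weight: 2.0 }, // e.g. name
        FieldScore { score: None, weight: 1.0 },      // e.g. url missing
        FieldScore { score: Some(0.5), weight: 2.0 }, // e.g. description
    ];
    // The missing field is skipped: (2.0 + 1.0) / 4.0.
    println!("{:?}", combine(&fields));
}
```

Had the missing field been scored as zero, its weight would still count in the denominator and the combined value would drop to 3.0 / 5.0, which is the penalty the None convention avoids.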

§Configuration presets

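Three configurations cover most use cases: MatchConfig::strict, the default, and MatchConfig::lenient. Use MatchConfig::strict when false positives are costly and callers must be able to rely on a positive answer; use MatchConfig::lenient to triage large candidate sets where false negatives are worse than false positives.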

use thing_matcher::{MatchConfig, MatchingEngine};

let strict   = MatchingEngine::new(MatchConfig::strict());
let default  = MatchingEngine::default_config();
let lenient  = MatchingEngine::new(MatchConfig::lenient());

// All three engines share the same scoring pipeline; only the
// threshold and a couple of weights differ.
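
The effect of a preset can be pictured as the same score crossing different thresholds. The sketch below assumes a hypothetical Preset type with made-up threshold values (0.90, 0.80, 0.65); the real values used by MatchConfig are not stated on this page, and the weights that also differ between presets are left out.

```rust
/// Hypothetical preset holding only a decision threshold.
/// The numbers below are illustrative, not the crate's real values.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Preset {
    threshold: f64,
}

impl Preset {
    fn strict() -> Self {
        Preset { threshold: 0.90 }
    }

    /// Stand-in for the default configuration.
    fn balanced() -> Self {
        Preset { threshold: 0.80 }
    }

    fn lenient() -> Self {
        Preset { threshold: 0.65 }
    }

    /// A pair is accepted when its score reaches the threshold.
    fn is_match(&self, score: f64) -> bool {
        score >= self.threshold
    }
}

fn main() {
    let score = 0.85; // one score, produced by one shared pipeline
    for (name, preset) in [
        ("strict", Preset::strict()),
        ("balanced", Preset::balanced()),
        ("lenient", Preset::lenient()),
    ] {
        println!("{name}: {}", preset.is_match(score));
    }
}
```

Under these assumed thresholds a borderline pair is rejected by the strict preset and accepted by the other two, which is the trade-off between false positives and false negatives described above.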

§Determinism and safety

  • Deterministic. Same inputs => same outputs. No clocks, no RNGs, no environment variables.
  • No unsafe. This crate forbids unsafe code.
  • No IO. The library does not log, read files, or open sockets.
  • No panics. Library code paths do not panic; a fallible input yields either None from a scorer or a MatchingError.
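
Guarantees such as determinism and thread safety can be checked from the caller's side. The following sketch uses a hypothetical Engine stand-in (not the crate's MatchingEngine) to show a compile-time Send + Sync assertion and a shared engine used from several scoped threads.

```rust
use std::thread;

/// Hypothetical stand-in for an engine that holds only plain configuration.
#[derive(Debug, Clone)]
struct Engine {
    threshold: f64,
}

impl Engine {
    fn is_match(&self, score: f64) -> bool {
        score >= self.threshold
    }
}

/// Compiles only if `T` is both `Send` and `Sync`.
fn assert_send_sync<T: Send + Sync>() {}

/// Evaluates each score on its own scoped thread, sharing one engine by
/// reference. Results come back in input order because handles are joined
/// in the order they were spawned.
fn run_on_threads(engine: &Engine, scores: &[f64]) -> Vec<bool> {
    thread::scope(|s| {
        let handles: Vec<_> = scores
            .iter()
            .map(|&score| s.spawn(move || engine.is_match(score)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    assert_send_sync::<Engine>();
    let engine = Engine { threshold: 0.8 };
    let first = run_on_threads(&engine, &[0.5, 0.8, 0.95]);
    let second = run_on_threads(&engine, &[0.5, 0.8, 0.95]);
    // Same inputs, same outputs, regardless of thread scheduling.
    assert_eq!(first, second);
    println!("{first:?}");
}
```

The same assert_send_sync helper can be pointed at any type a caller intends to share, which turns a documented thread-safety claim into a check the compiler enforces.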

§Re-exports

pub use error::MatchingError;
pub use error::Result;
pub use matcher::Confidence;
pub use matcher::MatchBreakdown;
pub use matcher::MatchConfig;
pub use matcher::MatchResult;
pub use matcher::MatchingEngine;
pub use models::Identifier;
pub use models::Thing;
pub use models::ThingBuilder;
pub use normalizer::Normalizer;
pub use scorer::Scorer;
pub use scorer::SimilarityAlgorithm;

§Modules

error
Error types for thing-matcher operations.
matcher
Thing matcher engine: deterministic and probabilistic algorithms.
models
Data models for things, aligned with schema.org/Thing.
normalizer
Text normalisation for Thing matching.
scorer
Scoring algorithms for string similarity and field comparison.