Thing matcher Rust Crate
A Rust library for deciding whether two records describe the same
schema.org/Thing.
What it does
thing-matcher compares pairs of [Thing] records — books, articles,
landmarks, products, organisations, people, events, or any other entity
that can be described with the
schema.org/Thing vocabulary — and tells you
whether they refer to the same item. It is built for de-duplication and
record linkage across data sources that disagree on names, encodings, and
identifier schemes.
The crate provides two strategies behind one engine:
- Deterministic: a hard bool when both records share a scheme-scoped identifier (Wikidata QID, ISBN, DOI, GTIN, …), or a sameAs reference URL, or a canonical url.
- Probabilistic: a weight-renormalised score in [0.0, 1.0] over name, description, disambiguatingDescription, identifier, url, sameAs, image, mainEntityOfPage, and additionalType, with a per-field [MatchBreakdown] so every decision is auditable.
The library is pure: no IO, no clocks, no RNGs, #![forbid(unsafe_code)],
Send + Sync. It is suitable as a leaf dependency under web servers,
batch jobs, or notebooks.
Installation
[dependencies]
thing-matcher = "0.4"
Quick start — probabilistic match
use ;
let a = builder
.name
.url
.build;
let b = builder
.name
.add_alternate_name
.url
.build;
let engine = default_config;
let result = engine.match_things;
assert!;
println!;
println!;
println!;
The MatchBreakdown carries one Option<f64> per scored field; a None
means the field was absent on at least one side and so did not
contribute. Missing fields neither raise nor lower the score.
Quick start — deterministic match
A shared external identifier — any (property_id, value) pair across
the two records — is enough on its own:
use ;
let id = new.unwrap;
let a = builder.name.add_identifier.build;
let b = builder.name.add_identifier.build;
let engine = default_config;
assert!;
A shared sameAs URL or a shared canonical url is also accepted as a
deterministic match — sameAs exists in schema.org precisely to point
at "a reference Web page that unambiguously indicates the item's
identity".
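The deterministic rule described above can be sketched without the crate. This is a minimal illustration, not the crate's implementation: the `Record` struct and `deterministic_match` function are hypothetical stand-ins, and the URLs are assumed to be normalised already.

```rust
/// Hypothetical, simplified record; NOT the crate's `Thing` type.
struct Record {
    identifiers: Vec<(String, String)>, // (property_id, value) pairs
    same_as: Vec<String>,               // assumed already normalised
    url: Option<String>,                // assumed already normalised
}

/// True when the records share an identifier pair, a sameAs URL,
/// or a canonical url. Any one of the three is sufficient.
fn deterministic_match(a: &Record, b: &Record) -> bool {
    // Both the scheme and the value must agree for an identifier hit.
    if a.identifiers.iter().any(|id| b.identifiers.contains(id)) {
        return true;
    }
    if a.same_as.iter().any(|u| b.same_as.contains(u)) {
        return true;
    }
    // A missing url on either side never counts as a match.
    match (&a.url, &b.url) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

Note that the identifier comparison is scheme-scoped: the same value under two different schemes (for example an ISBN and a DOI that happen to share digits) does not count as shared.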
The Thing model
The Thing data model mirrors the
schema.org/Thing vocabulary:
| Rust field | schema.org property | Purpose |
|---|---|---|
| name | name | Primary canonical name. |
| alternate_names | alternateName | Aliases, endonyms, translations. The matcher takes the best score across the cartesian product. |
| description | description | Free-form description. Compared as text. |
| disambiguating_description | disambiguatingDescription | Short disambiguating description. Compared as text. |
| identifiers | identifier (as PropertyValue) | Scheme-scoped external identifiers. Sharing one is a deterministic match. |
| url | url | Canonical URL. Compared after URL normalisation. |
| image | image | URL of a representative image. |
| same_as | sameAs | Reference URLs that unambiguously identify the item. |
| main_entity_of_page | mainEntityOfPage | Page (URL) for which this thing is the main entity. |
| additional_types | additionalType | Additional types from external vocabularies (e.g. https://schema.org/Landmark). |
| subject_of | subjectOf | Works or events about this thing (URLs). |
| owner | owner | Person or organisation that owns this thing. |
| local_id | (none) | Originating-system identifier. Not scored. |
Build records via the fluent [Thing::builder]; all setters accept
impl Into<String>.
The match pipeline
- Each scoring component yields Some(score) in [0.0, 1.0] or None (missing on at least one side).
- The weighted sum runs over components that scored.
- The weighted sum is divided by the sum of the participating weights; this renormalisation ensures missing fields do not penalise the score.
- A phonetic-name match (Soundex on normalised names) adds a 0.05-weighted bonus when the gating phonetic score exceeds 0.9, only when use_phonetic_matching is on. The bonus never lowers a score.
- is_match = score >= match_threshold (strict mode additionally requires deterministic_match).
- confidence = Confidence::from_score(score); bands are fixed (>= 0.90 High, >= 0.75 Medium, else Low) and independent of match_threshold.
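The renormalisation and confidence-band steps can be sketched as follows. This is an illustration only: `renormalised_score`, `band`, and `Band` are hypothetical names, the phonetic bonus is left out, and returning `None` when no component scored is an assumption rather than documented behaviour.

```rust
/// Hypothetical stand-in for the crate's confidence bands.
#[derive(Debug, PartialEq)]
enum Band {
    High,
    Medium,
    Low,
}

/// Weighted mean over the components that produced a score.
/// Each entry is (weight, score); `None` means the field was
/// missing on at least one side and is skipped entirely.
fn renormalised_score(components: &[(f64, Option<f64>)]) -> Option<f64> {
    let mut weighted_sum = 0.0;
    let mut weight_total = 0.0;
    for &(weight, score) in components {
        if let Some(s) = score {
            weighted_sum += weight * s;
            weight_total += weight;
        }
    }
    // Assumption: with no scored component there is no score at all.
    if weight_total > 0.0 {
        Some(weighted_sum / weight_total)
    } else {
        None
    }
}

/// Fixed bands: >= 0.90 High, >= 0.75 Medium, else Low.
fn band(score: f64) -> Band {
    if score >= 0.90 {
        Band::High
    } else if score >= 0.75 {
        Band::Medium
    } else {
        Band::Low
    }
}
```

For example, a name score of 0.9 (weight 0.30) and an identifier score of 1.0 (weight 0.25) with every other field missing gives 0.52 / 0.55, about 0.945, rather than the 0.52 an unnormalised sum would produce.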
Default weights
| Component | Weight | Notes |
|---|---|---|
| Name | 0.30 | Best of cartesian product across primary + alternates, via the configured SimilarityAlgorithm (default Combined = 0.7 × Jaro-Winkler + 0.3 × Levenshtein). |
| Description | 0.10 | Combined similarity over the normalised text. |
| Disambiguating description | 0.05 | Combined similarity over the normalised text. |
| Identifiers | 0.25 | 1.0 if any (property_id, value) pair is shared, 0.0 if both non-empty but no overlap, None if either empty. |
| URL | 0.05 | Exact equality after URL normalisation. |
| sameAs | 0.15 | Jaccard set similarity over normalised URLs. |
| Image | 0.03 | Exact equality after URL normalisation. |
| mainEntityOfPage | 0.02 | Exact equality after URL normalisation. |
| additionalType | 0.05 | Jaccard set similarity over normalised URIs. |
| Phonetic bonus | +0.05 when gated | Bonus only; never lowers a score. |
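Two of the component rules in the table are simple enough to sketch directly. The functions below are hypothetical illustrations, not the crate's code; inputs are assumed to be normalised already, and treating an empty side as `None` for the Jaccard component is an assumption carried over from the identifier rule.

```rust
use std::collections::HashSet;

/// Identifier component: None if either side is empty,
/// 1.0 if any (property_id, value) pair is shared, otherwise 0.0.
fn identifier_score(a: &[(&str, &str)], b: &[(&str, &str)]) -> Option<f64> {
    if a.is_empty() || b.is_empty() {
        return None;
    }
    let shared = a.iter().any(|pair| b.contains(pair));
    Some(if shared { 1.0 } else { 0.0 })
}

/// Jaccard similarity |A ∩ B| / |A ∪ B| over already-normalised strings.
fn jaccard(a: &[&str], b: &[&str]) -> Option<f64> {
    if a.is_empty() || b.is_empty() {
        return None; // assumption: an empty side counts as "missing"
    }
    let a: HashSet<&str> = a.iter().copied().collect();
    let b: HashSet<&str> = b.iter().copied().collect();
    let intersection = a.intersection(&b).count() as f64;
    let union = a.union(&b).count() as f64;
    Some(intersection / union)
}
```

With these definitions, two sameAs sets that share one URL out of three distinct URLs score 1/3, while a single shared identifier pair scores 1.0 regardless of how many other identifiers differ.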
Configuration presets
use ;
let strict = new; // threshold 0.95, requires deterministic
let default = default_config; // threshold 0.80
let lenient = new; // threshold 0.65, phonetic on
- Default (0.80): balanced for everyday de-duplication.
- Strict (0.95): for downstream systems that must rely on the answer; is_match additionally requires deterministic_match. score and confidence are unaffected.
- Lenient (0.65): for triaging large candidate sets where false negatives are costlier than false positives.
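How the presets change the final decision can be sketched as below. The `Preset` struct and its `require_deterministic` field are hypothetical names chosen for illustration; only the thresholds and the strict-mode rule come from the descriptions above.

```rust
/// Hypothetical, reduced view of a preset; NOT the crate's `MatchConfig`.
struct Preset {
    match_threshold: f64,
    require_deterministic: bool, // assumed field name
}

/// The score must reach the threshold; strict mode also needs a deterministic hit.
fn is_match(score: f64, deterministic: bool, preset: &Preset) -> bool {
    score >= preset.match_threshold && (!preset.require_deterministic || deterministic)
}
```

Under this sketch a pair scoring 0.97 with no shared identifier, sameAs, or url is a match in default and lenient mode but not in strict mode, while its score and confidence band stay the same in all three.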
Every field of MatchConfig is overridable; the config is Serialize + Deserialize (with #[serde(default)]) so tunings can live in a file:
use MatchConfig;
let cfg: MatchConfig = from_str.unwrap;
Batch scoring
use ;
let engine = default_config;
let query = builder.name.build;
let candidates = vec!;
// Parallel to the input slice:
let results = engine.match_one_to_many;
// Sorted by descending score, deterministic tiebreak on original index:
let ranked = engine.rank_one_to_many;
let = &ranked;
println!;
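The ranking order described in the comments above (descending score, ties broken by original index) can be sketched with the standard library alone. `rank_indices` is a hypothetical helper for illustration, not the crate's `rank_one_to_many`.

```rust
/// Return candidate indices ordered by descending score;
/// equal scores keep their original relative order (lower index first).
fn rank_indices(scores: &[f64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // total_cmp gives a total order on f64, so the sort is deterministic.
    order.sort_by(|&i, &j| scores[j].total_cmp(&scores[i]).then(i.cmp(&j)));
    order
}
```

The explicit index tiebreak means the output does not depend on the sort algorithm's stability, which keeps repeated runs over the same input identical.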
The engine is Send + Sync, so calls can be parallelised with rayon's
par_iter (or any other parallelism primitive) without changes to this
crate. Candidate pre-filtering (Soundex prefix blocking, identifier-scheme
blocking, sameAs-prefix blocking) is intentionally a consumer concern.
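As one example of consumer-side pre-filtering, identifier-scheme blocking might look like the sketch below. Everything here is hypothetical: candidates are reduced to their (property_id, value) pairs, and `block_by_scheme` is an illustrative helper, not part of the crate.

```rust
use std::collections::HashMap;

/// Group candidate indices by identifier scheme, so a query carrying
/// an ISBN is only scored against candidates that also carry an ISBN.
fn block_by_scheme<'a>(candidates: &[Vec<(&'a str, &'a str)>]) -> HashMap<&'a str, Vec<usize>> {
    let mut blocks: HashMap<&'a str, Vec<usize>> = HashMap::new();
    for (index, identifiers) in candidates.iter().enumerate() {
        for (scheme, _value) in identifiers {
            let bucket = blocks.entry(*scheme).or_default();
            // Indices arrive in increasing order, so checking the last
            // entry is enough to avoid listing a candidate twice.
            if bucket.last() != Some(&index) {
                bucket.push(index);
            }
        }
    }
    blocks
}
```

Candidates without any identifier fall into no block under this scheme, so a real pipeline would likely combine it with a second blocking key (for example a name-based one) rather than rely on it alone.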
Determinism and safety
- #![forbid(unsafe_code)] at the crate root.
- No IO. The library does not read files, open sockets, or log.
- No clocks, no RNGs, no environment variables. Same inputs always produce the same outputs.
- No panics in library code; every fallible parser returns None and every fallible operation returns Result.
- Send + Sync. Engines are immutable after construction and cheap to clone.
- Serde-clean. Every public data type round-trips through serde_json (and any other serde format).
Limitations / out of scope
- Not a triple store. This crate does not store or query schema.org graphs; it only compares pairs of in-memory Thing records.
- No JSON-LD parser. Inputs are built via the [Thing::builder]; callers are responsible for translating schema.org JSON-LD to the Thing shape if they need to ingest it.
- No full URL canonicalisation. [Normalizer::normalize_url] lowercases the scheme and host and trims a root trailing slash, but does not perform percent-encoding canonicalisation or punycode decoding.
- No machine learning. Scoring is rule-based and transparent; weights are tuneable but the algorithm is fixed.
- No persistence layer. The crate scores pairs in memory; storage and indexing belong upstream.
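To make the URL limitation concrete, here is a standard-library sketch of the normalisation as described: lowercase the scheme and host, drop the slash when the path is only "/". It is not the crate's `Normalizer::normalize_url`; the function name, the reading of "root trailing slash", and the handling of inputs without "://" are assumptions.

```rust
/// Hypothetical sketch: lowercase scheme and host, trim a lone root slash.
/// Assumes no userinfo ("user:pass@") in the authority; no percent-encoding
/// or punycode handling is attempted.
fn normalize_url_sketch(input: &str) -> String {
    let trimmed = input.trim();
    let (scheme, rest) = match trimmed.split_once("://") {
        Some(parts) => parts,
        None => return trimmed.to_string(), // not an absolute URL: leave as-is
    };
    // The authority ends at the first '/', '?' or '#'.
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let (host, tail) = rest.split_at(end);
    // Only a bare root path is trimmed; deeper paths keep their slash and case.
    let tail = if tail == "/" { "" } else { tail };
    format!(
        "{}://{}{}",
        scheme.to_ascii_lowercase(),
        host.to_ascii_lowercase(),
        tail
    )
}
```

Under this sketch "HTTPS://Example.ORG/" and "https://example.org" compare equal, but "https://example.org/a%20b" and "https://example.org/a b" do not, which is the kind of gap the limitation above refers to.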
License
MIT OR Apache-2.0 OR GPL-2.0 OR GPL-3.0 OR BSD-3-Clause — see
LICENSE.md.
Contributing
Contributions welcome. Before opening a PR:
- cargo fmt
- cargo clippy --all-targets -- -D warnings
- cargo test
See AGENTS.md for the working guide.
Contact
Joel Parker Henderson — joel@joelparkerhenderson.com.