//! # Thing matcher
//!
//! A Rust library for matching records that describe `schema.org/Thing`
//! entities. The crate implements both **deterministic** and
//! **probabilistic** matching algorithms.
//!
//! The library is **deterministic**, **stateless**, **panic-free** in
//! library code, and **`Send + Sync`** so it can be used freely across
//! threads.
//!
//! ## What it does
//!
//! Given two [`Thing`] records — typically drawn from different source
//! systems — the [`MatchingEngine`] decides whether they refer to the
//! same item. The output is either a hard boolean (deterministic) or a
//! scored [`MatchResult`] with a per-field [`matcher::MatchBreakdown`] so
//! an auditor or downstream system can inspect the decision.
//!
//! The data model follows `schema.org/Thing` — the root type used to
//! describe any kind of item on the web. The crate compares the
//! identity-bearing properties of that vocabulary: `name`,
//! `alternateName`, `description`, `disambiguatingDescription`,
//! `identifier`, `url`, `image`, `sameAs`, `mainEntityOfPage`,
//! `additionalType`, `subjectOf`, and `owner`.
//!
//! ## Crate layout
//!
//! | Module | Purpose |
//! |---|---|
//! | [`models`] | Data types: [`Thing`], [`ThingBuilder`], [`Identifier`]. |
//! | [`normalizer`] | Text normalisation: names, free text, URLs, phonetic codes. |
//! | [`scorer`] | String-similarity and set-similarity primitives. |
//! | [`matcher`] | Orchestration: [`MatchingEngine`], [`MatchConfig`], [`MatchResult`]. |
//! | [`error`] | Error enum [`MatchingError`] and [`Result`] alias. |
//!
//! ## Quick start — probabilistic match
//!
//! ```
//! use thing_matcher::{MatchingEngine, MatchConfig, Thing};
//!
//! let a = Thing::builder()
//!     .name("Eiffel Tower")
//!     .add_alternate_name("La Tour Eiffel")
//!     .url("https://www.toureiffel.paris/")
//!     .build();
//!
//! let b = Thing::builder()
//!     .name("Tour Eiffel")
//!     .url("https://www.toureiffel.paris/")
//!     .build();
//!
//! let engine = MatchingEngine::new(MatchConfig::default());
//! let result = engine.match_things(&a, &b);
//!
//! assert!(result.is_match);
//! ```
//!
//! ## Inspecting the per-field breakdown
//!
//! Every probabilistic match returns a per-field score so the decision is
//! auditable end-to-end. Missing or unparseable fields score `None`
//! rather than zero, so an absent field does not penalise the pair.
//!
//! ```
//! use thing_matcher::{MatchingEngine, Thing};
//!
//! let p = Thing::builder()
//!     .name("Big Ben")
//!     .url("https://en.wikipedia.org/wiki/Big_Ben")
//!     .build();
//! let q = p.clone();
//!
//! let result = MatchingEngine::default_config().match_things(&p, &q);
//! assert!(result.breakdown.name_score.unwrap() > 0.99);
//! assert_eq!(result.breakdown.url_score, Some(1.0));
//! ```
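//!
//! One way to read "score `None` rather than zero" is as a weighted mean
//! taken only over the fields that produced a score. The sketch below is
//! a standalone illustration under that assumption, not this crate's
//! actual aggregation; `combine_scores` and its `(score, weight)` pairs
//! are hypothetical names.
//!
//! ```rust
//! /// Hypothetical helper: weighted mean over the fields that scored.
//! /// Fields whose score is `None` contribute neither score nor weight.
//! fn combine_scores(fields: &[(Option<f64>, f64)]) -> Option<f64> {
//!     let mut weighted_sum = 0.0_f64;
//!     let mut weight_total = 0.0_f64;
//!     for (score, weight) in fields {
//!         if let Some(s) = score {
//!             weighted_sum += s * weight;
//!             weight_total += weight;
//!         }
//!     }
//!     // With no comparable field at all there is nothing to average.
//!     if weight_total > 0.0 {
//!         Some(weighted_sum / weight_total)
//!     } else {
//!         None
//!     }
//! }
//! ```
//!
//! Under this assumption, two records that agree on every field they
//! both carry are not dragged down by a field that only one of them has.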
//!
//! ## Configuration presets
//!
//! Three configurations cover most use cases: strict, default, and
//! lenient. Use [`MatchConfig::strict`] when callers must rely on the
//! answer; use [`MatchConfig::lenient`] to triage large candidate sets
//! where false negatives are worse than false positives.
//!
//! ```
//! use thing_matcher::{MatchConfig, MatchingEngine};
//!
//! let strict = MatchingEngine::new(MatchConfig::strict());
//! let default = MatchingEngine::default_config();
//! let lenient = MatchingEngine::new(MatchConfig::lenient());
//!
//! // All three engines share the same scoring pipeline; only the
//! // threshold and a couple of weights differ.
//! # let _ = (strict, default, lenient);
//! ```
//!
//! ## Determinism and safety
//!
//! - **Deterministic.** Same inputs => same outputs. No clocks, no RNGs,
//!   no environment variables.
//! - **No `unsafe`.** This crate forbids `unsafe` code.
//! - **No IO.** The library does not log, read files, or open sockets.
//! - **No panics** in library code paths; every fallible input returns
//!   `None` from a scorer or a [`MatchingError`].
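//!
//! Because the engine is stateless and `Send + Sync`, a single instance
//! can be borrowed by several threads at once. The sketch below shows
//! that pattern with a hypothetical stand-in type; `Engine`, its
//! `threshold` field, `is_match`, and `run_parallel` are illustrative
//! names and not part of this crate's API.
//!
//! ```rust
//! use std::thread;
//!
//! /// Hypothetical stand-in for a stateless, thread-safe engine.
//! struct Engine {
//!     threshold: f64,
//! }
//!
//! impl Engine {
//!     /// Pure function of its inputs: no clocks, no RNG, no IO.
//!     fn is_match(&self, score: f64) -> bool {
//!         score >= self.threshold
//!     }
//! }
//!
//! /// Evaluates each score on its own scoped thread, sharing one engine
//! /// by reference. This only compiles because `Engine` is `Sync`.
//! fn run_parallel(engine: &Engine, scores: &[f64]) -> Vec<bool> {
//!     thread::scope(|scope| {
//!         // Spawn every worker first so they run concurrently.
//!         let handles: Vec<_> = scores
//!             .iter()
//!             .map(|&s| scope.spawn(move || engine.is_match(s)))
//!             .collect();
//!         // Join in spawn order so results line up with the inputs;
//!         // a panicked worker is reported as "no match".
//!         handles
//!             .into_iter()
//!             .map(|h| h.join().unwrap_or(false))
//!             .collect()
//!     })
//! }
//! ```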
// Always start with high quality coding conventions.
// Every method here is a pure computation or an owned-`self` builder step, so
// annotating each one with `#[must_use]` adds noise without catching real bugs.
// Scores are deterministic sentinel values (exactly `0.0` / `1.0`), so the
// tests compare them with `assert_eq!` on purpose.
// The similarity math casts small, bounded counts (string lengths, set sizes)
// to `f64`; the values never approach the 52-bit mantissa limit.
pub use ;
pub use ;
pub use ;
pub use Normalizer;
pub use ;