Event matcher Rust Crate
A Rust library for deciding whether two records describe the same event, modelled on schema.org/Event.
Documentation index:
index.md is the top-level map of every doc in this repo (spec, AGENTS guides, CHANGELOG, examples). Start there if you're new.
What it does
event-matcher compares pairs of [Event] records (festivals, conferences, concerts, sports fixtures, screenings, hackathons, meetups, theatre runs) and tells you whether they refer to the same event. It is built for de-duplication and record linkage across event-data sources that disagree on titles, formatting, schedules, venues, and identifier schemes.
The crate provides two strategies behind one engine:
- Deterministic: a hard bool from a shared external event ID (Eventbrite, Meetup, Ticketmaster, Songkick, Wikidata, …) or from an identical normalised name plus an identical start_date instant.
- Probabilistic: a weight-renormalised score in [0.0, 1.0] over name, start/end date (Gaussian decay over the absolute difference in seconds), location (venue, address, coordinates), category, country code, external IDs, organiser, performers, and URL, with a per-field [MatchBreakdown] so every decision is auditable.
The library is pure: no IO, no clocks, no RNGs, #![forbid(unsafe_code)], Send + Sync. It is suitable as a leaf dependency under web servers, batch jobs, or notebooks.
Installation
[dependencies]
event-matcher = "0.4"
Quick start — probabilistic match
use ;
let a = builder
.name
.start_date
.end_date
.build;
let b = builder
.name
.add_alternate_name
.start_date
.end_date
.build;
let engine = default_config;
let result = engine.match_events;
assert!;
println!;
println!;
println!;
The MatchBreakdown carries one Option<f64> per scored field; a None means the field was absent on at least one side and so did not contribute. Missing fields neither raise nor lower the score.
Quick start — deterministic match
A shared external event ID (any (scheme, value) pair across the two records) is enough on its own:
use ;
let id = new.unwrap;
let a = builder.name.add_event_id.build;
let b = builder.name.add_event_id.build;
let engine = default_config;
assert!;
Identical normalised name plus a start_date that parses to the same instant is also accepted as a deterministic match (useful when no shared external ID is available, and naturally tolerant of equivalent ISO 8601 offsets such as 2024-09-10T09:00:00Z and 2024-09-10T11:00:00+02:00).
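The two deterministic rules can be sketched as a standalone function. This is a minimal illustration, not the crate's implementation: the `Record` struct and its fields are hypothetical stand-ins for `Event`, and the sketch assumes names are already normalised and start dates are already parsed to Unix seconds, which is why equivalent ISO 8601 offsets compare equal.

```rust
/// Hypothetical, simplified record; not the crate's `Event` type.
struct Record {
    /// Name after normalisation (assumed to be done by the caller).
    normalised_name: Option<String>,
    /// Start instant as Unix seconds, already parsed from ISO 8601.
    start_unix_seconds: Option<i64>,
    /// External identifiers as (scheme, value) pairs.
    ids: Vec<(String, String)>,
}

/// True when the records share any external ID, or when both the
/// normalised name and the start instant are present and equal.
fn deterministic_match(a: &Record, b: &Record) -> bool {
    let shared_id = a.ids.iter().any(|id| b.ids.contains(id));
    let same_name = matches!(
        (&a.normalised_name, &b.normalised_name),
        (Some(x), Some(y)) if x == y
    );
    let same_start = matches!(
        (a.start_unix_seconds, b.start_unix_seconds),
        (Some(x), Some(y)) if x == y
    );
    shared_id || (same_name && same_start)
}
```

Comparing parsed instants rather than the original strings is what makes the rule tolerant of different but equivalent offsets.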
The Event model
Field names use Rust conventions but the semantics match the schema.org/Event properties one for one.
| Field | Type | Schema.org property |
|---|---|---|
| name | Option<String> | schema:name |
| alternate_names | Vec<String> | schema:alternateName |
| description | Option<String> | schema:description |
| url | Option<String> | schema:url |
| event_ids | Vec<EventId> | schema:identifier |
| category | Option<EventCategory> | direct subtype of schema:Event |
| keywords | Vec<String> | schema:keywords |
| in_language | Option<String> | schema:inLanguage (BCP 47) |
| typical_age_range | Option<String> | schema:typicalAgeRange |
| start_date | Option<String> | schema:startDate (ISO 8601) |
| end_date | Option<String> | schema:endDate (ISO 8601) |
| door_time | Option<String> | schema:doorTime |
| previous_start_date | Option<String> | schema:previousStartDate |
| event_status | Option<EventStatus> | schema:eventStatus |
| event_attendance_mode | Option<EventAttendanceMode> | schema:eventAttendanceMode |
| location | Option<Location> | schema:location (venue + address + geo + virtual URL) |
| country_code_as_iso_3166_1_alpha_2 | Option<String> | (convenience; pairs with Location.address.country) |
| organizer | Option<String> | schema:organizer |
| performers | Vec<String> | schema:performer |
| maximum_attendee_capacity | Option<u32> | schema:maximumAttendeeCapacity |
| maximum_physical_attendee_capacity | Option<u32> | schema:maximumPhysicalAttendeeCapacity |
| maximum_virtual_attendee_capacity | Option<u32> | schema:maximumVirtualAttendeeCapacity |
| is_accessible_for_free | Option<bool> | schema:isAccessibleForFree |
| super_event_id | Option<String> | schema:superEvent (id only) |
| local_id | Option<String> | (originating-system id; not scored) |
EventCategory enumerates the direct schema.org/Event subtypes — BusinessEvent, ChildrensEvent, ComedyEvent, ConferenceEvent, CourseInstance, DanceEvent, DeliveryEvent, EducationEvent, EventSeries, ExhibitionEvent, Festival, FoodEvent, Hackathon, LiteraryEvent, MusicEvent, PerformingArtsEvent, PublicationEvent, SaleEvent, ScreeningEvent, SocialEvent, SportsEvent, TheaterEvent, VisualArtsEvent, and a catch-all Other(String).
EventStatus mirrors schema:EventStatusType (EventScheduled, EventCancelled, EventPostponed, EventRescheduled, EventMovedOnline); EventAttendanceMode mirrors schema:EventAttendanceModeEnumeration (OfflineEventAttendanceMode, OnlineEventAttendanceMode, MixedEventAttendanceMode).
Build records via the fluent [Event::builder]; all setters accept impl Into<String>.
The match pipeline
- Each scoring component yields Some(score) in [0.0, 1.0] or None (missing on at least one side).
- The weighted sum runs over the components that scored.
- The weighted sum is divided by the sum of the participating weights; this renormalisation ensures that missing fields do not penalise the score.
- A phonetic-name match (Soundex on normalised names) adds a 0.05-weighted bonus when the gating phonetic score exceeds 0.9, and only when use_phonetic_matching is on. The bonus never lowers a score.
- is_match = score >= match_threshold (strict mode additionally requires deterministic_match).
- confidence = Confidence::from_score(score); the bands are fixed (>= 0.90 High, >= 0.75 Medium, else Low) and independent of match_threshold.
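The renormalisation step can be sketched on its own. This is an illustrative sketch under stated assumptions, not the crate's code: `renormalised_score` is a hypothetical helper, it takes plain `(weight, score)` pairs instead of a `MatchBreakdown`, it leaves out the phonetic bonus, and returning `None` when nothing scored is an assumption.

```rust
/// Weighted mean over the components that produced a score.
/// Each item is (weight, score); a `None` score is skipped together with
/// its weight, so missing fields neither raise nor lower the result.
/// Returns `None` when no component scored (an assumption of this sketch).
fn renormalised_score(components: &[(f64, Option<f64>)]) -> Option<f64> {
    let mut weighted_sum = 0.0;
    let mut weight_total = 0.0;
    for &(weight, score) in components {
        if let Some(s) = score {
            weighted_sum += weight * s;
            weight_total += weight;
        }
    }
    if weight_total > 0.0 {
        // Divide by the participating weights only, then clamp defensively.
        Some((weighted_sum / weight_total).clamp(0.0, 1.0))
    } else {
        None
    }
}
```

For example, with a name score of 0.9 at weight 0.20, a start-date score of 1.0 at weight 0.25, and a missing location at weight 0.15, the result is 0.43 / 0.45 ≈ 0.956; the absent location does not pull the score down.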
Default weights
| Component | Weight | Notes |
|---|---|---|
| Name | 0.20 | Best of cartesian product across primary + alternates, via the configured SimilarityAlgorithm (default Combined = 0.7 × Jaro-Winkler + 0.3 × Levenshtein). |
| Start date | 0.25 | Gaussian decay exp(-(d/s)^2) over the absolute seconds difference. Default scale s = 3600 s (one hour). |
| End date | 0.05 | Same Gaussian shape as start_date, same scale. |
| Location | 0.15 | Coordinates 0.5, address 0.3, venue name 0.15, virtual URL 0.05; weight-renormalised across populated sub-components. Coordinates use a Gaussian decay with default scale 100 m. |
| Category | 0.08 | 1.0 if equal, 0.0 if both set and differ, None if either missing. |
| Country code | 0.04 | Case-insensitive equality after trim. |
| Event IDs | 0.15 | 1.0 if any (scheme, value) pair is shared, 0.0 if both non-empty but no overlap, None if either empty. |
| Organiser | 0.04 | Combined string similarity after name normalisation. |
| Performers | 0.02 | Best-of cartesian product across performer lists. |
| URL | 0.02 | Exact equality after trim. |
| Phonetic bonus | +0.05 when gated | Bonus only; never lowers a score. |
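The category, country-code, and event-ID rows share a three-state shape: a score when both sides are populated, `None` otherwise. A minimal sketch of that shape follows. The function names are hypothetical and are not the crate's API, and treating a country-code mismatch as 0.0 and a missing code as `None` is an assumption carried over from the category row.

```rust
/// 1.0 if equal, 0.0 if both set and different, None if either is missing.
fn equality_score<T: PartialEq>(a: Option<&T>, b: Option<&T>) -> Option<f64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(if x == y { 1.0 } else { 0.0 }),
        _ => None,
    }
}

/// Country codes: case-insensitive equality after trimming.
fn country_code_score(a: Option<&str>, b: Option<&str>) -> Option<f64> {
    match (a, b) {
        (Some(x), Some(y)) => {
            Some(if x.trim().eq_ignore_ascii_case(y.trim()) { 1.0 } else { 0.0 })
        }
        _ => None,
    }
}

/// Event IDs: 1.0 on any shared (scheme, value) pair, 0.0 when both lists
/// are non-empty without overlap, None when either list is empty.
fn event_ids_score<'a>(
    a: &[(&'a str, &'a str)],
    b: &[(&'a str, &'a str)],
) -> Option<f64> {
    if a.is_empty() || b.is_empty() {
        return None;
    }
    let shared = a.iter().any(|id| b.contains(id));
    Some(if shared { 1.0 } else { 0.0 })
}
```

Returning `None` rather than 0.0 for a missing side is what lets the renormalised sum ignore the component instead of penalising it.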
Configuration presets
use ;
let strict = new; // threshold 0.95, requires deterministic
let default = default_config; // threshold 0.80
let lenient = new; // threshold 0.65, phonetic on
- Default (0.80): balanced for everyday de-duplication.
- Strict (0.95): for downstream systems that must rely on the answer; is_match additionally requires deterministic_match. score and confidence are unaffected.
- Lenient (0.65): for triaging large candidate sets where false negatives are costlier than false positives.
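How the presets interact with the fixed confidence bands can be sketched as follows. This is an illustration only: the `Preset` struct and its `require_deterministic` field are hypothetical, and the `Confidence` enum here is a local stand-in that mirrors the documented bands rather than the crate's own type.

```rust
/// Local stand-in for the documented confidence bands.
#[derive(Debug, PartialEq)]
enum Confidence {
    High,
    Medium,
    Low,
}

impl Confidence {
    /// Fixed bands: >= 0.90 High, >= 0.75 Medium, else Low.
    fn from_score(score: f64) -> Self {
        if score >= 0.90 {
            Confidence::High
        } else if score >= 0.75 {
            Confidence::Medium
        } else {
            Confidence::Low
        }
    }
}

/// Hypothetical preset holder; field names are assumptions of this sketch.
struct Preset {
    match_threshold: f64,
    require_deterministic: bool,
}

/// The score must reach the threshold; strict mode also needs a
/// deterministic match.
fn is_match(score: f64, deterministic: bool, preset: &Preset) -> bool {
    score >= preset.match_threshold && (!preset.require_deterministic || deterministic)
}
```

Under these assumptions a score of 0.92 without a deterministic match is a match under the default preset and has High confidence, yet is not a match under the strict preset: the threshold and the confidence band are evaluated independently.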
Every field of MatchConfig is overridable; the config is Serialize + Deserialize (with #[serde(default)]) so tunings can live in a file:
use MatchConfig;
let cfg: MatchConfig = from_str.unwrap;
Batch scoring
use ;
let engine = default_config;
let query = builder.name.build;
let candidates = vec!;
// Parallel to the input slice:
let results = engine.match_one_to_many;
// Sorted by descending score, deterministic tiebreak on original index:
let ranked = engine.rank_one_to_many;
let = &ranked;
println!;
The engine is Send + Sync, so it can be called from Rayon's par_iter() (or any other parallelism primitive) without changes to this crate. Candidate pre-filtering (Soundex prefix blocking, country-code blocking, year-bucket blocking) is intentionally a consumer concern.
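The ranking order described for rank_one_to_many (descending score, ties broken by original index) can be sketched over bare scores. `rank_scores` is a hypothetical helper, not the crate's API, and breaking ties in ascending index order is an assumption.

```rust
/// Returns (original_index, score) pairs sorted by descending score,
/// breaking ties by ascending original index so the order is reproducible.
fn rank_scores(scores: &[f64]) -> Vec<(usize, f64)> {
    let mut ranked: Vec<(usize, f64)> = scores.iter().copied().enumerate().collect();
    // `total_cmp` gives a total order over f64, so the comparator never panics.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked
}
```

Keeping the original index in each pair lets the caller map a ranked entry back to its position in the candidate slice.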
Temporal and geographic primitives
Scorer exposes the helpers the engine uses internally:
use Scorer;
// Temporal proximity.
let secs = seconds_between.unwrap;
let t = start_date_score;
println!;
// Geographic proximity (used inside `location` matching).
let d = haversine_metres;
let s = coordinates_score;
println!;
Scorer::seconds_between parses both inputs via Normalizer::parse_iso8601_unix_seconds (supporting YYYY-MM-DD, …T…Z, …±HH:MM, fractional seconds) and returns None on any unparseable input. The Gaussian scorers return exp(-(d/s)^2) clamped to [0.0, 1.0]; pathological inputs (negative distance, non-positive scale, non-finite) return 0.0.
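The Gaussian decay and the great-circle distance can each be sketched in a few lines. These are standalone illustrations, not the crate's Scorer methods: the function names are hypothetical, and the Earth radius constant is an assumption of the sketch.

```rust
/// Gaussian decay exp(-(d/s)^2), clamped to [0.0, 1.0].
/// Negative distance, non-positive scale, and non-finite inputs return 0.0.
fn gaussian_decay(distance: f64, scale: f64) -> f64 {
    if !distance.is_finite() || !scale.is_finite() || distance < 0.0 || scale <= 0.0 {
        return 0.0;
    }
    let ratio = distance / scale;
    (-(ratio * ratio)).exp().clamp(0.0, 1.0)
}

/// Great-circle (Haversine) distance in metres between two points given in
/// decimal degrees, assuming a spherical Earth of radius 6 371 008.8 m.
fn great_circle_metres(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    const EARTH_RADIUS_M: f64 = 6_371_008.8; // assumed mean radius
    let (phi1, phi2) = (lat1.to_radians(), lat2.to_radians());
    let d_phi = (lat2 - lat1).to_radians();
    let d_lambda = (lon2 - lon1).to_radians();
    let a = (d_phi / 2.0).sin().powi(2)
        + phi1.cos() * phi2.cos() * (d_lambda / 2.0).sin().powi(2);
    // `min(1.0)` guards against rounding pushing the argument past asin's domain.
    2.0 * EARTH_RADIUS_M * a.sqrt().min(1.0).asin()
}
```

With the default start-date scale of 3600 s, identical instants score 1.0, a 30-minute difference scores exp(-0.25) ≈ 0.78, and a one-hour difference scores exp(-1) ≈ 0.37.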
Determinism and safety
- #![forbid(unsafe_code)] at the crate root.
- No IO. The library does not read files, open sockets, or log.
- No clocks, no RNGs, no environment variables. Same inputs always produce the same outputs.
- No panics in library code; every fallible parser returns None and every fallible operation returns Result.
- Send + Sync. Engines are immutable after construction and cheap to clone.
- Serde-clean. Every public data type round-trips through serde_json (and any other serde format).
Limitations / out of scope
- Not a calendar engine. This crate does not produce recurrence expansions for schema:eventSchedule / schema:Schedule. Feed concrete instances in if you need to compare them.
- Not a geocoder or router. Distances are great-circle (Haversine) only; addresses are not resolved.
- No machine learning. Scoring is rule-based and transparent; weights are tuneable but the algorithm is fixed.
- No persistence layer. The crate scores pairs in memory; storage and indexing belong upstream.
- English-only abbreviation table for street types (St, Rd, Ave, …). Locale-aware vocabularies are an Open Question.
License
MIT OR Apache-2.0 OR GPL-2.0 OR GPL-3.0 OR BSD-3-Clause — see LICENSE.md.
Contributing
Contributions welcome. Before opening a PR:
cargo fmt
cargo clippy --all-targets -- -D warnings
cargo test
See AGENTS.md for the working guide and spec.md for the authoritative behaviour spec.
Contact
Joel Parker Henderson — joel@joelparkerhenderson.com.