event-matcher 0.6.0

Deterministic and probabilistic matching of schema.org/Event records (ISO 8601 date-times, external event IDs, locations, categories) with explainable per-field score breakdowns.

Event matcher Rust Crate

A Rust library for deciding whether two records describe the same event, modelled on schema.org/Event.

Documentation index: index.md is the top-level map of every doc in this repo (spec, AGENTS guides, CHANGELOG, examples). Start there if you're new.

What it does

event-matcher compares pairs of [Event] records (festivals, conferences, concerts, sports fixtures, screenings, hackathons, meetups, theatre runs) and tells you whether they refer to the same event. It is built for de-duplication and record linkage across event-data sources that disagree on titles, formatting, schedules, venues, and identifier schemes.

The crate provides two strategies behind one engine:

  • Deterministic — a hard bool from a shared external event ID (Eventbrite, Meetup, Ticketmaster, Songkick, Wikidata, …) or from an identical normalised name plus an identical start_date instant.
  • Probabilistic — a weight-renormalised score in [0.0, 1.0] over name, start/end date (Gaussian decay over absolute seconds difference), location (venue, address, coordinates), category, country code, external IDs, organiser, performers, and URL, with a per-field [MatchBreakdown] so every decision is auditable.

The library is pure: no IO, no clocks, no RNGs. It declares #![forbid(unsafe_code)] and its engine is Send + Sync, so it is suitable as a leaf dependency under web servers, batch jobs, or notebooks.

Installation

[dependencies]
event-matcher = "0.6"

Quick start — probabilistic match

use event_matcher::{MatchingEngine, Event};

let a = Event::builder()
    .name("Glastonbury Festival 2024")
    .start_date("2024-06-26T09:00:00Z")
    .end_date("2024-06-30T23:59:00Z")
    .build();

let b = Event::builder()
    .name("Glasto 2024")
    .add_alternate_name("Glastonbury Festival 2024")
    .start_date("2024-06-26T09:15:00Z")
    .end_date("2024-06-30T23:59:00Z")
    .build();

let engine = MatchingEngine::default_config();
let result = engine.match_events(&a, &b);

assert!(result.is_match);
println!("score      = {:.2}", result.score);
println!("name       = {:?}", result.breakdown.name_score);
println!("start_date = {:?}", result.breakdown.start_date_score);

The MatchBreakdown carries one Option<f64> per scored field; a None means the field was absent on at least one side and so did not contribute. Missing fields neither raise nor lower the score.

Quick start — deterministic match

A shared external event ID (any (scheme, value) pair across the two records) is enough on its own:

use event_matcher::{MatchingEngine, Event, EventId, EventIdScheme};

let id = EventId::new(EventIdScheme::Eventbrite, "123456789").unwrap();

let a = Event::builder().name("RustConf 2024").add_event_id(id.clone()).build();
let b = Event::builder().name("RC 2024").add_event_id(id).build();

let engine = MatchingEngine::default_config();
assert!(engine.deterministic_match(&a, &b));

A deterministic match is also accepted when both records have an identical normalised name and start_date values that parse to the same instant. This is useful when no shared external ID is available, and it naturally tolerates equivalent ISO 8601 offsets such as 2024-09-10T09:00:00Z and 2024-09-10T11:00:00+02:00.
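The two deterministic rules can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: Rec, normalise, parse_instant and days_from_civil are hypothetical names, the normalisation (lowercase plus whitespace collapsing) is an assumption, and the parser only handles the fixed form YYYY-MM-DDTHH:MM:SS followed by Z or ±HH:MM, without range checks.

```rust
/// Hypothetical, simplified record; not the crate's `Event` type.
struct Rec<'a> {
    name: &'a str,
    start_date: Option<&'a str>,
    ids: Vec<(&'a str, &'a str)>, // (scheme, value) pairs
}

/// Days since 1970-01-01 in the proleptic Gregorian calendar.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0, so the leap day falls at year end
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse an all-digit field at byte range a..b.
fn digits(s: &str, a: usize, b: usize) -> Option<i64> {
    let part = s.get(a..b)?;
    if part.bytes().all(|c| c.is_ascii_digit()) { part.parse().ok() } else { None }
}

/// Unix seconds for `YYYY-MM-DDTHH:MM:SS` + `Z` or `±HH:MM` (assumed subset).
fn parse_instant(s: &str) -> Option<i64> {
    let b = s.as_bytes();
    if b.len() < 20 || b[4] != b'-' || b[7] != b'-' || b[10] != b'T'
        || b[13] != b':' || b[16] != b':'
    {
        return None;
    }
    let (y, mo, d) = (digits(s, 0, 4)?, digits(s, 5, 7)?, digits(s, 8, 10)?);
    let (h, mi, sec) = (digits(s, 11, 13)?, digits(s, 14, 16)?, digits(s, 17, 19)?);
    let offset = match &b[19..] {
        [b'Z'] => 0,
        [sign @ (b'+' | b'-'), _, _, b':', _, _] => {
            let secs = digits(s, 20, 22)? * 3600 + digits(s, 23, 25)? * 60;
            if *sign == b'+' { secs } else { -secs }
        }
        _ => return None,
    };
    // Local time minus its UTC offset gives the UTC instant.
    Some(days_from_civil(y, mo, d) * 86_400 + h * 3600 + mi * 60 + sec - offset)
}

/// Assumed normalisation: collapse whitespace and lowercase.
fn normalise(name: &str) -> String {
    name.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}

/// True on a shared (scheme, value) pair, or on same name plus same instant.
fn deterministic_match<'a>(a: &Rec<'a>, b: &Rec<'a>) -> bool {
    let shared_id = a.ids.iter().any(|id| b.ids.contains(id));
    let same_name = !a.name.trim().is_empty() && normalise(a.name) == normalise(b.name);
    let same_start = match (
        a.start_date.and_then(parse_instant),
        b.start_date.and_then(parse_instant),
    ) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    };
    shared_id || (same_name && same_start)
}
```

Comparing parsed instants rather than raw strings is what makes the two offset spellings above compare equal.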

The Event model

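Field names use Rust conventions, and the semantics follow the corresponding schema.org/Event properties. Two fields (country_code_as_iso_3166_1_alpha_2 and local_id) are conveniences with no schema.org counterpart.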

| Field | Type | Schema.org property |
| --- | --- | --- |
| name | Option<String> | schema:name |
| alternate_names | Vec<String> | schema:alternateName |
| description | Option<String> | schema:description |
| url | Option<String> | schema:url |
| event_ids | Vec<EventId> | schema:identifier |
| category | Option<EventCategory> | direct subtype of schema:Event |
| keywords | Vec<String> | schema:keywords |
| in_language | Option<String> | schema:inLanguage (BCP 47) |
| typical_age_range | Option<String> | schema:typicalAgeRange |
| start_date | Option<String> | schema:startDate (ISO 8601) |
| end_date | Option<String> | schema:endDate (ISO 8601) |
| door_time | Option<String> | schema:doorTime |
| previous_start_date | Option<String> | schema:previousStartDate |
| event_status | Option<EventStatus> | schema:eventStatus |
| event_attendance_mode | Option<EventAttendanceMode> | schema:eventAttendanceMode |
| location | Option<Location> | schema:location (venue + address + geo + virtual URL) |
| country_code_as_iso_3166_1_alpha_2 | Option<String> | (convenience; pairs with Location.address.country) |
| organizer | Option<String> | schema:organizer |
| performers | Vec<String> | schema:performer |
| maximum_attendee_capacity | Option<u32> | schema:maximumAttendeeCapacity |
| maximum_physical_attendee_capacity | Option<u32> | schema:maximumPhysicalAttendeeCapacity |
| maximum_virtual_attendee_capacity | Option<u32> | schema:maximumVirtualAttendeeCapacity |
| is_accessible_for_free | Option<bool> | schema:isAccessibleForFree |
| super_event_id | Option<String> | schema:superEvent (id only) |
| local_id | Option<String> | (originating-system id; not scored) |

EventCategory enumerates the direct schema.org/Event subtypes — BusinessEvent, ChildrensEvent, ComedyEvent, ConferenceEvent, CourseInstance, DanceEvent, DeliveryEvent, EducationEvent, EventSeries, ExhibitionEvent, Festival, FoodEvent, Hackathon, LiteraryEvent, MusicEvent, PerformingArtsEvent, PublicationEvent, SaleEvent, ScreeningEvent, SocialEvent, SportsEvent, TheaterEvent, VisualArtsEvent, and a catch-all Other(String).

EventStatus mirrors schema:EventStatusType (EventScheduled, EventCancelled, EventPostponed, EventRescheduled, EventMovedOnline); EventAttendanceMode mirrors schema:EventAttendanceModeEnumeration (OfflineEventAttendanceMode, OnlineEventAttendanceMode, MixedEventAttendanceMode).

Build records via the fluent [Event::builder]; all string-valued setters accept impl Into<String>.

The match pipeline

  1. Each scoring component yields Some(score) in [0.0, 1.0] or None (missing on at least one side).
  2. The weighted sum runs over components that scored.
  3. The weighted sum is divided by the sum of the participating weights. This renormalisation ensures missing fields do not penalise the score.
  4. When use_phonetic_matching is on, a phonetic-name match (Soundex on normalised names) adds a 0.05-weighted bonus if the gating phonetic score exceeds 0.9. The bonus never lowers a score.
  5. is_match = score >= match_threshold (strict mode additionally requires deterministic_match).
  6. confidence = Confidence::from_score(score) — bands are fixed (>= 0.90 High, >= 0.75 Medium, else Low) and independent of match_threshold.
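Steps 1 to 3 and step 6 can be sketched as follows. This is a standalone illustration under stated assumptions, not the crate's code: renormalised_score and confidence are hypothetical function names, the Confidence enum here is a local stand-in, returning None when no component scored is an assumption, and the phonetic bonus is left out because its exact formula is not given above.

```rust
/// Local stand-in for the confidence bands described in step 6.
#[derive(Debug, PartialEq)]
enum Confidence {
    High,
    Medium,
    Low,
}

/// Weighted mean over the components that produced a score.
/// Each entry is (weight, score); components with `None` are skipped,
/// so their weight drops out of both numerator and denominator.
fn renormalised_score(components: &[(f64, Option<f64>)]) -> Option<f64> {
    let (mut weighted, mut total) = (0.0, 0.0);
    for (w, s) in components {
        if let Some(s) = s {
            weighted += w * s;
            total += w;
        }
    }
    // Assumption: with nothing to score there is no meaningful result.
    if total > 0.0 { Some(weighted / total) } else { None }
}

/// Fixed bands, independent of any match threshold.
fn confidence(score: f64) -> Confidence {
    if score >= 0.90 {
        Confidence::High
    } else if score >= 0.75 {
        Confidence::Medium
    } else {
        Confidence::Low
    }
}
```

For example, a name score of 1.0 at weight 0.20 and a start-date score of 0.5 at weight 0.25, with location missing, give 0.325 / 0.45 ≈ 0.72, which falls in the Low band. The missing location neither helps nor hurts.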

Default weights

| Component | Weight | Notes |
| --- | --- | --- |
| Name | 0.20 | Best of cartesian product across primary + alternates, via the configured SimilarityAlgorithm (default Combined = 0.7 × Jaro-Winkler + 0.3 × Levenshtein). |
| Start date | 0.25 | Gaussian decay exp(-(d/s)^2) over the absolute seconds difference. Default scale s = 3600 s (one hour). |
| End date | 0.05 | Same Gaussian shape as start_date, same scale. |
| Location | 0.15 | Coordinates 0.5, address 0.3, venue name 0.15, virtual URL 0.05; weight-renormalised across populated sub-components. Coordinates use a Gaussian decay with default scale 100 m. |
| Category | 0.08 | 1.0 if equal, 0.0 if both set and differ, None if either missing. |
| Country code | 0.04 | Case-insensitive equality after trim. |
| Event IDs | 0.15 | 1.0 if any (scheme, value) pair is shared, 0.0 if both non-empty but no overlap, None if either empty. |
| Organiser | 0.04 | Combined string similarity after name normalisation. |
| Performers | 0.02 | Best-of cartesian product across performer lists. |
| URL | 0.02 | Exact equality after trim. |
| Phonetic bonus | +0.05 when gated | Bonus only; never lowers a score. |
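The three-state behaviour of the equality-style components (1.0, 0.0, or None) can be sketched as below. The function names and the plain &str representation are hypothetical and chosen for illustration only; treating a missing country code as None is an assumption carried over from the category and event-ID rows.

```rust
/// 1.0 if equal, 0.0 if both set and different, None if either is missing.
fn category_score(a: Option<&str>, b: Option<&str>) -> Option<f64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(if x == y { 1.0 } else { 0.0 }),
        _ => None,
    }
}

/// 1.0 if any (scheme, value) pair is shared, 0.0 if both lists are
/// non-empty without overlap, None if either list is empty.
fn event_id_score<'a>(a: &[(&'a str, &'a str)], b: &[(&'a str, &'a str)]) -> Option<f64> {
    if a.is_empty() || b.is_empty() {
        return None;
    }
    let shared = a.iter().any(|id| b.contains(id));
    Some(if shared { 1.0 } else { 0.0 })
}

/// Case-insensitive equality after trimming surrounding whitespace.
/// Assumption: a missing code on either side yields None.
fn country_code_score(a: Option<&str>, b: Option<&str>) -> Option<f64> {
    let (x, y) = (a?.trim(), b?.trim());
    Some(if x.eq_ignore_ascii_case(y) { 1.0 } else { 0.0 })
}
```

Returning None rather than 0.0 for missing data is what lets the renormalisation step drop the component instead of counting it as a disagreement.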

Configuration presets

use event_matcher::{MatchConfig, MatchingEngine};

let strict  = MatchingEngine::new(MatchConfig::strict());  // threshold 0.95, requires deterministic
let default = MatchingEngine::default_config();            // threshold 0.80
let lenient = MatchingEngine::new(MatchConfig::lenient()); // threshold 0.65, phonetic on

  • Default (0.80) — balanced for everyday de-duplication.
  • Strict (0.95) — for downstream systems that must rely on the answer; is_match additionally requires deterministic_match. score and confidence are unaffected.
  • Lenient (0.65) — for triaging large candidate sets where false negatives are costlier than false positives.
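How the same score fares under the three presets can be sketched as a small decision function. Preset, its fields, the three constants and is_match are hypothetical names for illustration; only the thresholds and the strict-mode rule come from the list above.

```rust
/// Hypothetical reduction of a preset to the two values that affect `is_match`.
#[derive(Clone, Copy)]
struct Preset {
    threshold: f64,
    require_deterministic: bool,
}

const STRICT: Preset = Preset { threshold: 0.95, require_deterministic: true };
const DEFAULT: Preset = Preset { threshold: 0.80, require_deterministic: false };
const LENIENT: Preset = Preset { threshold: 0.65, require_deterministic: false };

/// The score must reach the threshold; strict mode also needs a deterministic hit.
fn is_match(p: Preset, score: f64, deterministic: bool) -> bool {
    score >= p.threshold && (!p.require_deterministic || deterministic)
}
```

A pair scoring 0.97 without a deterministic hit would match under DEFAULT and LENIENT but not under STRICT, while a pair scoring 0.70 would match only under LENIENT.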

Every field of MatchConfig is overridable; the config is Serialize + Deserialize (with #[serde(default)]) so tunings can live in a file:

use event_matcher::MatchConfig;

let cfg: MatchConfig = serde_json::from_str(r#"{
    "match_threshold": 0.85,
    "start_date_scale_seconds": 600.0
}"#).unwrap();

Batch scoring

use event_matcher::{MatchingEngine, Event};

let engine = MatchingEngine::default_config();
let query = Event::builder().name("RustConf 2024").build();
let candidates = vec![
    Event::builder().name("PyConf 2024").build(),
    Event::builder().name("RustConf 2024").build(),
    Event::builder().name("GoConf 2024").build(),
];

// Parallel to the input slice:
let results = engine.match_one_to_many(&query, &candidates);

// Sorted by descending score, deterministic tiebreak on original index:
let ranked = engine.rank_one_to_many(&query, &candidates);
let (best_idx, best) = &ranked[0];
println!("best is candidate[{best_idx}] with score {:.2}", best.score);

The engine is Send + Sync. Call it from rayon's par_iter() (or any other parallelism primitive) without changes to this crate. Candidate pre-filtering (Soundex prefix blocking, country-code blocking, year-bucket blocking) is intentionally a consumer concern.
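As one way a consumer might pre-filter, the sketch below combines country-code and year-bucket blocking using only the standard library. block_key and build_blocks are hypothetical helpers, candidates are reduced to (country, start_date) string pairs for brevity, and records whose date does not begin with a four-digit year are simply left out of every block, which is a simplification.

```rust
use std::collections::HashMap;

/// Blocking key: upper-cased country code plus the four-digit year prefix.
fn block_key(country: &str, start_date: &str) -> Option<(String, String)> {
    let year = start_date.get(0..4)?;
    if !year.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some((country.trim().to_ascii_uppercase(), year.to_string()))
}

/// Group candidate indices by block so that only same-block pairs get scored.
fn build_blocks(candidates: &[(&str, &str)]) -> HashMap<(String, String), Vec<usize>> {
    let mut blocks: HashMap<(String, String), Vec<usize>> = HashMap::new();
    for (i, (country, date)) in candidates.iter().enumerate() {
        if let Some(key) = block_key(country, date) {
            blocks.entry(key).or_default().push(i);
        }
    }
    blocks
}
```

A query event would then be scored only against the indices stored under its own key. The trade-off is recall: a true duplicate with a wrong country code or year lands in a different block and is never compared.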

Temporal and geographic primitives

Scorer exposes the helpers the engine uses internally:

use event_matcher::Scorer;

// Temporal proximity.
let secs = Scorer::seconds_between("2024-06-26T09:00:00Z", "2024-06-26T10:30:00Z").unwrap();
let t    = Scorer::start_date_score(secs as f64, 3600.0);
println!("90-minute gap @ scale=1 h: {t:.6}");

// Geographic proximity (used inside `location` matching).
let d = Scorer::haversine_metres(51.507_22, -0.127_5, 48.853_0, 2.349_2);
let s = Scorer::coordinates_score(d, 100.0);
println!("London-Paris: {:.1} km, score @ scale=100 m: {s:.6}", d / 1000.0);

Scorer::seconds_between parses both inputs via Normalizer::parse_iso8601_unix_seconds (supporting YYYY-MM-DD, …T…Z, …±HH:MM, fractional seconds) and returns None on any unparseable input. The Gaussian scorers return exp(-(d/s)^2) clamped to [0.0, 1.0]; pathological inputs (negative distance, non-positive scale, non-finite) return 0.0.
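The Gaussian shape and the great-circle distance can be reproduced with the standard library. This is a minimal sketch, not the crate's source: gaussian_score and this haversine_metres are standalone functions, and the Earth radius constant is an assumption, since the value the crate uses is not stated here.

```rust
/// exp(-(d/s)^2), with 0.0 for negative distance, non-positive scale,
/// or any non-finite input.
fn gaussian_score(distance: f64, scale: f64) -> f64 {
    if !distance.is_finite() || !scale.is_finite() || distance < 0.0 || scale <= 0.0 {
        return 0.0;
    }
    (-(distance / scale).powi(2)).exp().clamp(0.0, 1.0)
}

/// Mean Earth radius in metres (assumed value).
const EARTH_RADIUS_M: f64 = 6_371_000.0;

/// Great-circle distance in metres between two (lat, lon) points in degrees.
fn haversine_metres(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    let (p1, p2) = (lat1.to_radians(), lat2.to_radians());
    let (dp, dl) = ((lat2 - lat1).to_radians(), (lon2 - lon1).to_radians());
    let a = (dp / 2.0).sin().powi(2) + p1.cos() * p2.cos() * (dl / 2.0).sin().powi(2);
    // Guard against rounding pushing sqrt(a) slightly above 1.0.
    2.0 * EARTH_RADIUS_M * a.sqrt().min(1.0).asin()
}
```

With a one-hour scale, a 90-minute gap gives exp(-1.5^2) = exp(-2.25) ≈ 0.105, so the score falls off quickly once the gap exceeds the scale. At a 100 m scale, a London to Paris distance of a few hundred kilometres underflows to 0.0.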

Determinism and safety

  • #![forbid(unsafe_code)] at the crate root.
  • No IO. The library does not read files, open sockets, or log.
  • No clocks, no RNGs, no environment variables. Same inputs always produce the same outputs.
  • No panics in library code; every fallible parser returns None and every fallible operation returns Result.
  • Send + Sync. Engines are immutable after construction and cheap to clone.
  • Serde-clean. Every public data type round-trips through serde_json (and any other serde format).

Limitations / out of scope

  • Not a calendar engine. This crate does not produce recurrence expansions for schema:eventSchedule / schema:Schedule. Feed concrete instances in if you need to compare them.
  • Not a geocoder or router. Distances are great-circle (Haversine) only; addresses are not resolved.
  • No machine learning. Scoring is rule-based and transparent; weights are tuneable but the algorithm is fixed.
  • No persistence layer. The crate scores pairs in memory; storage and indexing belong upstream.
  • English-only abbreviation table for street types (St, Rd, Ave, …). Locale-aware vocabularies are an Open Question.

License

MIT OR Apache-2.0 OR GPL-2.0 OR GPL-3.0 OR BSD-3-Clause — see LICENSE.md.

Contributing

Contributions welcome. Before opening a PR:

  • cargo fmt
  • cargo clippy --all-targets -- -D warnings
  • cargo test

See AGENTS.md for the working guide and spec.md for the authoritative behaviour spec.

Contact

Joel Parker Henderson — joel@joelparkerhenderson.com.