ontosim
Rust library for computing structural and semantic similarity between ontology trees.
Implements the algorithm from A mapping-based tree similarity algorithm and its application to ontology alignment (Zhu et al., 2013).
Overview
Consider two similar but distinct ontology trees; in this specific instance lets consider two simple ontology trees that both hierarchically model concepts having to do with Travel. The nodes in each tree represent concepts in the Travel domain, and the arcs encode semantic meaning for how each concept in a tree nests within a parent concept. The semantic meaning of any individual node is not solely the label/concept it represents individually. The semantic meaning of all ancestor nodes up to the root as well are logically incorporated too.
Given the following two trees, how might we deterministically find the optimal node-pair similarity mapping between them that also takes into account the hierarchical semantic meaning encoded into each tree?
graph TD
%% Left Tree: Travel
subgraph Travel_System [Ontology::Travel]
direction TB
travel1[travel] --- traffic
travel1 --- sights
travel1 --- visitor
traffic --- ship
traffic --- train
traffic --- land
land --- bus1[bus]
end
%% Right Tree: Tour
subgraph Tour_System [Ontology::Tour]
direction TB
tour --- transport
tour --- tourist
transport --- road
tourist --- business
road --- bus2[bus]
road --- lightbus[light bus]
end
%% Cross-Tree Connections (Dotted & Different Color)
travel1 -.-|similar to| tour
visitor -.- tourist
traffic -.- transport
land -.- road
bus1 -.- bus2
%% Styling to make them visually distinct
linkStyle 13,14,15,16,17 stroke:#ff6600,stroke-width:2px,stroke-dasharray: 5 5;
In the diagram, after executing the algorithm the dashed red arcs represent the optimal set of node pair mappings for these two trees.
An example of a maximally similar node label pair is bus in the left hand tree paired with bus in the right hand tree rather than with light bus (also semantically similar, but less so). Many nodes are unpaired because they lack a meaningfully similar node to pair with in the opposite tree. The label-to-label similarity calculation is best done using text embeddings to measure the semantic similarity of a left hand label with a right hand label; this is far more effective than basic string comparisons or edit distance approaches.
Additionally, this algorithm deterministially discovers the optimal set of similar node pairings in such a way that it also yields the maximally aligned overlap of subtrees and subforests across the two input trees.
To illustrate what this means, in the diagram above the subtree rooted at traffic in the left hand tree and the subtree rooted at transport in the right hand tree have root nodes that are similar. Their subtrees also contain two descendent node-pair matches (land-road and bus-bus). The similarity score for these subtree root nodes starts as their label-to-label similarity score, and is then boosted by the descendent matches within their subtrees (recursively). If there was some other node in the right hand tree that traffic in the left hand tree was similar to, even if their label-to-label similarity score was higher, if there were no matching descendents and this higher score was lower that the boosted match of traffic with transport due to their subtrees it would be ignored. In this way, the algorithm selects the node pair matches that maximize overall semantic tree structure alignment.
Quick start
use ;
use ExactMatching;
let t1: Tree = "{travel{traffic{ship}{train}{land{bus}}}{visitor}{sights}}".parse.unwrap;
let t2: Tree = "{tour{transport{road{bus}{light bus}}}{tourist{business}}}".parse.unwrap;
let result = compute;
println!;
How it works
- Each tree is decomposed into subtrees indexed in postorder.
- Pairwise node similarity is computed via a pluggable
Matchingtrait (label equality, cosine similarity of embeddings, or custom). - Optimal child-pairing at each level is solved with the Hungarian method (Kuhn-Munkres).
- A bottom-up dynamic-programming pass produces the overall similarity score and node-level mappings.
Matching strategies
| Strategy | Description |
|---|---|
ExactMatching |
1.0 for identical labels, 0.0 otherwise |
EmbeddingMatching |
Cosine similarity of pre-populated embedding vectors |
Custom impl Matching |
Any user-defined pairwise node similarity function |
Building & testing
License
MIT