ontosim

Rust library for computing structural and semantic similarity between ontology trees.

Implements the algorithm from A mapping-based tree similarity algorithm and its application to ontology alignment (Zhu et al., 2013).

Overview

Consider two similar but distinct ontology trees; in this specific instance let's consider two simple ontology trees that both hierarchically model concepts having to do with Travel. The nodes in each tree represent concepts in the Travel domain, and the arcs encode semantic meaning for how each concept in a tree nests within a parent concept. The semantic meaning of any individual node is not solely the label/concept it represents individually. The semantic meaning of all ancestor nodes up to the root as well are logically incorporated too.

Given the following two trees, how might we deterministically find the optimal node-pair similarity mapping between them that also takes into account the hierarchical semantic meaning encoded into each tree?

graph TD
    %% Left Tree: Travel
    subgraph Travel_System [Ontology::Travel]
        direction TB
        travel --- traffic
        travel --- sights
        travel --- visitor
        traffic --- ship
        traffic --- train
        traffic --- land
        land --- bus1[bus]
    end

    %% Right Tree: Tour
    subgraph Tour_System [Ontology::Tour]
        direction TB
        tour --- transport
        tour --- tourist
        transport --- road
        tourist --- business
        road --- bus2[bus]
        road --- lightbus[light bus]
    end

    %% Cross-Tree Connections (Dotted & Different Color)
    travel -.-|similar to| tour
    visitor -.- tourist
    traffic -.- transport
    land -.- road
    bus1 -.- bus2

    %% Styling to make them visually distinct
    linkStyle 13,14,15,16,17 stroke:#ff6600,stroke-width:2px,stroke-dasharray: 5 5;

In the diagram, after executing the algorithm the dashed red arcs represent the optimal set of node pair mappings for these two trees.

An example of a maximally similar node label pair is bus in the left hand tree paired with bus in the right hand tree rather than with light bus (also semantically similar, but less so). Many nodes are unpaired because they lack a meaningfully similar node to pair with in the opposite tree. The label-to-label similarity calculation is best done using text embeddings to measure the semantic similarity of a left hand label with a right hand label. Traditionally, tree comparison algorithms depend on basic string comparisons or edit distance approaches that are not effective here.

Additionally, this algorithm deterministially discovers the optimal set of similar node pairings in such a way that it also yields the maximally aligned overlap of subtrees and subforests across the two input trees.

To illustrate what this means, in the diagram above the subtree rooted at traffic in the left hand tree and the subtree rooted at transport in the right hand tree have root nodes that are similar. Their subtrees also contain two descendent node-pairs that are similar (land-road and bus-bus). The overall similarity score for these subtree root nodes starts as their label-to-label similarity score, and is then boosted by any descendent matches within their subtrees. If there was some other node in the right hand tree that traffic in the left hand tree was similar to, even if their label-to-label similarity score was higher, if there were no matching descendents and this higher score was lower than this boosted match of traffic with transport (due to their subtrees) it would be ignored.

Visually, the existing subtrees generate the following overall simiilarity score of 2.242, assuming a text vectorization Embedder impl that produces the following label-to-label scores:

lhs label	rhs label	score	overall subtree score
bus	bus	1.000
land	road	0.671
traffic	transport	0.571
			2.242

graph TD
    %% Left Tree: Travel
    subgraph Travel_System [Traffic Subtree]
        direction TB
        traffic --- ship
        traffic --- train
        traffic --- land
        land --- bus1[bus]
    end

    %% Right Tree: Tour
    subgraph Tour_System [Transport Subtree]
        direction TB
        transport --- road
        road --- bus2[bus]
        road --- lightbus[light bus]
    end

    %% Cross-Tree Connections (Dotted & Different Color)
    traffic -.-|0.571| transport
    land -.-|0.671| road
    bus1 -.-|1.000| bus2

    %% Styling to make them visually distinct
    linkStyle 7,8,9 stroke:#ff6600,stroke-width:2px,stroke-dasharray: 5 5;

And for the counter example, if traffic had a higher node-to-node match to a different rhs subtree, but no matching descendents:

lhs label	rhs label	score	overall subtree score
abc	xyz	0.000
def	rst	0.000
traffic	traffic jam	0.737
			0.737

graph TD
    %% Left Tree: Travel
    subgraph Travel_System [Traffic Subtree]
        direction TB
        traffic --- ship
        traffic --- train
        traffic --- land
        land --- bus
    end

    %% Right Tree: Tour
    subgraph Tour_System [Hypothetical Subtree]
        direction TB
        trafficjam[traffic jam] --- abc
        trafficjam --- rst
    end

    %% Cross-Tree Connections (Dotted & Different Color)
    traffic -.-|0.737| trafficjam

    %% Styling to make them visually distinct
    linkStyle 6 stroke:#ff6600,stroke-width:2px,stroke-dasharray: 5 5;

Even though this counter-example demonstrates a significantly higher node-to-node similarity score for the root nodes of these two subtrees, it will be ignored in favor of the higher overall subtree match above.

In this way, the algorithm selects the node pair matches that maximize overall semantic ontology tree structure alignment.

Quick start

use ontosim::{Tree, similarity};
use ontosim::matching::ExactMatching;

let t1: Tree = "{travel{traffic{ship}{train}{land{bus}}}{visitor}{sights}}".parse().unwrap();
let t2: Tree = "{tour{transport{road{bus}{light bus}}}{tourist{business}}}".parse().unwrap();

let result = similarity::compute(&t1, &t2, &ExactMatching);
println!("similarity: {}", result.sim);

Getting Started: Reference Implementation

While this crate provides the core algorithm implementation with zero overhead or dependencies, you can find a separate, fully-functional binary application to serve as a starting point for new projects and real-life usage.

ontosim-local

This application demonstrates how to use the Embedder trait to handle text vectorization for node label semantic similarity matching, and serves as a blueprint for high-performance ML workflows by implementing:

Rust-Native ONNX Execution: Leverages ort (ONNX Runtime) to execute the all-MiniLM-L6-v2 transformer model locally on-device with no Python or external API dependencies.
Semantic Label Embedding: Demonstrates how to map raw node labels into high-dimensional vector space.
Embedder Trait Integration: A concrete implementation of this trait that bridges the library's logic with these ONNX-generated embeddings.

To see the library in action without writing any code:

git clone https://github.com/sehod/ontosim-local
cd ontosim-local
./download_model.sh
cargo build
cargo run -- examples/canonical

How the algorithm works

Each tree is decomposed into subtrees indexed in postorder.
Pairwise node similarity is computed via a pluggable Matching trait (label equality, cosine similarity of embeddings, or custom).
Optimal child-pairing at each level is solved with the Hungarian method (Kuhn-Munkres).
A bottom-up dynamic-programming pass produces the overall similarity score and node-level mappings.

Matching strategies

Strategy	Description
`ExactMatching` (Default)	For illustration only; 1.0 for identical labels, 0.0 otherwise
`EmbeddingMatching`	Cosine similarity of pre-populated embedding vectors
Custom `impl Matching`	Any user-defined pairwise node similarity function

Building & testing

cargo build
cargo test

License

MIT

ontosim 0.2.2