Module _tutorial

Expand description

§Documentation: Tutorial

If you are already familiar with the Java API, you may skip this guide and read the (shorter) crash course instead.

§Generic graph algorithms

The central classes of this crate are SwhUnidirectionalGraph and SwhBidirectionalGraph. They both allow getting successors of a node (SwhForwardGraph trait); while only the latter allows getting its predecessors (SwhBackwardGraph trait).

You instantiate a graph from the path to the prefix used by a graph on your filesystem. This graph may be either swh/graph/example_dataset/compressed/graph from swh-graph’s source code repository, or a compressed graph downloaded from Amazon S3.

use std::path::PathBuf;

let graph = swh_graph::graph::load_full::<swh_graph::mph::DynMphf>(PathBuf::from("./graph"))
    .expect("Could not load graph");

§Example: Average outdegree

The following code can be used to compute the average outdegree of a graph, which is a useful measure of its density:

 use std::path::PathBuf;
 use swh_graph::graph::*;

let basename = PathBuf::from("../swh/graph/example_dataset/compressed/example");

let graph = swh_graph::graph::load_full::<swh_graph::mph::DynMphf>(basename)
    .expect("Could not load graph");

let average_outdegree = graph.num_arcs() as f64 / graph.num_nodes() as f64;
assert_eq!(average_outdegree, 28. / 24.);

§Example: Degree distributions

Using the outdegree() primitive, we can compute the outdegree distribution of the graph by iterating on all its nodes. The distribution will be returned as a map that associates to each degree d the number of nodes with outdegree d.

use std::collections::HashMap;
use std::path::PathBuf;

use swh_graph::graph::*;


let graph = swh_graph::graph::load_full::<swh_graph::mph::DynMphf>(basename)
    .expect("Could not load graph");

let mut distribution = HashMap::<usize, u64>::new();
for node in 0..graph.num_nodes() {
    *distribution.entry(graph.outdegree(node)).or_insert(0) += 1;
}

assert_eq!(distribution, HashMap::from([(0, 7), (1, 8), (2, 7), (3, 2)]));

§Example: Degree distributions (in parallel)

Or, in parallel using the rayon library:

use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

use std::path::PathBuf;
use swh_graph::graph::*;


let graph = swh_graph::graph::load_full::<swh_graph::mph::DynMphf>(basename)
    .expect("Could not load graph");

let distribution: HashMap<usize, u64> = (0..graph.num_nodes())
    .into_par_iter()
    .fold(
        // In each thread, start with an empty distribution
        || HashMap::new(),
        // In each thread, compute the distribution for the thread's subset of nodes
        |mut thread_distribution, node: usize| {
            *thread_distribution
                .entry(graph.outdegree(node))
                .or_insert(0) += 1;
            thread_distribution
        },
    )
    // Merge distributions computed by each thread
    .reduce(
        // identity
        || HashMap::new(),
        |mut distribution1, distribution2| {
            for (outdegree, count) in distribution2 {
                *distribution1.entry(outdegree).or_insert(0) += count
            }
            distribution1
        }
    );

assert_eq!(distribution, HashMap::from([(0, 7), (1, 8), (2, 7), (3, 2)]));

§Example: Depth-First Traversal

The successors primitive can be used to write a simple stack-based DFS traversal on the graph which starts from a given node and prints all the descendant nodes in its transitive closure:

use std::collections::HashSet;
use std::path::PathBuf;

use swh_graph::graph::*;

fn visit_nodes_dfs<G: SwhForwardGraph>(graph: G, root: NodeId) {
    let mut stack = Vec::new();
    let mut visited = HashSet::new();

    stack.push(root);
    visited.insert(root);

    while let Some(node) = stack.pop() {
        println!("{}", node);

        for succ in graph.successors(node) {
            if !visited.contains(&succ) {
                stack.push(succ);
                visited.insert(succ);
            }
        }
    }
}

§Example: Breadth-First Traversal

Swapping the stack for a queue changes the traversal order from depth-first to breadth-first:

use std::collections::{HashSet, VecDeque};
use std::path::PathBuf;

use swh_graph::graph::*;

fn visit_nodes_bfs<G: SwhForwardGraph>(graph: G, root: NodeId) {
    let mut queue = VecDeque::new();
    let mut visited = HashSet::new();

    queue.push_back(root);
    visited.insert(root);

    while let Some(node) = queue.pop_front() {
        println!("{}", node);

        for succ in graph.successors(node) {
            if !visited.contains(&succ) {
                queue.push_back(succ);
                visited.insert(succ);
            }
        }
    }
}

§Node types and SWHIDs

In the Software Heritage archive, nodes are not represented by a simple integer, but by a SWHID, which contain both the type of the node (revision, directory, blob…) and its unique identifier. We use node mappings which allow us to translate between SWHIDs and the compact node IDs used in the compressed graph.

Most notably, we use a MPH (Minimal Perfect Hash) function, a Rust rewrite of the one implemented in the GOVMinimalPerfectHashFunction class of the Sux4J library, which maps N keys to N consecutive integers with no collisions.

The following files are used to store the mappings between the nodes and their types:

graph.cmph: contains a serialized minimal perfect hash function computed on the list of all the SWHIDs in the graph.
graph.order: contains the permutation that associates with each output of the MPH the node ID to which it corresponds
graph.node2swhid.bin: contains a compact binary representation of all the SWHIDs in the graph, ordered by their rank in the graph file.
graph.node2type.bin: contains an array of three-bits integers, which each represents the type of a node.

This struct exposes (through .properties()) the following methods:

swhid(NodeId) -> SWHID: returns the SWHID associated with a given node ID. This function does a lookup of the SWHID at offset i in the file graph.node2swhid.bin.
node_id(SWHID) -> Result<NodeId, _>: returns the node ID associated with a given SWHID. It works by hashing the SWHID with the function stored in graph.mph, then permuting it using the permutation stored in graph.order. It does additional domain-checking by calling getSWHID() on its own result to check that the input SWHID was valid.
swhid(NodeId) -> NodeType: returns the type of a given node, as an enum of all the different object types in the Software Heritage data model. It does so by looking up the value at offset i in the bit vector stored in graph.node2type.bin.

§Example: Find the target directory of a revision

As an example, we use the methods mentioned above to perform the following task: “given a revision, return its target directory”. To do so, we first look up the node ID of the given revision in the compressed graph. We iterate on the successors of that node, and return the SWHID of the first destination node that has the “directory” type.

use anyhow::{anyhow, ensure, Result};
use swh_graph::graph::{SwhForwardGraph, SwhGraphWithProperties};
use swh_graph::properties;
use swh_graph::{NodeType, SWHID};

fn find_directory_of_revision<G>(graph: &G, rev_swhid: SWHID) -> Result<SWHID>
where
    G: SwhForwardGraph + SwhGraphWithProperties<Maps: properties::Maps>,
{
    let rev = graph.properties().node_id(rev_swhid)?;
    ensure!(graph.properties().node_type(rev) == NodeType::Revision);
    for succ in graph.successors(rev) {
        if graph.properties().node_type(succ) == NodeType::Directory {
            return Ok(graph.properties().swhid(succ));
        }
    }

    Err(anyhow!("Revision has no target directory"))
}

§Node properties

The Software Heritage Graph is a property graph, which means it has various properties associated with its nodes and edges (e.g., commit timestamps, authors, messages, …). We compress these properties and store them in files alongside the compressed graph. This allows you to write traversal algorithms that depend on these properties.

By default, properties are not assumed to be present are not loaded when the graph itself is loaded. If you want to use a property, you need to explicitly load it first. As an example, this is how you load the “content length” property to get the length of a given blob:

use std::path::PathBuf;

use anyhow::{Result, ensure};

use swh_graph::graph::SwhGraphWithProperties;
use swh_graph::mph::DynMphf;
use swh_graph::properties;
use swh_graph::{NodeType, SWHID};

fn main() {
    let graph = swh_graph::graph::load_full::<swh_graph::mph::DynMphf>(PathBuf::from("./graph"))
        .unwrap();

    let swhid = SWHID::try_from("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2").unwrap();
    println!("Content length: {:?}", get_content_length(&graph, swhid).unwrap())
}

fn get_content_length<G>(graph: &G, cnt_swhid: SWHID) -> Result<Option<u64>>
where
    G: SwhGraphWithProperties<Maps: properties::Maps, Contents: properties::Contents>,
{
    let cnt= graph.properties().node_id(cnt_swhid)?;
    ensure!(graph.properties().node_type(cnt) == NodeType::Content);
    Ok(graph.properties().content_length(cnt)) // Note this may be None if unknown size
}

The documentation of the SwhGraphProperties class lists all the different properties, their types, and the methods used to load them and to get their value for a specific node.

A few things of note:

A single loading call can load multiple properties at once; this is because they are stored in the same file to be more space efficient.
Persons (authors, committers etc) are exported as a single pseudonymized integer ID, that uniquely represents a full name + email.
Timestamps are stored as an i64 (for the timestamp itself) and an i16 (for the UTC offset).

§Edge labels

While looking up graph properties on the nodes of the graph is relatively straightforward, doing so for labels on the arcs is comparatively more difficult. These include the names and permissions of directory entries, as well as the branch names of snapshots.

SwhUnidirectionalGraph and SwhBidirectionalGraph both conditionally implement the SwhLabeledForwardGraph trait; and the latter also implements (SwhLabeledBackwardGraph trait), which mirror the successors and predecessors methods we saw before, but augment them to also look up the label of arcs while iterating on them.

To get them to implement these traits, you need to transform them with the load_labels method:

This labelled graph is stored in the following files:

graph-labelled.properties: a property file describing the graph, notably containing the basename of the wrapped graph.
graph-labelled.labels: the compressed labels
graph-labelled.labeloffsets: the offsets used to access the labels in random order.

The SwhUnidirectionalGraph and SwhBidirectionalGraph structs contain a load_labels method. This method consumes the un-labled struct, and returns a new one with labels available. The following methods can then be used:

labeled_successors(NodeId): returns an iterator of (NodeId, iterator of EdgeLabel).
untyped_labeled_successors(NodeId): returns an iterator of (NodeId, iterator of UntypedEdgeLabel), which can be more efficient if you know the type of an edge label because of knowledge you have of the node.

§Label format

The labels of each arc are returned as an iterator of EdgeLabel. They encode either:

for directory entries: both the name of a directory entry and its permissions
for snapshot branches, only the “name” field is useful.
for origin visits: the timestamp of the visit, and a bit indicating if this is a “full” visit (as opposed to a visit that failed halfway)

Arc label names are encoded as an integer ID representing each unique entry/branch name present in the graph. To retrieve the actual name associated with a given label ID, one needs to load the reverse mapping with properties.load_label_names(), similar to how you would do for a normal property.

§Multiedges

The Software Heritage is not a simple graph, where at most one edge can exist between two vertices, but a multigraph, where multiple edges can be incident to the same two vertices. Consider for instance the case of a single directory test/ containing twice the same file blob (e.g., the empty file), under two different names (e.g., ISSUES.txt and TODO.txt, both completely empty). The simple graph view of this directory will represent it as a single edge test → empty file, while the multigraph view will represent it as two edges between the same nodes.

Due to the copy-list model of compression, WebGraph only stores simple graphs, and thus stores multiedges as single edges, to which we cannot associate a single label name (in our example, we need to associate both names ISSUES.txt and TODO.txt). To represent this possibility of having multiple file names for a single arc, in the case of multiple relationships between two identical nodes, each arc label is stored as an array of EdgeLabel, each record representing one relationship between two nodes.

§Example: Printing all the entries of a directory

The following code showcases how one can print all the entries (name, permission and target SWHID) of a given directory, using the labelled methods seen above.

use swh_graph::mph::DynMphf;
use swh_graph::labels::EdgeLabel;

let graph = swh_graph::graph::load_full::<swh_graph::mph::DynMphf>(PathBuf::from("./graph"))
    .expect("Could not load graph");

let node_id: usize = graph
    .properties()
    .node_id("swh:1:dir:5e1c24e586ef92dbef0e9cec6b354c6831454340")
    .expect("Unknown SWHID");

for (succ, labels) in graph.labeled_successors(node_id) {
    for label in labels {
        if let EdgeLabel::DirEntry(label) = label {
            println!(
                "{} -> {} (permission: {:?}; name: {})",
                node_id,
                succ,
                label.permission(),
                String::from_utf8(
                    graph.properties().label_name(label.filename_id())
                ).expect("Could not decode file name as UTF-8")
            );
        }
    }
}

§Transposed graph

Up until now, we have only looked at how to traverse the forward graph, i.e., the directed graph whose edges are in the same direction as the Merkle DAG of the Software Heritage archive. For many purposes, especially that of finding the provenance of software artifacts, it is useful to query the backward (or transposed) graph instead, which is the same as the forward graph except all the edges are reversed.

The transposed graph has its own set of files, counterparts to the files needed for the forward graph:

graph-transposed.graph
graph-transposed.properties
graph-transposed.offsets
graph-transposed.ef
graph-transposed-labelled.labels
graph-transposed-labelled.labeloffsets
graph-transposed-labelled.properties

However, because node IDs are the same in the forward and the backward graph, all the files that pertain to mappings between the node IDs and various properties (SWHIDs, property data, node permutations etc) remain the same.

§Example: Earliest revision containing a given blob

The following code loads all the committer timestamps of the revisions in the graph, then walks the transposed graph to return the earliest revision containing a given object.

use std::collections::HashSet;
use std::path::PathBuf;
use clap::Parser;
use swh_graph::graph::{NodeId, SwhBackwardGraph, SwhLabeledBackwardGraph, SwhGraphWithProperties};
use swh_graph::mph::DynMphf;
use swh_graph::labels::{EdgeLabel};
use swh_graph::properties;
use swh_graph::{NodeType, SWHID};

fn find_earliest_revision_containing<G>(graph: &G, src: NodeId) -> Option<NodeId>
where
    G: SwhBackwardGraph + SwhGraphWithProperties,
    <G as SwhGraphWithProperties>::Maps: properties::Maps,
    <G as SwhGraphWithProperties>::Timestamps: properties::Timestamps,
{

    let mut oldest_rev = None;
    let mut oldest_rev_ts = i64::MAX;

    let mut stack = Vec::new();
    let mut visited = HashSet::new();

    stack.push(src);
    visited.insert(src);

    while let Some(node) = stack.pop() {
        for pred in graph.predecessors(node) {
            if !visited.contains(&pred) {
                stack.push(pred);
                visited.insert(pred);
                if graph.properties().node_type(pred) == NodeType::Revision {
                    if let Some(ts) = graph.properties().committer_timestamp(pred) {
                        if ts < oldest_rev_ts {
                            oldest_rev = Some(pred);
                            oldest_rev_ts = ts;
                        }
                    }
                }
            }
        }
    }

    oldest_rev
}

#[derive(Parser, Debug)]
#[command(version, about, long_about = None)]
struct Args {
    /// Path to the graph to load
    #[arg(short, long)]
    graph: PathBuf,

    /// SWHID of the content to lookup
    #[arg(short)]
    swhid: String,
}


fn main() {
    let args = Args::parse();

    let graph = swh_graph::graph::SwhBidirectionalGraph::new(args.graph)
        .expect("Could not load graph")
        .init_properties()
        .load_properties(|properties| properties.load_maps::<DynMphf>())
        .expect("Could not load maps")
        .load_properties(|properties| properties.load_timestamps())
        .expect("Could not load timestamps")
        .load_labels()
        .expect("Could not load labels");

    let swhid = SWHID::try_from(args.swhid.as_str()).expect("Invalid SWHID");
    let node = graph.properties().node_id(swhid).expect("Unknown SWHID");
    let oldest_rev = find_earliest_revision_containing(&graph, node);
    if let Some(oldest_rev) = oldest_rev {
        let rev_swhid = graph.properties().swhid(oldest_rev);
        println!("The oldest revision containing {} is {}", swhid, rev_swhid);
    } else {
        println!("No revisions contains {}", swhid);
    }
}

Module _tutorialCopy item path