Crate swh_digestmap

§swh-digestmap

A tool to create a map of Software Heritage content hashes, from SWHIDs to SHA1, and a Python binding to access this map.

Designed after the idea of a hash conversion service. The current implementation is tailored for swh-fuse’s “HPC” variant and relies on VFunc.

Run tests with cargo test --all-features.

A Digestmap is stored as a folder containing 3 files:

  • sha1_git.bin, the table of hashes known by the digestmap,
  • sha1.bin, the table of corresponding sha1 hashes,
  • sha1_git.vfunc, a serialized static function that maps a sha1_git to its index in both tables.

Note: before it can read the digestmap, the library needs to load the vfunc file into memory. The two other files are memory-mapped. As a result, reading the complete archive’s map requires at least 128GB of RAM, and 1TB to work fully in-memory.
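The three-file layout described above can be sketched as a lookup over two parallel tables of fixed-size records. This is only an illustration, not the crate's API: `lookup_sha1`, `HASH_BYTES` and the `index_of` closure are hypothetical names, and `index_of` stands in for the static function stored in sha1_git.vfunc. The sketch assumes 20-byte records in both tables, and that an unknown key may be mapped to an arbitrary index, which is why the candidate slot is compared against the key.

```rust
/// Size in bytes of one record (a SHA1 or sha1_git digest). Assumed here.
const HASH_BYTES: usize = 20;

/// Looks up the SHA1 matching `key` in two parallel tables of 20-byte records.
///
/// `index_of` is a hypothetical stand-in for the static function loaded from
/// `sha1_git.vfunc`; the real crate relies on VFunc instead.
fn lookup_sha1(
    sha1_git_table: &[u8],
    sha1_table: &[u8],
    index_of: impl Fn(&[u8; HASH_BYTES]) -> usize,
    key: &[u8; HASH_BYTES],
) -> Option<[u8; HASH_BYTES]> {
    // Byte range of the candidate record, guarding against overflow.
    let start = index_of(key).checked_mul(HASH_BYTES)?;
    let end = start.checked_add(HASH_BYTES)?;

    // Assumption: the index of an unknown key is meaningless, so the record
    // stored at that index must be compared with the key before trusting it.
    if sha1_git_table.get(start..end)? != &key[..] {
        return None;
    }

    // The same index addresses the corresponding record in the SHA1 table.
    let mut sha1 = [0u8; HASH_BYTES];
    sha1.copy_from_slice(sha1_table.get(start..end)?);
    Some(sha1)
}
```

In this sketch the two tables are plain byte slices, so they could equally come from memory-mapped files or from buffers held fully in memory.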

§Installation

The default installation, cargo install swh-digestmap, builds and installs the swh-digestmap-map binary, which can look up mappings in an already built map. To build maps yourself, install with cargo install swh-digestmap --features=build, which also builds and installs the swh-digestmap-build binary.

§Build a digestmap

The program that creates a map has been isolated in the build feature, because it is mostly intended for Software Heritage’s internal use. Building a digestmap requires working fully in-memory, so please size your machine accordingly.

The program needs an ORC-exported dataset (only the content subfolder).

# Reference to a directory containing a Software Heritage export in ORC format.
# It must contain a subdirectory named `content`.
ORC_EXPORT_DIR=$HOME/swh-environment/swh-graph/swh/graph/example_dataset/orc
swh-digestmap-build --orc $ORC_EXPORT_DIR --dir-out digestmap_dir

§Find a SHA1 from a SWHID

We advise using the Rust or Python API directly, but for short tests this can also be done on the CLI as follows (where digestmap_dir is the directory generated by the build command above):

swh-digestmap-map --swhid swh:1:cnt:0000000000000000000000000000000000000004 digestmap_dir
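To make the input format of the command above concrete, here is a minimal sketch of how a content SWHID maps to the 20 raw bytes of a sha1_git. The function name `parse_content_swhid` is hypothetical and not part of this crate's API. The sketch assumes a core identifier of the form `swh:1:cnt:` followed by 40 lowercase hexadecimal digits, and does not handle qualifiers.

```rust
/// Parses a content SWHID (`swh:1:cnt:<40 lowercase hex digits>`) into the
/// 20 raw bytes of its sha1_git. Hypothetical helper, for illustration only.
fn parse_content_swhid(swhid: &str) -> Option<[u8; 20]> {
    // Only content objects (`cnt`) in version 1 of the scheme are accepted.
    let hex = swhid.strip_prefix("swh:1:cnt:")?;

    // Reject wrong lengths and anything that is not a lowercase hex digit.
    let is_lower_hex = |b: u8| matches!(b, b'0'..=b'9' | b'a'..=b'f');
    if hex.len() != 40 || !hex.bytes().all(is_lower_hex) {
        return None;
    }

    // Decode each pair of hex digits into one byte.
    let mut digest = [0u8; 20];
    for (i, byte) in digest.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(digest)
}
```

Identifiers for other object types, such as `swh:1:dir:...`, yield `None` in this sketch, since the digestmap described here covers content hashes.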

Modules§

build
table

Structs§

DigestMap
Sha1
Sha1Git

Constants§

SHA1_BYTES
SHA1_FILENAME
SHA1_GIT_BYTES
SHA1_GIT_FILENAME
VFUNC_FILENAME