§swh-digestmap
A tool to create a map of Software Heritage content hashes, from SWHIDs to SHA1, and a Python binding to access this map.
Designed after a hash conversion service idea. Current implementation is tailored for swh-fuse’s “HPC” variant and relies on VFunc.
Run tests with cargo test --all-features.
A Digestmap is stored as a folder containing 3 files:
- sha1_git.bin, the table of hashes known by the digestmap,
- sha1.bin, the table of corresponding sha1 hashes,
- sha1_git.vfunc, a serialized static function that maps a sha1_git to its index in both tables.
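The layout above can be pictured with a small in-memory sketch. It assumes that each table is a flat array of 20-byte records (the raw size of a SHA1 digest) and uses a plain Python dict as a stand-in for the serialized static function. The names RECORD_SIZE, git_blob_sha1, build_tables and lookup are hypothetical and are not part of the swh-digestmap API.

```python
import hashlib

RECORD_SIZE = 20  # assumption: tables hold raw SHA1 digests, 20 bytes each


def git_blob_sha1(data: bytes) -> bytes:
    """sha1_git of a content: SHA1 over the git blob header plus the data."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).digest()


def build_tables(contents):
    """Return (sha1_git table, sha1 table, key -> index stand-in)."""
    pairs = [(git_blob_sha1(c), hashlib.sha1(c).digest()) for c in contents]
    sha1_git_table = b"".join(key for key, _ in pairs)
    sha1_table = b"".join(value for _, value in pairs)
    # Stand-in for the static function: a dict from sha1_git to its row index.
    index_of = {key: i for i, (key, _) in enumerate(pairs)}
    return sha1_git_table, sha1_table, index_of


def lookup(sha1_git, sha1_git_table, sha1_table, index_of):
    """Return the sha1 stored at the same index as sha1_git, or None."""
    i = index_of.get(sha1_git)
    if i is None:
        return None
    start = i * RECORD_SIZE
    # Assumption: a real static function may return some index even for an
    # unknown key, so the key table is checked to confirm the match.
    if sha1_git_table[start:start + RECORD_SIZE] != sha1_git:
        return None
    return sha1_table[start:start + RECORD_SIZE]
```

Because both tables share the same index, a single evaluation of the function is enough to reach the matching sha1 record.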
Note: before the digestmap can be read,
the library must load the vfunc file into memory.
The two other files are memory-mapped.
Reading the complete archive’s map therefore requires at least 128GB of RAM,
and 1TB to work fully in-memory.
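To illustrate why memory-mapped tables do not need to fit in RAM, here is a minimal sketch that reads one record from a file of fixed-width records with Python's standard mmap module. It assumes 20-byte records; read_record and RECORD_SIZE are hypothetical names, and this is not how the library itself is implemented.

```python
import mmap

RECORD_SIZE = 20  # assumption: raw SHA1 digests, 20 bytes each


def read_record(path, index):
    """Return record number `index` from a file of fixed-width records."""
    with open(path, "rb") as f:
        # Map the whole file read-only; pages are loaded lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as table:
            start = index * RECORD_SIZE
            if index < 0 or start + RECORD_SIZE > len(table):
                raise IndexError("record index out of range")
            return table[start:start + RECORD_SIZE]
```

Only the pages that are actually touched are brought into memory by the operating system, which is what lets a reader work with less RAM than the total size of the tables.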
§Installation
Default installation with cargo install swh-digestmap will build and install the swh-digestmap-map binary,
which can look up mappings in an already built map.
To be able to build maps yourself, install with cargo install swh-digestmap --features=build,
which will also build and install the swh-digestmap-build binary.
§Build a digestmap
The program that creates a map has been isolated in the build feature,
because it is mostly intended for Software Heritage’s internal use.
Building a digestmap requires working fully in-memory, so please size your machine accordingly.
The program needs an ORC-exported dataset (only the content subfolder).
# Reference to a directory containing a Software Heritage export in ORC format.
# It must contain a subdirectory named `content`.
ORC_EXPORT_DIR=$HOME/swh-environment/swh-graph/swh/graph/example_dataset/orc
swh-digestmap-build --orc $ORC_EXPORT_DIR --dir-out digestmap_dir
§Find a SHA1 from a SWHID
We advise using the Rust or Python API directly, but for short tests this can also be done on the CLI as follows
(where digestmap_dir is the directory generated by the build command above):
swh-digestmap-map --swhid swh:1:cnt:0000000000000000000000000000000000000004 digestmap_dir
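For context, a content SWHID carries the sha1_git as 40 hexadecimal characters after the swh:1:cnt: prefix, so the lookup key can be recovered from the identifier alone. The sketch below shows that extraction in plain Python; content_swhid_to_sha1_git is a hypothetical helper, not a function of the swh-digestmap Python binding.

```python
def content_swhid_to_sha1_git(swhid: str) -> bytes:
    """Extract the 20-byte sha1_git from a content SWHID (swh:1:cnt:<hex>)."""
    core = swhid.split(";", 1)[0]  # drop optional qualifiers, if any
    parts = core.split(":")
    if len(parts) != 4 or parts[:3] != ["swh", "1", "cnt"]:
        raise ValueError(f"not a content SWHID: {swhid!r}")
    hex_id = parts[3]
    if len(hex_id) != 40:
        raise ValueError(f"expected 40 hex characters, got {len(hex_id)}")
    digest = bytes.fromhex(hex_id)  # raises ValueError on non-hex input
    if len(digest) != 20:
        raise ValueError(f"malformed hex identifier: {hex_id!r}")
    return digest
```

Identifiers of other object types (for example swh:1:dir:...) are rejected, since the map described here covers content hashes only.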