Expand description
A compressed graph representation using the techniques described in “The WebGraph Framework I: Compression Techniques”, by Paolo Boldi and Sebastiano Vigna, in Proc. of the 13th international conference on World Wide Web, WWW 2004, pages 595–602, ACM.
This module provides a flexible way to store and access graphs in compressed
form. A compressed graph with basename BASENAME is described by:
- a graph file (
BASENAME.graph): a bitstream containing the compressed representation of the graph; - a properties file (
BASENAME.properties): metadata about the graph and the compression parameters; - an offsets file (
BASENAME.offsets): a bitstream of γ-coded gaps between the bit offsets of each successor list in the graph file.
Additionally, an Elias–Fano representation of the offsets
(BASENAME.ef), necessary for random access, can be built using the
webgraph build ef command.
The implementation is compatible with the Java implementation, but it provides also a little-endian version.
The main access points to the implementation are BvGraph::with_basename
and BvGraphSeq::with_basename, which provide a LoadConfig that can
be further customized (e.g., selecting endianness, memory mapping, etc.).
§The Graph File
The graph is stored as a bitstream. The format depends on a number of parameters and encodings that can be mixed orthogonally. The parameters are:
- the window size, a nonnegative integer;
- the maximum reference count, a positive integer (it is meaningful only when the window is nonzero);
- the minimum interval length, an integer ≥ 2, or 0, which is interpreted as infinity.
§Successor lists
The graph file is a sequence of successor lists, one for each node. The list of node x can be thought of as a sequence of natural numbers (even though, as we will explain later, this sequence is further coded suitably as a sequence of bits):
-
The outdegree of the node; if it is zero, the list ends here.
-
If the window size is not zero, the reference part, that is:
- a nonnegative integer, the reference, which never exceeds the window size; if the reference is r, the list of successors will be specified as a modified version of the list of successors of x − r; if r is 0, then the list of successors will be specified explicitly;
- if r is nonzero:
- a natural number β, the block count;
- a sequence of β natural numbers B₁, …, Bᵦ, called the copy-block list; only the first number can be zero.
-
Then comes the extra part, specifying additional entries that the list of successors contains (or all of them, if r is zero), that is:
- If the minimum interval length is finite:
- an integer i, the interval count;
- a sequence of i pairs, whose first component is the left extreme of an interval, and whose second component is the length of the interval (the number of integers contained in it).
- Finally, the list of residuals, which contain all successors not specified by previous methods.
- If the minimum interval length is finite:
The above data should be interpreted as follows:
-
The reference part, if present (i.e., if both the window size and the reference are positive), specifies that part of the list of successors of node x − r should be copied; the successors of node x − r that should be copied are described in the copy-block list; more precisely, one should copy the first B₁ entries of this list, discard the next B₂, copy the next B₃, etc. (the last remaining elements of the list of successors will be copied if β is even, and discarded if β is odd).
-
The extra part specifies additional successors (or all of them, if the reference part is absent); the extra part is not present if the number of successors that are to be copied according to the reference part already coincides with the outdegree of x; the successors listed in the extra part are given in two forms:
- some of them are specified as belonging to (integer) intervals, if the minimum interval length is finite; the interval count indicates how many intervals, and the intervals themselves are listed as pairs (left extreme, length);
- the residuals are the remaining “scattered” successors.
§How Successor Lists Are Coded
The list of integers corresponding to each successor list is coded into a sequence of bits. This is done in two phases: we first modify the sequence so to obtain another sequence of integers (some of them might be negative). Then each integer is coded, using a coding that can be specified as an option; the integers that may be negative are first turned into natural numbers using the standard bijection.
- The outdegree of the node is left unchanged, as well as the reference and the block count.
- All blocks are decremented by 1, except for the first one.
- The interval count is left unchanged.
- All interval lengths are decremented by the minimum interval length.
- The first left extreme is expressed as its difference from x (it will be negative if the first extreme is less than x); the remaining left extremes are expressed as their distance from the previous right extreme plus 2 (e.g., if the interval is [5..11] and the previous one was [1..3], then the left extreme 5 is expressed as 5 − (3 + 2) = 0).
- The first residual is expressed as its difference from x (it will be negative if the first residual is less than x); the remaining residuals are expressed as decremented differences from the previous residual.
§The Offsets File
Since the graph is stored as a bitstream, we must have some way to know where each successor list starts. This information is stored in the offset file, which contains the bit offset of each successor list as a γ-coded gap from the previous offset (in particular, the offset of the first successor list will be zero). As a convenience, the offset file contains an additional offset pointing just after the last successor list (providing, as a side-effect, the actual bit length of the graph file).
For random access, the list of offsets is stored as an Elias–Fano
representation using ε-serde. Building such a representation is a
prerequisite for random access and can be done using the webgraph build ef
command.
Re-exports§
pub use sequential::BvGraphSeq;pub use random_access::BvGraph;
Modules§
- factories
- Factories for bit readers.
- random_
access - sequential
Structs§
- BvComp
- Compresses a graph into the BV graph format.
- BvComp
Config - Configures and runs BvGraph compression.
- BvCompZ
- Compresses a graph into the BV graph format using the reference-selection algorithm inspired by “Zuckerli: A New Compressed Representation for Graphs”, by Daniel Marzocchi, Luca Versari, Robert Obryk, and Jyrki Alakuijala, Proc. 2020 Data Compression Conference (DCC), IEEE, 2020.
- Comp
Flags - The compression flags for reading or compressing a graph.
- Comp
Stats - Compression statistics for a BvGraph compression.
- Const
Codes Decoder - An implementation of
Decodewith compile-time defined codes. - Const
Codes Decoder Factory - Const
Codes Encoder - An implementation of
EncodeAndEstimatewith compile time defined codes - Const
Codes Estimator - Debug
Decoder - A debug wrapper on a code read that prints the codes it reads to stderr
- Decoder
Stats - A struct that keeps track of how much bits each piece would take using different codes for compression.
- DynCodes
Decoder - DynCodes
Decoder Factory - DynCodes
Encoder - DynCodes
Estimator - Dynamic
- Dynamic dispatch.
- Empty
Dict - File
- The graph is read from a file; offsets are fully deserialized in memory.
- File
Factory - Load
Config - A load configuration for a
BvGraph/BvGraphSeq. - LoadMem
- The graph and offsets are loaded into allocated memory.
- Load
Mmap - The graph and offsets are loaded into memory obtained via
mmap(). - Masked
Iter - An iterator that filters out blocks of values.
- Memory
Factory - Memory
Flags - Flags for
MemoryFactoryandMmapHelper. - Mmap
- The graph and offsets are memory mapped.
- Offset
DegIter - Fast iterator over the offsets and degrees of a
BvGraph. - Offsets
Writer - A writer for offsets.
- Random
- Sequential
- Static
- Static dispatch.
- Stats
Decoder - A wrapper over a generic
Decodethat keeps track of how much bits each piece would take using different codes for compressions
Constants§
- DEG_
CUMUL_ EXTENSION - EF_
EXTENSION - GRAPH_
EXTENSION - LABELOFFSETS_
EXTENSION - LABELS_
EXTENSION - OFFSETS_
EXTENSION - PROPERTIES_
EXTENSION
Traits§
- Decode
- Methods to decode the component of a
super::BvGraphorsuper::BvGraphSeq. - Dispatch
StaticorDynamicdispatch.- Encode
- Methods to encode the components of a
super::BvGraphorsuper::BvGraphSeq. - Encode
AndEstimate - Load
Mode - Load mode.
- Offsets
- Compound trait expressing the trait bounds for offsets.
- Random
Access Decoder Factory - A trait providing decoders with random access.
- Sequential
Decoder Factory - A trait providing decoders on the whole graph.
Functions§
- check_
offsets - Checks that the offsets stored in the offsets file with given
basename are correct for the given
BvGraphSeq. - get_
endianness - Read the .properties file and return the endianness
- parse_
properties - Read the .properties file and return the number of nodes, number of arcs and compression flags for the graph. The endianness is checked against the expected one.
Type Aliases§
- DCF
- The default type we use for the cumulative function of degrees.
- EF
- The default version of EliasFano we use for the CLI.
- File
BufReader - A type alias for a buffered reader that reads from a file buffer a
u32at a time. - Load
Mode Codes Reader - A type alias for the code reader returned by the
CodesReaderFactoryassociated with aLoadMode. - Load
Mode Factory - A type alias for the
CodesReaderFactoryassociated with aLoadMode. - MemBuf
Reader - A type alias for a buffered reader that reads from a memory buffer a
u32at a time.