Module bvgraph

Expand description

A compressed graph representation using the techniques described in “The WebGraph Framework I: Compression Techniques”, by Paolo Boldi and Sebastiano Vigna, in Proc. of the 13th international conference on World Wide Web, WWW 2004, pages 595–602, ACM.

This module provides a flexible way to store and access graphs in compressed form. A compressed graph with basename BASENAME is described by:

a graph file (BASENAME.graph): a bitstream containing the compressed representation of the graph;
a properties file (BASENAME.properties): metadata about the graph and the compression parameters;
an offsets file (BASENAME.offsets): a bitstream of γ-coded gaps between the bit offsets of each successor list in the graph file.

Additionally, an Elias–Fano representation of the offsets (BASENAME.ef), necessary for random access, can be built using the webgraph build ef command.

The implementation is compatible with the Java implementation, but it provides also a little-endian version.

The main access points to the implementation are BvGraph::with_basename and BvGraphSeq::with_basename, which provide a LoadConfig that can be further customized (e.g., selecting endianness, memory mapping, etc.).

§The Graph File

The graph is stored as a bitstream. The format depends on a number of parameters and encodings that can be mixed orthogonally. The parameters are:

the window size, a nonnegative integer;
the maximum reference count, a positive integer (it is meaningful only when the window is nonzero);
the minimum interval length, an integer ≥ 2, or 0, which is interpreted as infinity.

§Successor lists

The graph file is a sequence of successor lists, one for each node. The list of node x can be thought of as a sequence of natural numbers (even though, as we will explain later, this sequence is further coded suitably as a sequence of bits):

The outdegree of the node; if it is zero, the list ends here.
If the window size is not zero, the reference part, that is:
1. a nonnegative integer, the reference, which never exceeds the window size; if the reference is r, the list of successors will be specified as a modified version of the list of successors of x − r; if r is 0, then the list of successors will be specified explicitly;
2. if r is nonzero:
  - a natural number β, the block count;
  - a sequence of β natural numbers B₁, …, Bᵦ, called the copy-block list; only the first number can be zero.
Then comes the extra part, specifying additional entries that the list of successors contains (or all of them, if r is zero), that is:
1. If the minimum interval length is finite:
  - an integer i, the interval count;
  - a sequence of i pairs, whose first component is the left extreme of an interval, and whose second component is the length of the interval (the number of integers contained in it).
2. Finally, the list of residuals, which contain all successors not specified by previous methods.

The above data should be interpreted as follows:

The reference part, if present (i.e., if both the window size and the reference are positive), specifies that part of the list of successors of node x − r should be copied; the successors of node x − r that should be copied are described in the copy-block list; more precisely, one should copy the first B₁ entries of this list, discard the next B₂, copy the next B₃, etc. (the last remaining elements of the list of successors will be copied if β is even, and discarded if β is odd).
The extra part specifies additional successors (or all of them, if the reference part is absent); the extra part is not present if the number of successors that are to be copied according to the reference part already coincides with the outdegree of x; the successors listed in the extra part are given in two forms:
- some of them are specified as belonging to (integer) intervals, if the minimum interval length is finite; the interval count indicates how many intervals, and the intervals themselves are listed as pairs (left extreme, length);
- the residuals are the remaining “scattered” successors.

§How Successor Lists Are Coded

The list of integers corresponding to each successor list is coded into a sequence of bits. This is done in two phases: we first modify the sequence so to obtain another sequence of integers (some of them might be negative). Then each integer is coded, using a coding that can be specified as an option; the integers that may be negative are first turned into natural numbers using the standard bijection.

The outdegree of the node is left unchanged, as well as the reference and the block count.
All blocks are decremented by 1, except for the first one.
The interval count is left unchanged.
All interval lengths are decremented by the minimum interval length.
The first left extreme is expressed as its difference from x (it will be negative if the first extreme is less than x); the remaining left extremes are expressed as their distance from the previous right extreme plus 2 (e.g., if the interval is [5..11] and the previous one was [1..3], then the left extreme 5 is expressed as 5 − (3 + 2) = 0).
The first residual is expressed as its difference from x (it will be negative if the first residual is less than x); the remaining residuals are expressed as decremented differences from the previous residual.

§The Offsets File

Since the graph is stored as a bitstream, we must have some way to know where each successor list starts. This information is stored in the offset file, which contains the bit offset of each successor list as a γ-coded gap from the previous offset (in particular, the offset of the first successor list will be zero). As a convenience, the offset file contains an additional offset pointing just after the last successor list (providing, as a side-effect, the actual bit length of the graph file).

For random access, the list of offsets is stored as an Elias–Fano representation using ε-serde. Building such a representation is a prerequisite for random access and can be done using the webgraph build ef command.

Re-exports§

pub use sequential::BvGraphSeq;
pub use random_access::BvGraph;

Modules§

factories: Factories for bit readers.
random_access
sequential

Structs§

BvComp: Compresses a graph into the BV graph format.
BvCompConfig: Configures and runs BvGraph compression.
BvCompZ: Compresses a graph into the BV graph format using the reference-selection algorithm inspired by “Zuckerli: A New Compressed Representation for Graphs”, by Daniel Marzocchi, Luca Versari, Robert Obryk, and Jyrki Alakuijala, Proc. 2020 Data Compression Conference (DCC), IEEE, 2020.
CompFlags: The compression flags for reading or compressing a graph.
CompStats: Compression statistics for a BvGraph compression.
ConstCodesDecoder: An implementation of Decode with compile-time defined codes.
ConstCodesDecoderFactory
ConstCodesEncoder: An implementation of EncodeAndEstimate with compile time defined codes
ConstCodesEstimator
DebugDecoder: A debug wrapper on a code read that prints the codes it reads to stderr
DecoderStats: A struct that keeps track of how much bits each piece would take using different codes for compression.
DynCodesDecoder
DynCodesDecoderFactory
DynCodesEncoder
DynCodesEstimator
Dynamic: Dynamic dispatch.
EmptyDict
File: The graph is read from a file; offsets are fully deserialized in memory.
FileFactory
LoadConfig: A load configuration for a BvGraph/BvGraphSeq.
LoadMem: The graph and offsets are loaded into allocated memory.
LoadMmap: The graph and offsets are loaded into memory obtained via mmap().
MaskedIter: An iterator that filters out blocks of values.
MemoryFactory
MemoryFlags: Flags for MemoryFactory and MmapHelper.
Mmap: The graph and offsets are memory mapped.
OffsetDegIter: Fast iterator over the offsets and degrees of a BvGraph.
OffsetsWriter: A writer for offsets.
Random
Sequential
Static: Static dispatch.
StatsDecoder: A wrapper over a generic Decode that keeps track of how much bits each piece would take using different codes for compressions

Constants§

DEG_CUMUL_EXTENSION
EF_EXTENSION
GRAPH_EXTENSION
LABELOFFSETS_EXTENSION
LABELS_EXTENSION
OFFSETS_EXTENSION
PROPERTIES_EXTENSION

Traits§

Decode: Methods to decode the component of a super::BvGraph or super::BvGraphSeq.
Dispatch: Static or Dynamic dispatch.
Encode: Methods to encode the components of a super::BvGraph or super::BvGraphSeq.
EncodeAndEstimate
LoadMode: Load mode.
Offsets: Compound trait expressing the trait bounds for offsets.
RandomAccessDecoderFactory: A trait providing decoders with random access.
SequentialDecoderFactory: A trait providing decoders on the whole graph.

Functions§

check_offsets: Checks that the offsets stored in the offsets file with given basename are correct for the given BvGraphSeq.
get_endianness: Read the .properties file and return the endianness
parse_properties: Read the .properties file and return the number of nodes, number of arcs and compression flags for the graph. The endianness is checked against the expected one.

Type Aliases§

DCF: The default type we use for the cumulative function of degrees.
EF: The default version of EliasFano we use for the CLI.
FileBufReader: A type alias for a buffered reader that reads from a file buffer a u32 at a time.
LoadModeCodesReader: A type alias for the code reader returned by the CodesReaderFactory associated with a LoadMode.
LoadModeFactory: A type alias for the CodesReaderFactory associated with a LoadMode.
MemBufReader: A type alias for a buffered reader that reads from a memory buffer a u32 at a time.

Module bvgraph

Module bvgraph Copy item path

§The Graph File

§Successor lists

§How Successor Lists Are Coded

§The Offsets File

Re-exports§

Modules§

Structs§

Constants§

Traits§

Functions§

Type Aliases§

Module bvgraph