1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
/*
* SPDX-FileCopyrightText: 2023 Inria
* SPDX-FileCopyrightText: 2023 Sebastiano Vigna
*
* SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
*/
//! A compressed graph representation using the techniques described in "[The
//! WebGraph Framework I: Compression Techniques][BvGraph paper]", by Paolo
//! Boldi and Sebastiano Vigna, in _Proc. of the 13th international conference
//! on World Wide Web_, WWW 2004, pages 595–602, ACM.
//!
//! This module provides a flexible way to store and access graphs in compressed
//! form. A compressed graph with basename `BASENAME` is described by:
//!
//! - a _graph file_ (`BASENAME.graph`): a bitstream containing the compressed
//! representation of the graph;
//! - a _properties file_ (`BASENAME.properties`): metadata about the graph and
//! the compression parameters;
//! - an _offsets file_ (`BASENAME.offsets`): a bitstream of γ-coded gaps between
//! the bit offsets of each successor list in the graph file.
//!
//! Additionally, an [Elias–Fano] representation of the offsets
//! (`BASENAME.ef`), necessary for random access, can be built using the
//! `webgraph build ef` command.
//!
//! The implementation is compatible with the [Java
//! implementation](http://webgraph.di.unimi.it/), but it provides also a
//! little-endian version.
//!
//! The main access points to the implementation are [`BvGraph::with_basename`]
//! and [`BvGraphSeq::with_basename`], which provide a [`LoadConfig`] that can
//! be further customized (e.g., selecting endianness, memory mapping, etc.).
//!
//! # The Graph File
//!
//! The graph is stored as a bitstream. The format depends on a number of
//! parameters and encodings that can be mixed orthogonally. The parameters are:
//!
//! - the _window size_, a nonnegative integer;
//! - the _maximum reference count_, a positive integer (it is meaningful only
//! when the window is nonzero);
//! - the _minimum interval length_, an integer ≥ 2, or 0, which is interpreted
//! as infinity.
//!
//! ## Successor lists
//!
//! The graph file is a sequence of successor lists, one for each node. The list
//! of node _x_ can be thought of as a sequence of natural numbers (even though,
//! as we will explain later, this sequence is further coded suitably as a
//! sequence of bits):
//!
//! 1. The _outdegree_ of the node; if it is zero, the list ends here.
//!
//! 2. If the window size is not zero, the _reference part_, that is:
//! 1. a nonnegative integer, the _reference_, which never exceeds the window
//! size; if the reference is _r_, the list of successors will be specified
//! as a modified version of the list of successors of _x_ − _r_; if _r_
//! is 0, then the list of successors will be specified explicitly;
//! 2. if _r_ is nonzero:
//! - a natural number β, the _block count_;
//! - a sequence of β natural numbers *B*₁, …, *B*ᵦ, called the
//! _copy-block list_; only the first number can be zero.
//!
//! 3. Then comes the _extra part_, specifying additional entries that the list
//! of successors contains (or all of them, if _r_ is zero), that is:
//! 1. If the minimum interval length is finite:
//! - an integer _i_, the _interval count_;
//! - a sequence of _i_ pairs, whose first component is the left extreme
//! of an interval, and whose second component is the length of the
//! interval (the number of integers contained in it).
//! 2. Finally, the list of _residuals_, which contain all successors not
//! specified by previous methods.
//!
//! The above data should be interpreted as follows:
//!
//! - The reference part, if present (i.e., if both the window size and the
//! reference are positive), specifies that part of the list of successors of
//! node _x_ − _r_ should be copied; the successors of node _x_ − _r_ that
//! should be copied are described in the copy-block list; more precisely, one
//! should copy the first *B*₁ entries of this list, discard the next *B*₂,
//! copy the next *B*₃, etc. (the last remaining elements of the list of
//! successors will be copied if β is even, and discarded if β is odd).
//!
//! - The extra part specifies additional successors (or all of them, if the
//! reference part is absent); the extra part is not present if the number of
//! successors that are to be copied according to the reference part already
//! coincides with the outdegree of _x_; the successors listed in the extra
//! part are given in two forms:
//! - some of them are specified as belonging to (integer) intervals, if the
//! minimum interval length is finite; the interval count indicates how many
//! intervals, and the intervals themselves are listed as pairs (left
//! extreme, length);
//! - the residuals are the remaining "scattered" successors.
//!
//! ## How Successor Lists Are Coded
//!
//! The list of integers corresponding to each successor list is coded into a
//! sequence of bits. This is done in two phases: we first modify the sequence
//! so to obtain another sequence of integers (some of them might be negative).
//! Then each integer is coded, using a coding that can be specified as an
//! option; the integers that may be negative are first turned into natural
//! numbers using the standard bijection.
//!
//! 1. The outdegree of the node is left unchanged, as well as the reference and
//! the block count.
//! 2. All blocks are decremented by 1, except for the first one.
//! 3. The interval count is left unchanged.
//! 4. All interval lengths are decremented by the minimum interval length.
//! 5. The first left extreme is expressed as its difference from _x_ (it will
//! be negative if the first extreme is less than _x_); the remaining left
//! extremes are expressed as their distance from the previous right extreme
//! plus 2 (e.g., if the interval is \[5..11\] and the previous one was
//! \[1..3\], then the left extreme 5 is expressed as 5 − (3 + 2) = 0).
//! 6. The first residual is expressed as its difference from _x_ (it will be
//! negative if the first residual is less than _x_); the remaining residuals
//! are expressed as decremented differences from the previous residual.
//!
//! # The Offsets File
//!
//! Since the graph is stored as a bitstream, we must have some way to know
//! where each successor list starts. This information is stored in the offset
//! file, which contains the bit offset of each successor list as a γ-coded gap
//! from the previous offset (in particular, the offset of the first successor
//! list will be zero). As a convenience, the offset file contains an additional
//! offset pointing just after the last successor list (providing, as a
//! side-effect, the actual bit length of the graph file).
//!
//! For random access, the list of offsets is stored as an [Elias–Fano]
//! representation using [ε-serde]. Building such a representation is a
//! prerequisite for random access and can be done using the `webgraph build ef`
//! command.
//!
//! [BvGraph paper]: <http://vigna.di.unimi.it/papers.php#BoVWFI>
//! [Elias–Fano]: <https://docs.rs/sux/latest/sux/dict/elias_fano/struct.EliasFano.html>
//! [ε-serde]: <https://docs.rs/epserde/latest/epserde/>
use Path;
use crate*;
pub const GRAPH_EXTENSION: &str = "graph";
pub const PROPERTIES_EXTENSION: &str = "properties";
pub const OFFSETS_EXTENSION: &str = "offsets";
pub const EF_EXTENSION: &str = "ef";
pub const LABELS_EXTENSION: &str = "labels";
pub const LABELOFFSETS_EXTENSION: &str = "labeloffsets";
pub const DEG_CUMUL_EXTENSION: &str = "dcf";
use ;
use DeserInner;
pub use OffsetDegIter;
pub use BvGraphSeq;
pub use BvGraph;
pub use MaskedIter;
pub use *;
pub use *;
pub use *;
use IndexedSeq;
/// The default version of EliasFano we use for the CLI.
pub type EF = EliasFano;
/// Compound trait expressing the trait bounds for offsets.
///
/// See the [`MemCase`](epserde::deser::MemCase) documentation for an
/// explanation as to why we bound first with [`DeserInner`] and then require
/// the bound we are interested in on the associated deserialization type.
/// The default type we use for the cumulative function of degrees.
///
/// It provides an [indexed dictionary](sux::traits::indexed_dict::IndexedDict) with
/// [successor](sux::traits::indexed_dict::Succ) and [predecessor](sux::traits::indexed_dict::Pred) support.
///
/// This is the type returned by [`crate::traits::labels::SequentialLabeling::build_dcf`].
pub type DCF = EliasFano;
/// Checks that the offsets stored in the offsets file with given
/// basename are correct for the given [`BvGraphSeq`].