bstree-file-readonly
Make and Query read-only binary-search tree file supporting billions of entries in files of tens of GB.
About
Immutable implicit naive Binary Search Tree structure stored in a file.
The tree structure (possibly larger than the available RAM) is created at once using bulk-loading. It is then possible to perform queries on it (nn query, knn query, range query, ...), but not to update it.
It has been developed for (with a more general usage than) static astronomical catalogues.
The datastructure is implicit: it is basically a flat array of entries ordered in a pre-defined way depending on a few parameters like the number of elements in the tree, the size of both the L1 and the disk caches.
The simple design inputs are:
- a metadata part followed by the data part
- data part as compact as possible, but without compression
- => hence the choice of an implicit structure with an unbalanced rightmost part of the tree
Remark: I do not claim this is the best possible structure, it is a quite naive implementation by a non-expert, rushing to have something working among other duties; any feedback welcome.
CLI
Contains 3 CLI tools:
mkbst
: to MaKe a Binary Search Tree fileqbst
: to Query a Binary Search Tree filegenfile
: to generate files with random values for testing purposes
Purpose
Perform fast queries on a single catalogue column. The binary-search tree basically stores both values and OIDs (row indices).
In input, the tools takes an identifiers (which can be an implicit sequential number) and a value. The indexation is made on the values. Queries return entries which basically are tuples containing an identifier and value couple.
Creation algorithm
Although the first step is an external k-way merge sort, the final file is not ordered sequentially. It consists in a sequence of binary search tree blocks. There is two levels of blocks:
- blocks fitting in the L1 cache
- groups of blocks fitting in the disk cache The full tree is not balanced:
- it is made of a main balance tree
- plus a rightmost sub-tree recursively consisting in
- a main balanced tree +plus a rightmost sub-tree... The tree has 0 unused byte.
Disclaimer
Main functionalities are complete (building and querying), but all parts may not necessarily be production ready: more testing is needed (please report any bug).
However, this code is already in production in VizieR, mainly to index large catalogues identifiers.
For performances purposes, the code makes a large use of monomorphization (no dynamic dispatch at all!). It leads to:
- very long compilation time (1min/10min in debug/release mode)
- large binaries:
mkbst
(tree creation) is about 9/65 MB in release/debug modeqbst
(tree query) is about 29/116 MB in release/debug mode
Other tools
For a larger project that may fulfill the need (and more), see:
Info
In the while project, we talk about key/value
pairs.
Although it may be counter-intuitive, we actually index the value
(e.g. a magnitude, which is not necessarily unique) and
we return the associated key
(e.g. a unique record number in a table).
A few ideas:
- common use cases:
- index a unique identifier (value) and return a recno (key)
- index magnitudes (value) and return recnos (key)
- to get the position of a Gaia source from its identifier:
index the unique identifier (value) and return formatted JNAMEs (string key). - for fast magnitude based density maps: index a magnitude (value) and return the order 12 healpix index (key).
- index CSV rows using as key row starting byte offset
Install
The standard way to install the mkbst
, qbst
and genfile
binaries is:
- install rust see here, possibly removing
--tlsv1.2
in the command line - fork and download this repository
- type
cargo install --path .
from the downloaded directory (can take ~1-2min) - WARNING: by default only a subset of (key, value) pair is available. For all possibilities, use
cargo install --path . --features "all"
See Cargo.toml for the list of features.
Fast compilation to test for a two key/value datatype couples, e.g. (unsigned integer/unsigned integer) and (unsigned integer/float):
Full compilation (can tae 10min!):
Example
Generate data (deprecated)
Bash script mkseq.bash to generate a simple sequence of value from 0 to n
:
#!/bin/bash
COUNTER=0
until [; do
done
Remark: one can use the shuf
command (first removing and then adding val
) to shuffle the input lines.
Generate data
The genfile
tool generates simple files to test the BSTree code.
Example:
|
Build a Binary Search Tree on 10 billion single precision floats having values in [0.0, 1.0)
.
Generate a very simple sequential file containing integers with both the id and value columns:
Remarks:
- the purpose of a same sequential number for both the identifier and the value is to test that a value is associated with the correct identifier.
- one can use the
shuf
command (first removing and then addingval
) to shuffle the input lines.
Create a tree
Create a simple tree containing 2 billion entries consisting of tuples having the same identifiers (a row number) and value (the row number):
|
Both identifiers and values are stored on regular 32bit unsigned integers.
Perform queries
Perform queries:
Test on 10 million random f32 values
Generate 10 million random f64 and create a bstree storing id on 32 bit integers and value on 32bit floats
|
Look at the nearest value from 0.5
Look at the 10 nearest values from 0.2 (the result is ordered by distance to 0.2)
Count the number of entries havig value in 0.4 and 0.6
Priny the value in the range 0.49999 and 0.50001 (the result is ordered by increasing values)
Benchmark
Test on my personal desktop (MVNe SSD, 16 GB RAM, AMD Ryzen) on a 20 GB file containing sequential numbers (from 0 to 1999999999).
> time
It took less than 10min to build the 15 GB ouput file (ok, the input file is already sorted).
> qbst
{
}
Simple exact value query:
> time
Nearest neighbour query
> time
)
K nearest neighbours query
> time
)
Range query
> time
At first execution, the limiting factor is the disk access. At the second execution, the limiting factor is the time required by the OS to handle the process. The 2ms includes the time needed to read the tree metadata.
Generate 100,000 random points in 0 - 2billion:
| |
In release mode (third execution):
i.e. a mean of less than 8 micro second per query (including parsing, conversion to string and writing the result in a file)! (The mean is ~0.14 ms/query at the first execution)
> time
> time
> time
- First execution: 0.16 ms/query
- Third execution: 7.60 us/query
We redo those query sorting the random number:
| | |
The results are similar.
We recall that the index file is 15 GB large, so 2nd execution is faster since the data is in the disk cache.
Bench on 10 billion f32 random values in [0, 1)
Generate the value and the index at once:
&
Or in two steps
The size of the index is ~85 GB
.
The queries have been tested on 2 distinct hardware:
- Desktop computer
- 1 TB, 7200 RPM, 32 MB cache HDD (HGST HTS721010A9E630)
- Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz (6 MB "smart cache")
- 16 GB DDR4 2133 MHz
- Server
- RAID of 5 SSDs (Samsung SATA SSDs)
- 2x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (25 MB "smart cache")
- 64 GB DDR4 2133 MHz
The table return the "real" time provided by the "time" command.
Each command starts by time qbst test.10b.bstree
plus the Query params
.
Here the command to generate the list input file in the queries using list
:
| |
Query params | 1st or 2nd | Result | Desktop | Server |
---|---|---|---|---|
nn value 0.5 | 1 | 0m0,071s | 0m0.013s | |
nn value 0.5 | 2 | 0m0,004s | 0m0.003s | |
knn -v 0.8 -k 10 | 1 | 0m0,034s | 0m0.007s | |
knn -v 0.8 -k 10 | 2 | 0m0,004s | 0m0.003s | |
all -v 0.8 -c | 1/2 | 588 | 0m0,004s | 0m0.005s |
all -v 0.2 -c | 1 | 148 | 0m0,026s | 0m0.011s |
all -v 0.2 -c | 2 | 148 | 0m0,004s | 0m0.003s |
range -f 0.4 -t 0.5 -c | 1/2 | 1000028688 | 1m5,450s | 1m57.212s |
range -f 0.150 -t 0.149 -c | 1 | 157 | 0m0,028s | 0m0.025s |
range -f 0.150 -t 0.149 -c | 2 | 157 | 0m0,004s | 0m0.003s |
get list 1000.csv > res.csv | 1 | 0m9,392s | 0m0.251s | |
get list 1000.csv > res.csv | 2 | 0m0,026s | 0m0.034s | |
get list 10000.csv > res.csv | 1 | 1m40,354s | 0m3.807s | |
get list 10000.csv > res.csv | 2 | 0m0,181s | 0m0.207s | |
get list 100000.csv > res.csv | 1 | 0m24.548s |
In the last case (100000 queries, no data in the disk cache), the mean query time is about 0.24 ms (which is probably not far from the disk access time).
In the previous results, we clearly see the effect of the spinning vs SSD disk at the first execution.
On the query range -f 0.4 -t 0.5 -c
we see that the server has a slower CPU.
The query get list 10000.csv
is very interesting (a factor more than x20 in performances!):
- the mean time on a spinning disk is about 10 ms
- the mean time on a SSD disk is about 0.4 ms
For the query get list 100000.csv > res.csv
, the result time is about the
same the input being sorted or not.
Bench with Gaia DR2 data (1.6 Billion entries)
With this BSTree index
The use case is simple: from the Gaia DR2
Source
unique identifier, I want to retrieve the associated position (I am using the formatted position
%015.9%+015.9f
as a string made of 30 ASCII chars).
Thus here the value I want to index is Source
and the associated identifier is the position
(yes it is different from a HashMap in which Source
would be the (unique) key and the position the associated value
because in a bs-tree the indexed value is not necessarily unique).
In input, the files looks like:
and contains 1,69,2919,136 rows (including the header line).
I build and exec the following script
#!/bin/bash
LC_NUMERIC=C LC_COLLATE=C; \
| |\
while ; do ; done |\
(the process is quite slow due to the bash while
and printf
).
I then have a 2.5 GB file Gaia_source.txt
containing more 132,739,322 Source
, looking like:
Two consecutive executions (slow 7200 RPM HDD) gives:
and
I guess that the second execution benefits from HDD cache.
Now sorting the input and querying again leads to:
The output file is 6.3 GB large.
With PSQL10
I install PSQL10 on Ubuntu via apt
, create a user and move the database out of the system disk:
I create two tables (the index and the data tables) and copy data:
;
)
;
)
)
and
)
)
And perform the query:
Remarks:
- I tested PSQL out of the box, without modifying any parameters
- In the PSQL test, using a PRIMARY KEY for both tables, the result is to be compared with the sorted input in the BSTree test.
TODO list
- add the possibility to query by a list of target
- make a simple test with PSQL
- replace memory map by pread/pwrite? (see e.g. positioned-io or scroll)
- remove the code which is now obsolete (e.g.
get
overwritten byget exact visitor
) - add much more tests
- add benchmarks
- add CI
- try to reduce the code redundancy (particularly in
SubTreeW
andSubTreeR
) - add support for NULL values (storing them separately, out of the tree structure)
- in the meanwhile, we can use a pre-defined value coding null
- perform tests with SQLx and PostgreSQL to have a reference time (would be nice if we are at least as fast)
Acknowledgements
If you use this code and work in a scientific public domain (especially astronomy), please acknowledge its usage and the CDS who developed it. It may help us in promoting our work to our financiers.
Warning
If the compilation fails with a message like
, try
|
If the result looks like
)
it means that your machine was out of memory.
License
Like most projects in Rust, this project is licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.