Benchmarking Infrastructure
The goal of the benchmarking infrastructure is to make performance testing and development easier by providing a "one click" runner for benchmarks with machine readable output.
Usage
To get started, run
which will print to stdout the following JSON schema:
This is a skeleton of the input used to run benchmarks.
search_directories: A list of directories that can be searched for input files.output_directory: A single output directory where index files may be saved to, and where the benchmark tool will look for any loaded indices that aren't specified by absolute paths.jobs: A list of benchmark-compatible inputs. Benchmarking will run each job sequentially, and write the outputs to a file.
jobs should contain objects that look like the following:
In the case of loading an already constructed index rather than building, the "source" field should look like:
Finding Inputs
Registered inputs are queried using
which will list something like
Available input kinds are listed below:
async-index-build
async-index-build-pq
To obtain the JSON schema for an input, add its name to the query like
which will generate something like
The above can be placed in the jobs array of the skeleton file.
Any number of inputs can be used.
NOTE:: The contents of each JSON file may (and in some cases, must) be modified. In particular, files paths such as
"data","queries", and"groundtruth"must be edited and point to valid.binfiles or the correct type. These paths can be kept as relative paths, benchmarking will look for relative paths among thesearch_directoriesin the input skeleton.
NOTE:: Target recall is a more advanced feature than
search_l. If it is defined,search_ldoes not need to be, but both are compatible together. This feature works by taking a sample of of the query set and using it to determine search_l prior to running the main query set. This is a way of performing automating tuning for a workload. The target is the recall target you wish to achieve. The percentile is the hops percnetile to achieve the target recall i.e. 0.95 indicates 95% of the queries in the sampled set will be above the recall target. max_serach_l is the maximum time we will serach to find our tuned recall target. This value should be relatively large to prevent failure. If you notice that you your tuned search_l is close to max_search_l it is advised to run again with a larger value. Finally, calibration_size is the number of qureies that are sampled to calculate recall values during the tuning process. Note that these will be reused for benchmarking later.
Finding Benchmarks
Registered benchmarks are queries using the following.
Example output is shown below:
Registered Benchmarks:
async-full-precision-f32: tag: "async-index-build", float32
async-full-precision-f16: tag: "async-index-build", float16
async-pq-f32: tag: "async-index-build-pq", float32
The keyword after "tag" corresponds to the type of input that the benchmark accepts.
Running Benchmarks
Benchmarks are run with
A benchmark run happens in several phases. First, the input file is parsed and simple data invariants are checked such as matching with valid input types, verifying the numeric range of some integers, and more. After successful deserialization, more pre-flight checks are conducted. This consists of:
-
Checking that all input files referenced exist on the file system as files. Input file paths that aren't absolute paths will also be searched for among the list of search directories in order. If any file cannot be resolved, an error will be printed and the process aborted.
-
Any additional data invariants that cannot be checked at deserialization time will also be checked.
-
Matching inputs to benchmarks happens next. To help with compile times, we only compile a subset of the supported data types and compression schemes offered by DiskANN. This means that each registered benchmark may only accept a subset of values for an input. Backend validation makes sure that each input can be matched with a benchmark and if a match cannot be found, we attempt to provide a list of close matches.
Once all checks have succeeded, we begin running benchmarks. Benchmarks are executed sequentially and store their results in an arbitrary JSON format. As each benchmark completes, all results gathered so far will be saved to the specified output file. Note that long-running benchmarks can also opt-in to incrementally saving results while the benchmark is running. This incremental saving allows benchmarks to be interrupted without data loss.
In addition to the machine-readable JSON output files, a (hopefully) helpful summary of the
results will be printed to stdout.
Streaming Runs
Running the benchmark on a streaming workload is similar to other registered benchmarks,
relying on the file formats and streaming runbooks of big-ann-benchmarks
First, set up the runbook and ground truth for the desired workload. Refer to the README in
big-ann-benchmarks/neurips23 and the runbooks in big-ann-benchmarks/neurips23/streaming.
Benchmarks are run with
Note in the example json that the benchmark is registered under async-dynamic-index-run,
instead of async-index-build etc..
A streaming run happens in several phases. First, the input file is parsed and data is checked for its validity. The check consists of
- All input files referenced can be found in the file system.
- The ground truth files required by the search stages in the runbook exist in
gt_directory, which will be searched undersearch_directories. For each search stage x (1-indexed), the gt directory should contain exactly onestep{x}.gt{k}.
The input file will then be matched to the proper dispatcher, similar to the static case of the
benchmark. At the end of the benchmark run, structured results will be saved to output-file
and a summary of the statistics will be pretty-printed to stdout.
The streaming benchmark implements the user layer of the index. Specifically, it tracks the tags
of vectors (ExternalId in the rust codebase) and matches agains the slots (InternalId in the
rust codebase), looking up correct vectors in the raw data by its sequential id for Insert and
Replace operations. If the index will run out of its allocated number of slots, the streaming
benchmark calls drop_deleted_neighbors (with only_orphans currently set to false) across all update
threads, then calls release on the Delete trait of the DataProvider to release the slots. On
Search operations, the streaming benchmark takes care of translating the slots that the index returns
to tags stored in the ground truth. These user logic are guarded by invariant checking in the benchmark.
This is designed to be compatible with the fact that ExternalId and InternalId are the same in the
barebone rust index and is separately handled by its users at the time when the streaming benchmark is
added. See benchmark/src/utils/streaming.rs for details. The integration tests for this
can be run by cargo test -p benchmark streaming.
Adding New Benchmarks
The benchmarking infrastructure uses a loosely-coupled method for dispatching benchmarks
broken into the front end (inputs) and the back end (benchmarks). Inputs can be any serde
compatible type. Input registration happens by registering types implementing
diskann_benchmark_runner::Input with the diskann_benchmark_runner::registry::Inputs
registry. This is done in inputs::register_inputs. At run time, the front end will discover
benchmarks in the input JSON file and use the tag string in the "contents" field to select
the correct input deserializer.
Benchmarks need to be registered with diskann_benchmark_runner::registry::Benchmarks by
registering themselves in benchmark::backend::load(). To be discoverable by the front-end
input, a DispatchRule from the dispatcher crate (via
diskann_benchmark_runner::dispatcher) needs to be defined matching a back-end type to
diskann_benchmark_runner::Any. The dynamic type in the Any will come from one of the
registered diskann_benchmark_runner::Inputs.
The rule can be as simple as checking a down cast or as complicated such as lifting run-time information to the type/compile time realm, as is done for the async index tests for the data type.
Once this is complete, the benchmark will be reachable by its input and can live peacefully with the other benchmarks.
Example
Defining a new Input Type
Here, we will walk through adding a very simple "compute_groundtruth" set of benchmarks.
First, define an input type in src/benchmark/inputs.
This may look something like the following.
use ;
// We derive from `Serialize` and `Deserialize` to be JSON compatible.
pub
We need to implement a few traits related to this input type:
-
diskann_benchmark_runner::Input: A type-name for this input that is used to identify it for deserialization and benchmark matching. To make this easier,benchmarkdefinesbenchmark::inputs::Inputthat can be used to express type level implementation (shown below) -
CheckDeserialization: This trait performs post-deserialization invariant checking. In the context of theComputeGroundTruthtype, we use this to check that the input files are valid.
Front End Registration
With the new input type ready, we can register it with the
diskann_benchmark_runner::registry::Inputs registry. This can be as simple as:
Note that registration can fail if multiple inputs have the sametag.
When these steps are completed, our new input will be available using
and
will display an example JSON input for our type.
Back End Benchmarks
So far, we have created a new input type and registered it with the front end, but there are
not any benchmarks that use this type. To implement new benchmarks, we need register them
with the diskann_benchmark_runner::registry::Benchmarks returned from
benchmark::backend::load(). The simplest thing we can do is something like this:
use ;
// Allows the dispatcher to try to match a value with type `CentralDispatch` to the receiver
// type `ComputeGroundTruth`.
What happening here is that the implementation of DispatchRule provides a valid conversion
from &Any to &ComputeGroundTruth, which is only applicable if runtime value in the Any
is the ComputeGroundTruth struct. If this happens, the benchmarking infrastructure will
call the closure passed to benchmarks.register() after calling DispatchRule::convert()
on the Any. This mechanism allows multiple backend benchmarks to exist and pull input from
the deserialized inputs present in the current run.
There are three more things to note about closures (benchmarks) that get registered with the dispatcher:
-
The argument
checkpoint: diskann_benchmark_runner::Checkpoint<'_>allows long-running benchmarks to periodically save incremental results to file by calling the.checkpointmethod. Benchmark results are anything that implementsserde::Serialize. This function creates a new snapshot every time it is invoked, so benchmarks to not need to worry about redundant data. -
The argument
output: &mut dyn diskann_benchmark_runner::Outputis a dynamic type where all output should be written too. Additionally, it provides aProgressDrawTargetfor use with indicatif progress bars. This supports output redirection for integration tests and piping to files. -
The return type from the closure should be
anyhow::Result<serde_json::Value>. This contains all data collected from the benchmark and will be collected and saved along with all other runs. Benchmark implementations do not need to worry about saving their input as well as this is automatically handled by the benchmarking infrastructure.
With the benchmark registered, that is all that is needed.
Expanding DispatchRule
The functionality offered by DispatchRule is much more powerful than what was described in
the simple example. In particular, careful implementation will allow your benchmarks to be
more easily discoverable from the command-line and can also assist in debugging by providing
"near misses".
Fine Grained Matching
The method DispatchRule::try_match returns both a successful MatchScore and an
unsuccessful FailureScore. The dispatcher will only invoke methods where all arguments
return successful MatchScores. Additionally, it will call the method with the "best"
overall score, determined by lexicographic ordering. So, you can make some registered
benchmarks "better fits" for inputs returning a better match score.
When the dispatcher cannot find any matching method for an input, it begins a process of
finding the "nearest misses" by inspecting and ranking methods based on their FailureScore.
Benchmarks can opt-in to this process by returning meaning FailureScores when an input is
close, but not quite right.
Benchmark Description and Failure Description
The trait DispatchRule has another method:
;
This is used for self-documenting the dispatch rule: If from is None, then
implementations should write to the formatter f a description of the benchmark and what
inputs it can work with. If from is Some, then implementation should write the reason
for a successful or unsuccessful match with the enclosed value. Doing these two steps make
error reporting in the event of a dispatch fail much easier for the user to understand and fix.
Refer to implementations within the benchmarking framework for what some of this may look like.