This crate provides the Roots type, which is the transactional storage layer for BonsaiDb. It is loosely inspired by Couchstore.
This crate blocks the current thread when accessing the filesystem. If you are looking for an async-ready database, BonsaiDb is our vision of an async-aware database built atop Nebari.
This crate is alpha. While its format is considered stable, there may be bugs that could lead to data loss. Please have a good backup strategy while using this crate.
Examples
Inserting a key-value pair in an on-disk tree with full revision history:
use nebari::{
    tree::{Root, Versioned},
    Config,
};

let database_folder = tempfile::tempdir().unwrap();
let roots = Config::default_for(database_folder.path())
    .open()
    .unwrap();
let tree = roots.tree(Versioned::tree("a-tree")).unwrap();
tree.set("hello", "world").unwrap();
For more examples, check out nebari/examples/.
Features
Nebari exposes multiple levels of functionality. The lowest level functionality is the TreeFile. A TreeFile is a key-value store that uses an append-only file format for its implementation.
Using TreeFiles and a transaction log, Roots enables ACID-compliant, multi-tree transactions.
Each tree supports:
- Key-value storage: Keys can be any arbitrary byte sequence up to 65,535 bytes long. For efficiency, keys should be kept to smaller lengths. Values can be up to 4 gigabytes (2^32 bytes) in size.
- Flexible query options: Fetch records one key at a time, multiple keys at once, or ranges of keys.
- Powerful multi-key operations: Internally, all functions that alter the data in a tree use TreeFile::modify(), which allows operating on one or more keys and performing various operations.
- Pluggable low-level modifications: The Vault trait allows you to bring your own encryption, compression, or other functionality to this format. Each independently-addressable chunk of data that is written to the file passes through the vault.
- Optional full revision history: If you don’t want to lose old revisions of data, you can use a VersionedTreeRoot to store information that allows scanning old revision information. Or, if you want to avoid the extra IO, use the UnversionedTreeRoot, which only stores the information needed to retrieve the latest data in the file.
- ACID-compliance:
  - Atomicity: Every operation on a TreeFile is done atomically. Operation::CompareSwap can be used to perform atomic operations that require evaluating the currently stored value.
  - Consistency: Atomic locking operations are used when publishing a new transaction state. This ensures that readers can never operate on a partially updated state.
  - Isolation: Currently, each tree can only be accessed exclusively within a transaction. This means that if two transactions request the same tree, one will execute and complete before the second is allowed access to the tree. This strategy could be modified in the future to allow for more flexibility.
  - Durability: The append-only file format is designed to only allow reading data that has been fully flushed to disk. Any writes that were interrupted will be truncated from the end of the file.

    Transaction IDs are recorded in the tree headers. When restoring from disk, the transaction IDs are verified with the transaction log. Because of the append-only format, if we encounter a transaction that wasn’t recorded, we can continue scanning the file to recover the previous state. We do this until we find a successfully committed transaction.

    This process is much simpler than most database implementations due to the simple guarantees that append-only formats provide.
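The recovery scan described under Durability can be illustrated with a small, standard-library-only sketch. It is not Nebari's implementation: the `Header` struct, the `latest_committed_header` function, and the use of a `HashSet` to stand in for the transaction log are all assumptions made for illustration. The headers are assumed to be listed from oldest to newest.

```rust
use std::collections::HashSet;

/// A hypothetical, simplified tree header: where it sits in the file and
/// which transaction wrote it. Not Nebari's on-disk layout.
#[derive(Debug, Clone, PartialEq)]
struct Header {
    offset: u64,
    transaction_id: u64,
}

/// Walks the headers from newest to oldest and returns the first one whose
/// transaction ID is present in the set of committed transactions.
/// `committed` stands in for a lookup against the transaction log.
fn latest_committed_header<'a>(
    headers: &'a [Header],
    committed: &HashSet<u64>,
) -> Option<&'a Header> {
    headers
        .iter()
        .rev()
        .find(|header| committed.contains(&header.transaction_id))
}
```

If the newest header belongs to a transaction that never reached the log, the scan skips it and settles on the most recent header that did, which is the "previous state" the text refers to.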
Why use an append-only file format?
@ecton wasn’t a database engineer before starting this project, and depending on your viewpoint may still not be considered a database engineer. Implementing ACID-compliance is not something that should be attempted lightly.
Creating ACID-compliance with append-only formats is much easier to achieve, however, as long as you can guarantee two things:
- When opening a previously existing file, you can identify where the last valid write occurred.
- When writing the file, you do not report that a transaction has succeeded until the file is fully flushed to disk.
The B-Tree implementation in Nebari is designed to offer those exact guarantees.
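To make the two guarantees concrete, here is a minimal sketch of an append-only record file using only the standard library. It is unrelated to Nebari's actual file format: the record layout (4-byte length, 4-byte checksum, payload), the FNV-1a checksum, and the function names `append_record` and `open_and_recover` are all assumptions chosen for brevity.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// FNV-1a, used here only as a small stand-in for a real checksum.
fn checksum(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for byte in bytes {
        hash ^= u32::from(*byte);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Appends one record: little-endian length, checksum, then the payload.
fn append_record(file: &mut File, payload: &[u8]) -> io::Result<()> {
    file.seek(SeekFrom::End(0))?;
    file.write_all(&(payload.len() as u32).to_le_bytes())?;
    file.write_all(&checksum(payload).to_le_bytes())?;
    file.write_all(payload)?;
    // Second guarantee: only report success once the data is flushed to disk.
    file.sync_all()
}

/// Opens (or creates) the file, collects every complete record whose
/// checksum matches, and truncates whatever follows the last valid record.
fn open_and_recover(path: &Path) -> io::Result<(File, Vec<Vec<u8>>)> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(false)
        .open(path)?;
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes)?;

    let mut records = Vec::new();
    let mut valid_len = 0usize;
    while bytes.len() - valid_len >= 8 {
        let h = &bytes[valid_len..valid_len + 8];
        let len = u32::from_le_bytes([h[0], h[1], h[2], h[3]]) as usize;
        let expected = u32::from_le_bytes([h[4], h[5], h[6], h[7]]);
        let start = valid_len + 8;
        let end = match start.checked_add(len) {
            Some(end) if end <= bytes.len() => end,
            _ => break, // The payload was never fully written.
        };
        if checksum(&bytes[start..end]) != expected {
            break; // The payload is corrupt or partially written.
        }
        records.push(bytes[start..end].to_vec());
        valid_len = end;
    }
    // First guarantee: the last valid write ends at `valid_len`.
    file.set_len(valid_len as u64)?;
    file.seek(SeekFrom::End(0))?;
    Ok((file, records))
}
```

In this sketch an interrupted write leaves either a short header, a short payload, or a checksum mismatch at the tail of the file, and all three cases are discarded on the next open.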
The major downside of append-only formats is that deleted data isn’t cleaned up until a maintenance process occurs: compaction. This process rewrites the file’s contents, skipping over entries that are no longer alive. This process can happen without blocking the file from being operated on, but it does introduce IO overhead during the operation.
Nebari provides APIs that perform compaction, but currently delegates scheduling and automation to consumers of this library.
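The idea behind compaction can be shown with an in-memory sketch. This is not Nebari's compaction API: the `Entry` enum and the `compact` function are hypothetical, and a real implementation rewrites a file on disk and not a `Vec`.

```rust
use std::collections::BTreeMap;

/// A hypothetical log entry in an append-only key-value log.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Set { key: String, value: String },
    Delete { key: String },
}

/// Replays the log to find the entries that are still alive, then returns a
/// new log containing exactly one `Set` per live key, in key order.
fn compact(log: &[Entry]) -> Vec<Entry> {
    let mut live: BTreeMap<&str, &str> = BTreeMap::new();
    for entry in log {
        match entry {
            Entry::Set { key, value } => {
                // A later write for the same key replaces the earlier one.
                live.insert(key.as_str(), value.as_str());
            }
            Entry::Delete { key } => {
                live.remove(key.as_str());
            }
        }
    }
    live.into_iter()
        .map(|(key, value)| Entry::Set {
            key: key.to_string(),
            value: value.to_string(),
        })
        .collect()
}
```

Overwritten values and deleted keys take up space in the original log but do not appear in the compacted one, which is where the space savings come from.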
Open-source Licenses
This project, like all projects from Khonsu Labs, is open-source. This repository is available under the MIT License or the Apache License 2.0.
To learn more about contributing, please see CONTRIBUTING.md.
Modules
- IO abstractions for Nebari.
- ACID-compliant transaction log and manager.
- Append-only B-Tree implementation
Structs
- An immutable buffer of bytes that can be cloned, sliced, and read into multiple parts using a single reference to the underlying buffer.
- A configurable cache that operates at the “chunk” level.
- A database configuration used to open a database.
- A shared environment for database operations.
- An error from Nebari as well as an associated backtrace.
- An executing transaction. While this exists, no other transactions can execute across the same trees as this transaction holds.
- A locked transaction tree. This transactional tree is exclusively available for writing and reading to the thread that locks it.
- A multi-tree transactional B-Tree database.
- A thread pool that commits transactions to disk in parallel.
- A tree that is modifiable during a transaction.
- A named collection of keys and values.
- A tree that belongs to an ExecutingTransaction.
Enums
- An error that could come from user code or Nebari.
- An error returned from compare_and_swap().
- An error from Nebari.
Traits
- A provider of encryption for blocks of data.