Expand description
A write-once key-value store optimized for low-latency reads.
Archive
is a key-value store designed for workloads where all data is written only once and is
uniquely associated with both an index
and a key
.
Data is stored in Journal
(an append-only log) and the location of written data is stored in-memory
by both index and key (truncated representation using a caller-provided Translator
) to
enable single-read lookups for both query patterns over all archived data.
Notably, Archive
does not make use of compaction nor on-disk indexes (and thus has no read nor
write amplification during normal operation).
§Format
Archive
stores data in the following format:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 |14 |15 |16 | ... |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| Index(u64) | Key(Fixed Size) | C(u32) | Data |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
C = CRC32(Key)
To ensure keys fetched using Journal::get_prefix
are correctly read, the index and key are
checksummed within a Journal
entry (although the entire entry is also checksummed by Journal
).
§Uniqueness
Archive
assumes all stored indexes and keys are unique. If the same key is associated with multiple
indices
, there is no guarantee which value will be returned. If the key is written to an existing index
,
Archive
will return an error.
§Conflicts
Because a truncated representation of a key is only ever stored in memory, it is possible (and expected)
that two keys will eventually be represented by the same truncated key. To handle this case, Archive
must check the persisted form of all conflicting keys to ensure data from the correct key is returned.
To support efficient checks, Archive
keeps a linked list of all keys with the same truncated prefix:
struct Record {
index: u64,
next: Option<Box<Record>>,
}
To avoid random memory reads in the common case, the in-memory index directly stores the first item in the linked list instead of a pointer to the first item.
index
is the key to the map used to serve lookups by index
that stores the location of data in a given
Blob
(selected by section = index & section_mask
to minimize the number of open Journals
):
struct Location {
offset: u32,
len: u32,
}
If the Translator
provided by the caller does not uniformly distribute keys across the key space or
uses a truncated representation that means keys on average have many conflicts, performance will degrade.
§Memory Overhead
Archive
uses two maps to enable lookups by both index and key. The memory used to track each index
item is 8 + 4 + 4
(where 8
is the index, 4
is the offset, and 4
is the length). The memory used to track
each key item is ~truncated(key).len() + 16
bytes (where 16
is the size of the Record
struct).
This means that an Archive
employing a Translator
that uses the first 8
bytes of a key will use ~40
bytes
to index each key.
§Sync
Archive
flushes writes in a given section
(computed by index & section_mask
) to Storage
after
pending_writes
. If the caller requires durability on a particular write, they can call sync
.
§Pruning
Archive
supports pruning up to a minimum index
using the prune
method. After prune
is called
on a section
, all interaction with a section
less than the pruned section
will return an error.
§Lazy Index Cleanup
Instead of performing a full iteration of the in-memory index, storing an additional in-memory index
per section
, or replaying a section
of Journal
, Archive
lazily cleans up the in-memory index
after pruning. When a new key is stored that overlaps (same truncated value) with a pruned key, the
pruned key is removed from the in-memory index.
§Single Operation Reads
To enable single operation reads (i.e. reading all of an item in a single call to Blob
), Archive
caches the length of each item in its in-memory index. While it increases the footprint per key stored, the
benefit of only ever performing a single operation to read a key (when there are no conflicts) is worth the
tradeoff.
§Compression
Archive
supports compressing data before storing it on disk. This can be enabled by setting the compression
field in the Config
struct to a valid zstd
compression level. This setting can be changed between initializations
of Archive
, however, it must remain populated if any data was written with compression enabled.
§Querying for Gaps
Archive
tracks gaps in the index space to enable the caller to efficiently fetch unknown keys using next_gap
.
This is a very common pattern when syncing blocks in a blockchain.
§Example
use commonware_runtime::{Spawner, Runner, deterministic::Executor};
use commonware_storage::archive::{Archive, Config, translator::FourCap};
use commonware_storage::journal::{Error, variable::{Config as JConfig, Journal}};
use prometheus_client::registry::Registry;
use std::sync::{Arc, Mutex};
let (executor, context, _) = Executor::default();
executor.start(async move {
// Create a journal
let cfg = JConfig {
registry: Arc::new(Mutex::new(Registry::default())),
partition: "partition".to_string()
};
let journal = Journal::init(context, cfg).await.unwrap();
// Create an archive
let cfg = Config {
registry: Arc::new(Mutex::new(Registry::default())),
key_len: 8,
translator: FourCap,
section_mask: 0xffff_ffff_ffff_0000u64,
pending_writes: 10,
replay_concurrency: 4,
compression: Some(3),
};
let mut archive = Archive::init(journal, cfg).await.unwrap();
// Put a key
archive.put(1, b"test-key", "data".into()).await.unwrap();
// Close the archive (also closes the journal)
archive.close().await.unwrap();
});
Modules§
Structs§
Enums§
- Error
- Errors that can occur when interacting with the archive.
- Identifier
- Subject of a
get
orhas
operation.
Traits§
- Translator
- Translate keys into an internal representation used in
Archive
’s in-memory index.