Module archive

Source
Expand description

A write-once key-value store optimized for low-latency reads.

Archive is a key-value store designed for workloads where all data is written only once and is uniquely associated with both an index and a key.

Data is stored in Journal (an append-only log) and the location of written data is stored in-memory by both index and key (truncated representation using a caller-provided Translator) to enable single-read lookups for both query patterns over all archived data.

Notably, Archive does not make use of compaction nor on-disk indexes (and thus has no read nor write amplification during normal operation).

§Format

Archive stores data in the following format:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 |14 |15 |16 |      ...      |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|          Index(u64)           |  Key(Fixed Size)  |    C(u32)     |     Data      |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

C = CRC32(Key)

To ensure keys fetched using Journal::get_prefix are correctly read, the index and key are checksummed within a Journal entry (although the entire entry is also checksummed by Journal).

§Uniqueness

Archive assumes all stored indexes and keys are unique. If the same key is associated with multiple indices, there is no guarantee which value will be returned. If the key is written to an existing index, Archive will return an error.

§Conflicts

Because a truncated representation of a key is only ever stored in memory, it is possible (and expected) that two keys will eventually be represented by the same truncated key. To handle this case, Archive must check the persisted form of all conflicting keys to ensure data from the correct key is returned. To support efficient checks, Archive keeps a linked list of all keys with the same truncated prefix:

struct Record {
    index: u64,

    next: Option<Box<Record>>,
}

To avoid random memory reads in the common case, the in-memory index directly stores the first item in the linked list instead of a pointer to the first item.

index is the key to the map used to serve lookups by index that stores the location of data in a given Blob (selected by section = index & section_mask to minimize the number of open Journals):

struct Location {
    offset: u32,
    len: u32,
}

If the Translator provided by the caller does not uniformly distribute keys across the key space or uses a truncated representation that means keys on average have many conflicts, performance will degrade.

§Memory Overhead

Archive uses two maps to enable lookups by both index and key. The memory used to track each index item is 8 + 4 + 4 (where 8 is the index, 4 is the offset, and 4 is the length). The memory used to track each key item is ~truncated(key).len() + 16 bytes (where 16 is the size of the Record struct). This means that an Archive employing a Translator that uses the first 8 bytes of a key will use ~40 bytes to index each key.

§Sync

Archive flushes writes in a given section (computed by index & section_mask) to Storage after pending_writes. If the caller requires durability on a particular write, they can call sync.

§Pruning

Archive supports pruning up to a minimum index using the prune method. After prune is called on a section, all interaction with a section less than the pruned section will return an error.

§Lazy Index Cleanup

Instead of performing a full iteration of the in-memory index, storing an additional in-memory index per section, or replaying a section of Journal, Archive lazily cleans up the in-memory index after pruning. When a new key is stored that overlaps (same truncated value) with a pruned key, the pruned key is removed from the in-memory index.

§Single Operation Reads

To enable single operation reads (i.e. reading all of an item in a single call to Blob), Archive caches the length of each item in its in-memory index. While it increases the footprint per key stored, the benefit of only ever performing a single operation to read a key (when there are no conflicts) is worth the tradeoff.

§Compression

Archive supports compressing data before storing it on disk. This can be enabled by setting the compression field in the Config struct to a valid zstd compression level. This setting can be changed between initializations of Archive, however, it must remain populated if any data was written with compression enabled.

§Querying for Gaps

Archive tracks gaps in the index space to enable the caller to efficiently fetch unknown keys using next_gap. This is a very common pattern when syncing blocks in a blockchain.

§Example

use commonware_runtime::{Spawner, Runner, deterministic::Executor};
use commonware_storage::archive::{Archive, Config, translator::FourCap};
use commonware_storage::journal::{Error, variable::{Config as JConfig, Journal}};
use prometheus_client::registry::Registry;
use std::sync::{Arc, Mutex};

let (executor, context, _) = Executor::default();
executor.start(async move {
    // Create a journal
    let cfg = JConfig {
        registry: Arc::new(Mutex::new(Registry::default())),
        partition: "partition".to_string()
    };
    let journal = Journal::init(context, cfg).await.unwrap();

    // Create an archive
    let cfg = Config {
        registry: Arc::new(Mutex::new(Registry::default())),
        key_len: 8,
        translator: FourCap,
        section_mask: 0xffff_ffff_ffff_0000u64,
        pending_writes: 10,
        replay_concurrency: 4,
        compression: Some(3),
    };
    let mut archive = Archive::init(journal, cfg).await.unwrap();

    // Put a key
    archive.put(1, b"test-key", "data".into()).await.unwrap();

    // Close the archive (also closes the journal)
    archive.close().await.unwrap();
});

Modules§

translator

Structs§

Archive
Implementation of Archive storage.
Config
Configuration for Archive storage.

Enums§

Error
Errors that can occur when interacting with the archive.
Identifier
Subject of a get or has operation.

Traits§

Translator
Translate keys into an internal representation used in Archive’s in-memory index.