Crate persistent_kv

Unordered key-value store with all data held in memory but also persisted to disk for durability.

The store is designed as a building block for (distributed) systems that require a high-throughput, ultra-low-latency, unordered key-value store with persistence guarantees, and that can shard their data so that each shard always fits into the RAM of one machine.

§Design goals

  • Lightweight and with few dependencies
  • Support concurrent reads and writes with minimal lock hold times
  • Tunable write throughput / persistence guarantee trade-off
  • Maximize block device throughput and exploit I/O parallelism if supported by OS and hardware
  • Amortized memory and disk usage is O(number of keys), with less than 10% overhead over payload size and no usage spikes (e.g. during snapshotting)
  • Support for both fixed-size and variable-size keys and values

A good mental model is “a HashMap that keeps its contents between program runs”. If more advanced database features are required, RocksDB or SQLite is a better choice.

§Key and value format

Both the key and the value side of the store operate exclusively on byte sequences, converted from high-level types via the crate::Serializable and crate::Deserializable traits. This avoids any divergence between the in-memory representation of objects and their serialized form, which could otherwise lead to subtle bugs. Because the store operates on serialized data only, neither Send nor Sync, nor the hash map traits Hash and Eq, are technically required for the key and value types.

All fixed-size integer types, Vec<u8> and String are supported out of the box as both keys and values. On the value side, prost::Message types are also directly supported using the xxx_proto family of methods.
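The idea of hashing and comparing only serialized bytes can be illustrated with a minimal sketch. The traits `ToBytes` and `FromBytes` and the `ByteStore` type below are hypothetical stand-ins invented for this example; they are not the signatures of crate::Serializable or crate::Deserializable, and the little-endian integer encoding is an arbitrary assumption.

```rust
use std::collections::HashMap;

/// Hypothetical "to bytes" trait (an assumption, not the crate's API).
trait ToBytes {
    fn to_bytes(&self) -> Vec<u8>;
}

/// Hypothetical "from bytes" trait (an assumption, not the crate's API).
trait FromBytes: Sized {
    fn from_bytes(bytes: &[u8]) -> Option<Self>;
}

impl ToBytes for u64 {
    fn to_bytes(&self) -> Vec<u8> {
        // Little-endian is an arbitrary choice for this sketch.
        self.to_le_bytes().to_vec()
    }
}

impl FromBytes for u64 {
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != 8 {
            return None;
        }
        let mut arr = [0u8; 8];
        arr.copy_from_slice(bytes);
        Some(u64::from_le_bytes(arr))
    }
}

impl ToBytes for String {
    fn to_bytes(&self) -> Vec<u8> {
        self.as_bytes().to_vec()
    }
}

impl FromBytes for String {
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        String::from_utf8(bytes.to_vec()).ok()
    }
}

/// Toy store: only the serialized bytes are ever hashed or compared,
/// so K and V need no Hash, Eq, Send or Sync bounds of their own.
struct ByteStore {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl ByteStore {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    fn set<K: ToBytes, V: ToBytes>(&mut self, key: &K, value: &V) {
        self.map.insert(key.to_bytes(), value.to_bytes());
    }

    fn get<K: ToBytes, V: FromBytes>(&self, key: &K) -> Option<V> {
        self.map.get(&key.to_bytes()).and_then(|b| V::from_bytes(b))
    }
}
```

In this sketch, two keys are equal exactly when their serialized forms are equal, which is the property the paragraph above relies on.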

§Implementation notes

Persistence is implemented via a write-ahead log that is periodically compacted and replaced by full snapshots. All files are stored in a single folder. Individual snapshots can be internally sharded to ease parallel processing and keep file size reasonable. See the snapshot_set module for details on snapshot format.
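The interplay between the write-ahead log and snapshots can be modelled in a few lines. This is a toy, in-memory sketch only: the `LogEntry` variants, the `Snapshot` alias and the `recover` and `compact` functions are hypothetical names chosen for illustration, and the crate's real file handling and sharding are not shown.

```rust
use std::collections::HashMap;

/// One write-ahead log entry in this toy model (hypothetical shape).
#[derive(Clone, Debug, PartialEq)]
enum LogEntry {
    Set(Vec<u8>, Vec<u8>),
    Unset(Vec<u8>),
}

/// A full snapshot is modelled as a plain map of serialized keys to values.
type Snapshot = HashMap<Vec<u8>, Vec<u8>>;

/// Rebuild the in-memory state: start from the latest snapshot,
/// then replay the log entries in the order they were written.
fn recover(snapshot: &Snapshot, log: &[LogEntry]) -> Snapshot {
    let mut state = snapshot.clone();
    for entry in log {
        match entry {
            LogEntry::Set(k, v) => {
                state.insert(k.clone(), v.clone());
            }
            LogEntry::Unset(k) => {
                state.remove(k);
            }
        }
    }
    state
}

/// Compaction: fold the log into a new full snapshot. Once the new
/// snapshot exists, the folded log entries are no longer needed.
fn compact(snapshot: &Snapshot, log: &mut Vec<LogEntry>) -> Snapshot {
    let next = recover(snapshot, log.as_slice());
    log.clear();
    next
}
```

Recovering from the compacted snapshot with an empty log yields the same state as replaying the full log over the old snapshot, which is what makes replacing the log by a snapshot safe in this model.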

The on-disk format is a series of records. Each record is a protobuf message prefixed by its varint-encoded length.
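The length-prefixed framing can be sketched with the standard library alone. The helper names below (`write_varint`, `read_varint`, `append_record`, `read_records`) are assumptions made for this example, and opaque byte payloads stand in for the encoded protobuf messages; the crate's actual reader and writer may differ.

```rust
/// Append `value` as a base-128 varint, the encoding protobuf uses for lengths:
/// 7 payload bits per byte, least significant group first, high bit = "more follows".
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value as u8) & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Decode a varint from the start of `buf`.
/// Returns the value and the number of bytes consumed, or None if truncated or overlong.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A u64 needs at most 10 varint bytes.
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if (byte & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Frame one payload as varint(length) followed by the payload bytes.
fn append_record(payload: &[u8], out: &mut Vec<u8>) {
    write_varint(payload.len() as u64, out);
    out.extend_from_slice(payload);
}

/// Split a buffer back into payloads; stops at the first incomplete record.
fn read_records(mut buf: &[u8]) -> Vec<&[u8]> {
    let mut records = Vec::new();
    while let Some((len, header)) = read_varint(buf) {
        let len = len as usize;
        if buf.len() - header < len {
            break; // trailing record was cut off, e.g. by a crash mid-write
        }
        records.push(&buf[header..header + len]);
        buf = &buf[header + len..];
    }
    records
}
```

A reader built this way can skip over records without decoding them and can detect a truncated tail, which is useful for an append-only log.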

§Performance notes

If performance or throughput is a concern, you must benchmark and tune the store configuration for the exact hardware and OS you are targeting.

The defaults in Config are a reasonable starting point and were derived as follows:

  1. Linux sees a 2-3x improvement in write throughput when using positioned writes (enabled by default), but the same setting has a slightly negative effect on Windows (disabled by default).

  2. No OS seems to benefit from sharding the write-ahead log (default is 1).

  3. Target parallelism for snapshot reads/writes is limited by I/O controller concurrency, which varies by device type (default is 8, which should suit most modern SSDs).

  4. The number of memory buckets is never a huge factor. As a rule of thumb, it should be above the number of simultaneous readers (default is 32).
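To see why the bucket count relates to the number of simultaneous readers, consider a minimal sketch of a map split into independently locked buckets. `BucketedMap` and its methods are hypothetical names for this example, and the use of one `Mutex` per bucket is an assumption; the crate's actual in-memory layout and locking strategy are not described here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Toy in-memory map split into independently locked buckets.
struct BucketedMap {
    buckets: Vec<Mutex<HashMap<Vec<u8>, Vec<u8>>>>,
}

impl BucketedMap {
    fn new(num_buckets: usize) -> Self {
        assert!(num_buckets > 0);
        let buckets = (0..num_buckets)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { buckets }
    }

    /// Pick a bucket from the hash of the serialized key.
    fn bucket_index(&self, key: &[u8]) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.buckets.len() as u64) as usize
    }

    fn set(&self, key: &[u8], value: &[u8]) {
        let idx = self.bucket_index(key);
        // Only this one bucket is locked; all other buckets stay available.
        let mut bucket = self.buckets[idx].lock().unwrap();
        bucket.insert(key.to_vec(), value.to_vec());
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let idx = self.bucket_index(key);
        let bucket = self.buckets[idx].lock().unwrap();
        bucket.get(key).cloned()
    }
}
```

In this model two callers only contend when their keys hash to the same bucket, so having more buckets than concurrent readers keeps the chance of contention low, while adding far more buckets than that yields little extra benefit.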

Modules§

snapshot_set

Structs§

Config
PersistentKeyValueStore

Enums§

SyncMode

Traits§

Deserializable
Trait for deserializing a type from a byte slice.
Serializable
Trait for serializing a type to a byte slice or a fixed size byte array.

Type Aliases§

Result