Unordered key-value store with all data held in memory but also persisted to disk for durability.
The store is designed as a building block for (distributed) systems that need a high-throughput, ultra-low-latency, unordered key-value store with persistence guarantees, and that can shard their data so that each shard always fits into the RAM of one machine.
§Design goals
- Lightweight and with few dependencies
- Support concurrent reads and writes with minimal locking times
- Tunable write throughput / persistence guarantee trade-off
- Maximize block device throughput and exploit I/O parallelism if supported by OS and hardware
- Amortized memory and disk usage is O(number of keys), <10% overhead over payload size (e.g. no spikes during snapshotting)
- Support for both fixed-size and variable-size keys and values
A good mental model is “Hashmap that keeps its contents between program runs”. If more advanced database features are required, RocksDB or SQLite are a better choice.
§Key and value format
Both the key and the value side of the store operate exclusively on byte sequences, converted from high-level types via the crate::Serializable and crate::Deserializable traits. This avoids deviation between the in-memory representation of objects and their serialized form, which could otherwise lead to subtle bugs. Because the store operates on serialized data only, neither Send and Sync nor the hash map traits Hash and Eq are technically required for the key and value types.
All fixed-size integer types, Vec<u8> and String are supported out of the box as both keys and values. On the value side, prost::Message types are also directly supported using the xxx_proto family of methods.
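The following is a minimal sketch of why byte-only storage removes the need for Hash and Eq on user types. The traits ToBytes and FromBytes, the Temperature type, and ByteMap are hypothetical stand-ins invented for illustration; the real crate::Serializable and crate::Deserializable signatures may differ.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the crate's Serializable / Deserializable traits.
// The real trait signatures are not shown in this documentation and may differ.
trait ToBytes {
    fn to_bytes(&self) -> Vec<u8>;
}

trait FromBytes: Sized {
    fn from_bytes(bytes: &[u8]) -> Option<Self>;
}

// f64 implements neither Eq nor Hash, so this type could not be used
// directly as a std HashMap key.
#[derive(Debug, PartialEq)]
struct Temperature(f64);

impl ToBytes for Temperature {
    fn to_bytes(&self) -> Vec<u8> {
        self.0.to_le_bytes().to_vec()
    }
}

impl FromBytes for Temperature {
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != 8 {
            return None;
        }
        let mut raw = [0u8; 8];
        raw.copy_from_slice(bytes);
        Some(Temperature(f64::from_le_bytes(raw)))
    }
}

// The map itself only ever hashes and compares byte vectors.
struct ByteMap {
    inner: HashMap<Vec<u8>, Vec<u8>>,
}

impl ByteMap {
    fn new() -> Self {
        ByteMap { inner: HashMap::new() }
    }

    fn insert<K: ToBytes, V: ToBytes>(&mut self, key: &K, value: &V) {
        self.inner.insert(key.to_bytes(), value.to_bytes());
    }

    fn get<K: ToBytes, V: FromBytes>(&self, key: &K) -> Option<V> {
        self.inner
            .get(&key.to_bytes())
            .and_then(|bytes| V::from_bytes(bytes))
    }
}
```

In this sketch, two keys are equal exactly when their serialized bytes are equal, so for example 0.0 and -0.0 would be distinct keys even though they compare equal as f64 values.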
§Implementation notes
Persistence is implemented via a write-ahead log that is periodically compacted and
replaced by full snapshots. All files are stored in a single folder. Individual snapshots
can be internally sharded to ease parallel processing and keep file size reasonable.
See the snapshot_set module for details on snapshot format.
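The interplay between the write-ahead log and snapshots can be sketched in memory as follows. This is an illustration of the general technique only: the LogEntry variants, the Persisted struct, and its methods are assumptions made for this example. The real store writes files, may shard snapshots, and its record types are not described here.

```rust
use std::collections::HashMap;

// Hypothetical log entry kinds; the store's actual record types may differ.
#[derive(Clone, Debug, PartialEq)]
enum LogEntry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

// In-memory model of what would live on disk: one full snapshot plus
// the log entries written since that snapshot was taken.
#[derive(Default)]
struct Persisted {
    snapshot: HashMap<Vec<u8>, Vec<u8>>,
    wal: Vec<LogEntry>,
}

fn apply(state: &mut HashMap<Vec<u8>, Vec<u8>>, entry: &LogEntry) {
    match entry {
        LogEntry::Put(k, v) => {
            state.insert(k.clone(), v.clone());
        }
        LogEntry::Delete(k) => {
            state.remove(k);
        }
    }
}

impl Persisted {
    // Every mutation is appended to the log.
    fn append(&mut self, entry: LogEntry) {
        self.wal.push(entry);
    }

    // Recovery: start from the last snapshot and replay the log in order.
    fn recover(&self) -> HashMap<Vec<u8>, Vec<u8>> {
        let mut state = self.snapshot.clone();
        for entry in &self.wal {
            apply(&mut state, entry);
        }
        state
    }

    // Compaction: fold the log into a new full snapshot, then drop the log.
    fn compact(&mut self) {
        self.snapshot = self.recover();
        self.wal.clear();
    }
}
```

The property the sketch demonstrates is that compaction does not change the recovered state; it only bounds how much log has to be replayed on startup.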
The on-disk format is a series of records. Each record is a protobuf message prefixed by its varint-encoded length.
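As an illustration of this framing, the sketch below writes and reads length-prefixed records using the protobuf base-128 varint encoding. The payloads are treated as opaque bytes because the actual protobuf message types are not described here, and the function names are chosen for this example rather than taken from the crate.

```rust
// Encode `value` as a protobuf-style base-128 varint:
// 7 payload bits per byte, least significant group first,
// high bit set on every byte except the last.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value & 0x7F) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

// Decode a varint from the start of `input`.
// Returns the value and the number of bytes consumed, or None if the
// input ends early or the varint is longer than 10 bytes.
fn decode_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if (byte & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

// Append one record: varint length followed by the payload bytes.
fn write_record(payload: &[u8], out: &mut Vec<u8>) {
    encode_varint(payload.len() as u64, out);
    out.extend_from_slice(payload);
}

// Split a buffer back into its records.
// Returns None if a length prefix or payload is truncated.
fn read_records(mut input: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut records = Vec::new();
    while !input.is_empty() {
        let (len, used) = decode_varint(input)?;
        let len = len as usize;
        let rest = &input[used..];
        if rest.len() < len {
            return None;
        }
        records.push(rest[..len].to_vec());
        input = &rest[len..];
    }
    Some(records)
}
```

A payload of 300 bytes, for instance, gets the two-byte prefix 0xAC 0x02, while payloads shorter than 128 bytes need only a single prefix byte. A reader that hits a truncated trailing record can detect it, which is one reason this framing suits append-only logs.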
§Performance notes
If performance or throughput is a concern, you must benchmark and tune store configuration for the exact hardware and OS you are targeting.
Defaults in Config are ok as a starting point and were derived as follows:
- Linux sees a 2-3x improvement in write throughput when using positioned writes (enabled), but the same setting has slightly negative effects on Windows (disabled).
- No OS seems to benefit from sharding the write-ahead log (default is 1).
- Target parallelism for snapshot reads/writes is limited by I/O controller concurrency, which varies by device type (default is 8, which should suit most modern SSDs).
- The number of memory buckets is never a huge factor; as a rule of thumb it should be above the number of simultaneous readers (default is 32).
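To make the memory bucket setting more concrete, here is a sketch of one common way to bucket an in-memory map: each bucket is an independently locked hash map, and a hash of the key bytes selects the bucket. How the store actually implements its buckets is not described here, so BucketedMap, its methods, and the choice of RwLock and DefaultHasher are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

// Hypothetical bucketed map: one lock per bucket instead of one global lock.
struct BucketedMap {
    buckets: Vec<RwLock<HashMap<Vec<u8>, Vec<u8>>>>,
}

impl BucketedMap {
    fn new(bucket_count: usize) -> Self {
        assert!(bucket_count > 0, "at least one bucket is required");
        let buckets = (0..bucket_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        BucketedMap { buckets }
    }

    // Pick a bucket from a hash of the key bytes.
    fn bucket_index(&self, key: &[u8]) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.buckets.len() as u64) as usize
    }

    // Only the bucket that owns the key is write-locked.
    fn insert(&self, key: &[u8], value: &[u8]) {
        let idx = self.bucket_index(key);
        let mut bucket = self.buckets[idx].write().unwrap();
        bucket.insert(key.to_vec(), value.to_vec());
    }

    // Only the bucket that owns the key is read-locked.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let idx = self.bucket_index(key);
        let bucket = self.buckets[idx].read().unwrap();
        bucket.get(key).cloned()
    }
}
```

Under these assumptions, a larger bucket count makes it less likely that two operations on different keys contend for the same lock, at the cost of a little per-bucket overhead.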
Modules§
Structs§
Enums§
Traits§
- Deserializable: Trait for deserializing a type from a byte slice.
- Serializable: Trait for serializing a type to a byte slice or a fixed size byte array.