This is a work-in-progress implementation of a core protocol for a minimalist distributed database. It strives to be as small and simple as possible while attempting to provide relatively challenging features:
The implementation is based on a simplified version of the "Ocean Vista" (OV) protocol, and uses its terminology wherever possible. OV combines replication, transaction commitment and concurrency control into a single protocol.
The short version of the protocol is:
Transactions are represented as deterministic thunks over snapshots.
Each transaction is assigned a globally-unique timestamp.
Transactions are separated into two phases: S-phase and E-phase.
S-phase (storage) consists of coordination-free "blind quorum-writes" replicating the thunks into their MVCC order on each replica.
A watermark tracking minimum transaction timestamps-being-written is gossiped between peers, increasing as quorum-writes complete.
A transaction only enters E-phase after the watermark advances past it.
E-phase (evaluation) quorum-reads and evaluates thunks from consistent snapshots below the watermark, lazily resolving any earlier thunks. Everything below the watermark is coordination-free and deterministic.
Nothing's perfect, and this crate is anything but:
This crate is very incomplete and does not work yet. Don't use it for anything other than experiments and toys. Recovery, reconfiguration, timeouts and nontrivial fault tolerance paths definitely don't work.
It also (somewhat recklessly) attempts to combine OV's reconfiguration and gossip protocols into an instance of the concorde reconfigurable lattice agreement protocol. This might not even be theoretically safe.
It is much more minimal than the full OV protocol: there's no support for sharding, nor the two-level peer-vs-datacenter locality organization. This crate treats its whole peer group as a single symmetric shard.
As a result, performance won't be "webscale" or anything. It will scale vertically if you throw cores at it, but no better, and its latency will always have speed-of-light WAN RTT factors in it. It's distributed for fault tolerance, not horizontal scaling.
As with OV, this crate does require partial clock synchronization. It doesn't need to be very tight: clock drift only causes increased latency as the watermarks progress as the minimum of all times; it doesn't affect correctness. Normal weak-NTP-level sync should be ok.
As with OV, Calvin, and all deterministic databases: your txns have to be deterministic and must have deterministic read and write sets. If they cannot have their read and write sets statically computed (eg. if they rely on the data to decide read and write set) you have to build slightly awkward multi-phase txns. The term in the literature is "reconnaisance queries".
Hua Fan and Wojciech Golab. Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions. PVLDB, 12(11): 1471-1484, 2019.
A water clock or clepsydra (Greek κλεψύδρα from κλέπτειν kleptein, 'to steal'; ὕδωρ hydor, 'water') is any timepiece by which time is measured by the regulated flow of liquid into (inflow type) or out from (outflow type) a vessel, and where the amount is then measured.
GlobalTimes are the fundamental timekeeping type in the system.
A simple "peer identifier" which should be unique across any present or future configuration of a peer group. A randomly-chosen u64 should suffice.
An Entry is associated with each KeyVer (i.e. a Lang::Key at some GlobalTime) in the Store, and is either an unevaluated expression of type Entry::Delayed (at one of two possible ReplicationTag levels), or an Entry::Aborted tombstone (if replication fails), or an Entry::Settled value carrying a fully-evaluated ExtVal.
Trait to support multiple sorts of clock-source.