[−][src]Crate clepsydra
Overview
This is a work-in-progress implementation of a core protocol for a minimalist distributed database. It strives to be as small and simple as possible while attempting to provide relatively challenging features:
-
Strict Serializability
-
Online Reconfiguration
-
Fault tolerance
-
High throughput
The implementation is based on a simplified version of the "Ocean Vista" (OV) protocol, and uses its terminology wherever possible. OV combines replication, transaction commitment and concurrency control into a single protocol.
Summary
The short version of the protocol is:
-
Transactions are represented as deterministic thunks over snapshots.
-
Each transaction is assigned a globally-unique timestamp.
-
Transactions are separated into two phases: S-phase and E-phase.
-
S-phase (storage) consists of coordination-free "blind quorum-writes" replicating the thunks into their MVCC order on each replica.
-
A watermark tracking minimum transaction timestamps-being-written is gossiped between peers, increasing as quorum-writes complete.
-
A transaction only enters E-phase after the watermark advances past it.
-
E-phase (evaluation) quorum-reads and evaluates thunks from consistent snapshots below the watermark, lazily resolving any earlier thunks. Everything below the watermark is coordination-free and deterministic.
Caveats
Nothing's perfect, and this crate is anything but:
-
This crate is very incomplete and does not work yet. Don't use it for anything other than experiments and toys. Recovery, reconfiguration, timeouts and nontrivial fault tolerance paths definitely don't work.
-
It also (somewhat recklessly) attempts to combine OV's reconfiguration and gossip protocols into an instance of the concorde reconfigurable lattice agreement protocol. This might not even be theoretically safe.
-
It is much more minimal than the full OV protocol: there's no support for sharding, nor the two-level peer-vs-datacenter locality organization. This crate treats its whole peer group as a single symmetric shard.
-
As a result, performance won't be "webscale" or anything. It will scale vertically if you throw cores at it, but no better, and its latency will always have speed-of-light WAN RTT factors in it. It's distributed for fault tolerance, not horizontal scaling.
-
As with OV, this crate does require partial clock synchronization. It doesn't need to be very tight: clock drift only causes increased latency as the watermarks progress as the minimum of all times; it doesn't affect correctness. Normal weak-NTP-level sync should be ok.
-
As with OV, Calvin, and all deterministic databases: your txns have to be deterministic and must have deterministic read and write sets. If they cannot have their read and write sets statically computed (eg. if they rely on the data to decide read and write set) you have to build slightly awkward multi-phase txns. The term in the literature is "reconnaisance queries".
Reference
Hua Fan and Wojciech Golab. Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions. PVLDB, 12(11): 1471-1484, 2019.
DOI: https://doi.org/10.14778/3342263.3342627
http://www.vldb.org/pvldb/vol12/p1471-fan.pdf
Name
Wikipedia:
A water clock or clepsydra (Greek κλεψύδρα from κλέπτειν kleptein, 'to steal'; ὕδωρ hydor, 'water') is any timepiece by which time is measured by the regulated flow of liquid into (inflow type) or out from (outflow type) a vessel, and where the amount is then measured.
Structs
Database | Main object that clients instantiate. Encapsulates the remainder of the system. Clients need to provide a Clock and Store, as well as some number of async IO connections to other peers. |
GlobalTime | GlobalTimes are the fundamental timekeeping type in the system. |
KeyVer | A |
PeerID | A simple "peer identifier" which should be unique across any present or future configuration of a peer group. A randomly-chosen u64 should suffice. |
RealClock | An implementation of Clock that calls std::time::SystemTime::now. |
Sdw | A server-specific delayed-evaluation watermark, denoting the minimum of the crate::KeyVers held in the server's crate::Store that remain in state crate::Entry::Delayed (not-yet-evaluated). |
TestClock | An implementation of Clock that holds a shared AtomicU64 representing
the current millisecond count since the epoch, that increments on each
call to |
Enums
Entry | An Entry is associated with each KeyVer (i.e. a Lang::Key at some GlobalTime) in the Store, and is either an unevaluated expression of type Entry::Delayed (at one of two possible ReplicationTag levels), or an Entry::Aborted tombstone (if replication fails), or an Entry::Settled value carrying a fully-evaluated ExtVal. |
Error | |
ExtVal | An |
ReplicationTag | Designates which phase of quorum-writing a given crate::Entry was stored in. Must be stored and retieved along with entries in the crate::Store. |
Traits
Clock | Trait to support multiple sorts of clock-source. |
Lang | A |
Store | A |