clepsydra 0.2.0

small distributed database protocol
Documentation
# clepsydra

## Overview

This is a work-in-progress implementation of a core protocol for a minimalist
distributed database. It strives to be as small and simple as possible while
attempting to provide relatively challenging features:

  - Strict Serializability

  - Online Reconfiguration

  - Fault tolerance

  - High throughput

The implementation is based on a simplified version of the "Ocean Vista"
(OV) protocol, and uses its terminology wherever possible. OV combines
replication, transaction commitment and concurrency control into a single
protocol.

### Summary

The short version of the protocol is:

  - Transactions are represented as deterministic thunks over snapshots.

  - Each transaction is assigned a globally-unique timestamp.

  - Transactions are separated into two phases: S-phase and E-phase.

  - S-phase (storage) consists of coordination-free "blind quorum-writes"
    replicating the thunks into their MVCC order on each replica.

  - A watermark tracking minimum transaction timestamps-being-written is
    gossiped between peers, increasing as quorum-writes complete.

  - A transaction only enters E-phase after the watermark advances past it.

  - E-phase (evaluation) quorum-reads and evaluates thunks from consistent
    snapshots below the watermark, lazily resolving any earlier thunks.
    Everything below the watermark is coordination-free and deterministic.

### Caveats

Nothing's perfect, and this crate is anything but:

 - This crate is very incomplete and does not work yet. Don't use it for
   anything other than experiments and toys. Recovery, reconfiguration,
   timeouts and nontrivial fault tolerance paths _definitely_ don't work.

 - It also (somewhat recklessly) attempts to combine OV's reconfiguration
   and gossip protocols into an instance of the [concorde] reconfigurable
   lattice agreement protocol. This might not even be _theoretically_ safe.

 - It is much more minimal than the full OV protocol: there's no support
   for sharding, nor the two-level peer-vs-datacenter locality organization.
   This crate treats its whole peer group as a single symmetric shard.

 - As a result, performance won't be "webscale" or anything. It will scale
   vertically if you throw cores at it, but no better, and its latency will
   always have speed-of-light WAN RTT factors in it. It's distributed for
   fault tolerance, not horizontal scaling.

 - As with OV, this crate does require partial clock synchronization. It
   doesn't need to be very tight: clock drift only causes increased
   latency as the watermarks progress as the minimum of all times; it
   doesn't affect correctness. Normal weak-NTP-level sync should be ok.

 - As with OV, Calvin, and all deterministic databases: your txns have to be
   deterministic and must have deterministic _read and write sets_. If they
   cannot have their read and write sets statically computed (eg. if they
   rely on the data to decide read and write set) you have to build slightly
   awkward multi-phase txns. The term in the literature is "reconnaisance
   queries".

### Reference

Hua Fan and Wojciech Golab. Ocean Vista: Gossip-Based Visibility Control for
Speedy Geo-Distributed Transactions. PVLDB, 12(11): 1471-1484, 2019.

DOI: <https://doi.org/10.14778/3342263.3342627>

<http://www.vldb.org/pvldb/vol12/p1471-fan.pdf>

### Name

Wikipedia:

> A water clock or clepsydra (Greek κλεψύδρα from κλέπτειν kleptein, 'to
> steal'; ὕδωρ hydor, 'water') is any timepiece by which time is measured by
> the regulated flow of liquid into (inflow type) or out from (outflow type)
> a vessel, and where the amount is then measured.


License: MIT OR Apache-2.0