1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
// Copyright 2020 Graydon Hoare <graydon@pobox.com>
// Licensed under the MIT and Apache-2.0 licenses.

//! # Overview
//!
//! This is a work-in-progress implementation of a core protocol for a minimalist
//! distributed database. It strives to be as small and simple as possible while
//! attempting to provide relatively challenging features:
//!
//!   - Strict Serializability
//!
//!   - Online Reconfiguration
//!
//!   - Fault tolerance
//!
//!   - High throughput
//!
//! The implementation is based on a simplified version of the "Ocean Vista"
//! (OV) protocol, and uses its terminology wherever possible. OV combines
//! replication, transaction commitment and concurrency control into a single
//! protocol.
//!
//! ## Summary
//!
//! The short version of the protocol is:
//!
//!   - Transactions are represented as deterministic thunks over snapshots.
//!
//!   - Each transaction is assigned a globally-unique timestamp.
//!
//!   - Transactions are separated into two phases: S-phase and E-phase.
//!
//!   - S-phase (storage) consists of coordination-free "blind quorum-writes"
//!     replicating the thunks into their MVCC order on each replica.
//!
//!   - A watermark tracking minimum transaction timestamps-being-written is
//!     gossiped between peers, increasing as quorum-writes complete.
//!
//!   - A transaction only enters E-phase after the watermark advances past it.
//!
//!   - E-phase (evaluation) quorum-reads and evaluates thunks from consistent
//!     snapshots below the watermark, lazily resolving any earlier thunks.
//!     Everything below the watermark is coordination-free and deterministic.
//!
//! ## Caveats
//!
//! Nothing's perfect, and this crate is anything but:
//!
//!  - This crate is very incomplete and does not work yet. Don't use it for
//!    anything other than experiments and toys. Recovery, reconfiguration,
//!    timeouts and nontrivial fault tolerance paths _definitely_ don't work.
//!
//!  - It also (somewhat recklessly) attempts to combine OV's reconfiguration
//!    and gossip protocols into an instance of the [concorde] reconfigurable
//!    lattice agreement protocol. This might not even be _theoretically_ safe.
//!
//!  - It is much more minimal than the full OV protocol: there's no support
//!    for sharding, nor the two-level peer-vs-datacenter locality organization.
//!    This crate treats its whole peer group as a single symmetric shard.
//!
//!  - As a result, performance won't be "webscale" or anything. It will scale
//!    vertically if you throw cores at it, but no better, and its latency will
//!    always have speed-of-light WAN RTT factors in it. It's distributed for
//!    fault tolerance, not horizontal scaling.
//!
//!  - As with OV, this crate does require partial clock synchronization. It
//!    doesn't need to be very tight: clock drift only causes increased
//!    latency as the watermarks progress as the minimum of all times; it
//!    doesn't affect correctness. Normal weak-NTP-level sync should be ok.
//!
//!  - As with OV, Calvin, and all deterministic databases: your txns have to be
//!    deterministic and must have deterministic _read and write sets_. If they
//!    cannot have their read and write sets statically computed (eg. if they
//!    rely on the data to decide read and write set) you have to build slightly
//!    awkward multi-phase txns. The term in the literature is "reconnaisance
//!    queries".
//!
//! ## Reference
//!
//! Hua Fan and Wojciech Golab. Ocean Vista: Gossip-Based Visibility Control for
//! Speedy Geo-Distributed Transactions. PVLDB, 12(11): 1471-1484, 2019.
//!
//! DOI: <https://doi.org/10.14778/3342263.3342627>
//!
//! <http://www.vldb.org/pvldb/vol12/p1471-fan.pdf>
//!
//! ## Name
//!
//! Wikipedia:
//!
//! > A water clock or clepsydra (Greek κλεψύδρα from κλέπτειν kleptein, 'to
//! > steal'; ὕδωρ hydor, 'water') is any timepiece by which time is measured by
//! > the regulated flow of liquid into (inflow type) or out from (outflow type)
//! > a vessel, and where the amount is then measured.
//!

#![allow(dead_code)]

use futures::Future;
use serde::{Deserialize, Serialize};
use std::{fmt::Debug, pin::Pin};
use thiserror::Error;

#[derive(Error, Debug, Clone, PartialOrd, Ord, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub enum Error {
    #[error("Txn was aborted")]
    TxnAbort,
    #[error("Requested key was neither read nor written")]
    MissingKey,
    #[error("Replication failed")]
    ReplicationFailed,
    #[error("Read failed")]
    ReadFailed,
    #[error("Inconsistent replicas")]
    InconsistentReplicas,
    #[error("Too few replicas")]
    TooFewReplicas,
    #[error("Networking error")]
    NetworkingError,
    #[error("Unexpected response")]
    UnexpectedResponse,
}

impl From<edelcrantz::Error> for Error {
    fn from(_: edelcrantz::Error) -> Self {
        Error::NetworkingError
    }
}

mod agreement;
mod database;
mod globaltime;
mod language;
mod network;
mod quorum;
mod replication;
mod storage;
mod tidmgr;
mod transaction;
mod watermarks;

// We define a BoxFuture-like wrapper type here and wrap most of our nontrivial
// async fn calls in it, for compilation and code footprint reasons: it costs an
// extra heap allocation per async call, but means the library compiles faster,
// can handle recursive futures, and doesn't require compiler pragmas to
// override the maximum allowed type size.
//
// We don't use the standard BoxFuture type because we want our boxed futures to
// also implement Sync, which the standard one doesn't.
type SyncBoxFuture<T> = Pin<Box<dyn Future<Output = T> + 'static + Send + Sync>>;

pub use database::Database;
pub use globaltime::GlobalTime;
pub use language::{ExtVal, Lang};
pub use network::PeerID;
pub(crate) use quorum::{majority_quorum, super_quorum};
pub use replication::ReplicationTag;
pub use storage::{Entry, KeyVer, Store};
pub(crate) use tidmgr::TidMgr;
pub use tidmgr::{Clock, RealClock, TestClock};
pub(crate) use transaction::Txn;
pub use watermarks::Sdw;
pub(crate) use watermarks::{
    GroupWatermarks, RWatermark, ServerWatermarksLD, ServerWatermarksLE, ServerWatermarksLEExt,
    Srw, Svw, VWatermark,
};