1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
// Copyright 2020 Graydon Hoare <graydon@pobox.com> // Licensed under the MIT and Apache-2.0 licenses. //! # Overview //! //! This is a work-in-progress implementation of a core protocol for a minimalist //! distributed database. It strives to be as small and simple as possible while //! attempting to provide relatively challenging features: //! //! - Strict Serializability //! //! - Online Reconfiguration //! //! - Fault tolerance //! //! - High throughput //! //! The implementation is based on a simplified version of the "Ocean Vista" //! (OV) protocol, and uses its terminology wherever possible. OV combines //! replication, transaction commitment and concurrency control into a single //! protocol. //! //! ## Summary //! //! The short version of the protocol is: //! //! - Transactions are represented as deterministic thunks over snapshots. //! //! - Each transaction is assigned a globally-unique timestamp. //! //! - Transactions are separated into two phases: S-phase and E-phase. //! //! - S-phase (storage) consists of coordination-free "blind quorum-writes" //! replicating the thunks into their MVCC order on each replica. //! //! - A watermark tracking minimum transaction timestamps-being-written is //! gossiped between peers, increasing as quorum-writes complete. //! //! - A transaction only enters E-phase after the watermark advances past it. //! //! - E-phase (evaluation) quorum-reads and evaluates thunks from consistent //! snapshots below the watermark, lazily resolving any earlier thunks. //! Everything below the watermark is coordination-free and deterministic. //! //! ## Caveats //! //! Nothing's perfect, and this crate is anything but: //! //! - This crate is very incomplete and does not work yet. Don't use it for //! anything other than experiments and toys. Recovery, reconfiguration, //! timeouts and nontrivial fault tolerance paths _definitely_ don't work. //! //! - It also (somewhat recklessly) attempts to combine OV's reconfiguration //! and gossip protocols into an instance of the [concorde] reconfigurable //! lattice agreement protocol. This might not even be _theoretically_ safe. //! //! - It is much more minimal than the full OV protocol: there's no support //! for sharding, nor the two-level peer-vs-datacenter locality organization. //! This crate treats its whole peer group as a single symmetric shard. //! //! - As a result, performance won't be "webscale" or anything. It will scale //! vertically if you throw cores at it, but no better, and its latency will //! always have speed-of-light WAN RTT factors in it. It's distributed for //! fault tolerance, not horizontal scaling. //! //! - As with OV, this crate does require partial clock synchronization. It //! doesn't need to be very tight: clock drift only causes increased //! latency as the watermarks progress as the minimum of all times; it //! doesn't affect correctness. Normal weak-NTP-level sync should be ok. //! //! - As with OV, Calvin, and all deterministic databases: your txns have to be //! deterministic and must have deterministic _read and write sets_. If they //! cannot have their read and write sets statically computed (eg. if they //! rely on the data to decide read and write set) you have to build slightly //! awkward multi-phase txns. The term in the literature is "reconnaisance //! queries". //! //! ## Reference //! //! Hua Fan and Wojciech Golab. Ocean Vista: Gossip-Based Visibility Control for //! Speedy Geo-Distributed Transactions. PVLDB, 12(11): 1471-1484, 2019. //! //! DOI: <https://doi.org/10.14778/3342263.3342627> //! //! <http://www.vldb.org/pvldb/vol12/p1471-fan.pdf> //! //! ## Name //! //! Wikipedia: //! //! > A water clock or clepsydra (Greek κλεψύδρα from κλέπτειν kleptein, 'to //! > steal'; ὕδωρ hydor, 'water') is any timepiece by which time is measured by //! > the regulated flow of liquid into (inflow type) or out from (outflow type) //! > a vessel, and where the amount is then measured. //! #![allow(dead_code)] use futures::Future; use serde::{Deserialize, Serialize}; use std::{fmt::Debug, pin::Pin}; use thiserror::Error; #[derive(Error, Debug, Clone, PartialOrd, Ord, PartialEq, Eq, Hash, Serialize, Deserialize)] pub enum Error { #[error("Txn was aborted")] TxnAbort, #[error("Requested key was neither read nor written")] MissingKey, #[error("Replication failed")] ReplicationFailed, #[error("Read failed")] ReadFailed, #[error("Inconsistent replicas")] InconsistentReplicas, #[error("Too few replicas")] TooFewReplicas, #[error("Networking error")] NetworkingError, #[error("Unexpected response")] UnexpectedResponse, } impl From<edelcrantz::Error> for Error { fn from(_: edelcrantz::Error) -> Self { Error::NetworkingError } } mod agreement; mod database; mod globaltime; mod language; mod network; mod quorum; mod replication; mod storage; mod tidmgr; mod transaction; mod watermarks; // We define a BoxFuture-like wrapper type here and wrap most of our nontrivial // async fn calls in it, for compilation and code footprint reasons: it costs an // extra heap allocation per async call, but means the library compiles faster, // can handle recursive futures, and doesn't require compiler pragmas to // override the maximum allowed type size. // // We don't use the standard BoxFuture type because we want our boxed futures to // also implement Sync, which the standard one doesn't. type SyncBoxFuture<T> = Pin<Box<dyn Future<Output = T> + 'static + Send + Sync>>; pub use database::Database; pub use globaltime::GlobalTime; pub use language::{ExtVal, Lang}; pub use network::PeerID; pub(crate) use quorum::{majority_quorum, super_quorum}; pub use replication::ReplicationTag; pub use storage::{Entry, KeyVer, Store}; pub(crate) use tidmgr::TidMgr; pub use tidmgr::{Clock, RealClock, TestClock}; pub(crate) use transaction::Txn; pub use watermarks::Sdw; pub(crate) use watermarks::{ GroupWatermarks, RWatermark, ServerWatermarksLD, ServerWatermarksLE, ServerWatermarksLEExt, Srw, Svw, VWatermark, };