1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
//! Here's a look at where sled is at, and where it's going architecturally. The system is very much under active development, and we have a ways to go. If specific areas are interesting to you, I'd love to [work together](https://github.com/spacejam/sled/blob/master/CONTRIBUTING.md)! If your business has a need for particular items below, you can [fund development of particular features](https://opencollective.com/sled). //! //! People face unnecessary hardship when working with existing embedded //! databases. They tend to have sharp performance trade-offs, are difficult to //! tune, have unclear consistency guarantees, and are generally inflexible. //! Facebook uses distributed machine learning to find configurations that //! achieve great performance for specific workloads on rocksdb. Most engineers //! don't have access to that kind of infrastructure. We would like to build //! sled so that it can be optimized using simple local methods, with as little //! user input as possible, and in many cases exceed the performance of popular //! systems today. //! //! This is how we aim to improve the situation: //! //! * low configuration required to get great performance on a wide variety of //! workloads by using a modified Bw-Tree and keeping workload metrics that //! allow us to self-tune //! * first-class subscriber semantics for operations on specified prefixes of //! keys //! * first-class programmatic access to the binary replication stream //! * serializable transactions //! //! ### Indexing //! //! sled started as an implementation of a Bw-Tree, but over time has moved away //! from certain architectural aspects that have been difficult to tune. The //! first thing to be dropped from the original Bw-Tree design was the in-memory //! representation of a node as a long linked list of updates, terminating in //! the actual base tree node. It was found that by leaning into atomic //! reference counting, it became quite performant to perform RCU on entire tree //! nodes for every update, because a tree node only needs 2 allocations (the //! node itself, and a vector of children). All other items are protected by //! their own rust `Arc`. This made reads dramatically faster, and allowed them //! to avoid allocations that were required previously to build up a dynamic //! "view" over a chain of partial updates. //! //! A current area of effort is to store tree nodes as a Rust //! `RwLock<Arc<Node>>`. The Rust `Arc` has a cool method called `make_mut` //! which can provide mutable access to an Arc if the strong count is 1, or make //! a clone if it isn't and then provide a mutable reference to the local clone. //! This will allow us to perform even fewer allocations and avoid RCU on the //! tree nodes in cases of lower contention. Nesting an `Arc` in a lock //! structure allows for an interesting "snapshot read" semantic that allows //! writers not to block on readers. It is a middle ground between a `RwLock` //! and RCU that trades lower memory pressure for occasional blocking when a //! writer is holding a writer lock. This is expected to be a fairly low cost, //! but benchmarks have not yet been produced for this prospective architecture. //! //! The merge and split strategies are kept from the Bw-Tree, but this might be //! switched to using pagecache-level transactions once a cicada-like //! transaction protocol is implemented on top of it. //! //! * [The Bw-Tree: A B-tree for New Hardware //! Platforms](https://15721.courses.cs.cmu.edu/spring2018/papers/08-oltpindexes1/bwtree-icde2013.pdf) //! * [Building a Bw-Tree Takes More Than Just Buzz Words](http://www.cs.cmu.edu/~huanche1/publications/open_bwtree.pdf) //! //! ### Caching //! //! sled uses a pagecache that is based on LLAMA. This lets us write small //! updates to pages without rewriting the entire page, achieving low write //! amplification. Flash storage lets us scatter random reads in parallel, so to //! read a logical page, we may read several fragments and collect them in //! memory. The pagecache can be used to back any high level structure, and //! provides a lock-free interface that supports RCU-style access patterns. When //! the number of page deltas reaches a certain length, we squish the page //! updates into a single blob. //! //! The caching is currently pretty naive. We use 256 cache shards by default. Each cache shard is a simple LRU cache implemented as a doubly-linked list protected by a `Mutex`. Future directions may take inspiration from ZFS's adaptive replacement cache, which will give us scan and thrash resistance. See [#65](https://github.com/spacejam/sled/issues/65). //! //! * [LLAMA: A Cache/Storage Subsystem for Modern Hardware](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/llama-vldb2013.pdf) //! * [Adaptive replacement cache](https://en.wikipedia.org/w/index.php?title=Adaptive_replacement_cache&oldid=865482923) //! //! ### Concurrency Control //! //! sled supports point reads and writes in serializable transactions across //! multiple trees. This is fairly limited, and does not yet use a //! high-performance concurrency control mechanism. In order to support scans, //! we need to be able to catch phantom conflicts. To do this, we are taking //! some inspiration from Cicada, in terms of how they include index nodes in //! transactions, providing a really nice way to materialize conflicts relating //! to phantoms. sled has an ID generator built into it now, accessible from the //! `generate_id` method on `Tree`. This can churn out 75-125 million unique //! monotonic ID's per second on a macbook pro, so we may not need to adopt //! Cicada's distributed timestamp generation techniques for a long time. We //! will be using Cicada's approach to adaptive validation, causing early aborts //! when higher contention is detected. //! //! * [Cicada: Dependably Fast Multi-Core In-Memory Transactions](http://15721.courses.cs.cmu.edu/spring2018/papers/06-mvcc2/lim-sigmod2017.pdf) //! //! ### Storage //! //! sled splits the main storage file into fixed-sized segments. We track which //! pages live in which segments. A page may live in several segments, because //! we support writing partial updates to a page with our LLAMA-like approach. //! When a page with several fragments is squished together, we mark the page as //! freed from the previous segments. When a segment reaches a configurable low //! threshold of live pages, we start moving the remaining pages to other //! segments so that underutilized segments can be reused, and we generally keep //! the amount of fragmentation in the system controlled. //! //! As of July 2019, sled is naive about where it puts rewritten pages. Future directions will separate base pages from page deltas, and possibly have generational considerations. See [#450](https://github.com/spacejam/sled/issues/450). Also, when values reach a particularly large size, it no longer makes sense to inline them in leaf nodes of the tree. Taking a cue from `WiscKey`, we can eventually split these out, but we can be much more fine grained about placement strategy over time. Generally, being smart about rewriting and defragmentation is where sled may carve out the largest performance gains over existing production and research systems. //! //! * [The Design and Implementation of a Log-Structured File System](https://people.eecs.berkeley.edu/~brewer/cs262/LFS.pdf) //! * [`WiscKey`: Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf) //! * [Monkey: Optimal Navigable Key-Value Store](http://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf) //! * [Designing Access Methods: The RUM Conjecture](https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf) //! * [The Unwritten Contract of Solid State Drives](http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf) //! * [The five-minute rule twenty years later, and how flash memory changes the //! rules](http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf) //! * [An Efficient Memory-Mapped Key-Value Store for `FlashStorage`](http://www.exanest.eu/pub/SoCC18_efficient_kv_store.pdf) //! * [Generalized File System Dependencies](http://featherstitch.cs.ucla.edu/publications/featherstitch-sosp07.pdf) //! //! ### Replication //! //! We want to give database implementors great tools for replicating their data //! backed by sled. We will provide first-class binary replication stream //! access, as well as subscriber to high level tree updates that happen on //! specified prefixes. These updates should be witnessed in the same order that //! they appear in the log by all consumers. //! //! We will likely include a default replication implementation, based either on //! raft or harpoon (raft but with leases instead of a paxos register-based //! leadership mechanism to protect against bouncing leadership in the presence //! of partitions). Additionally, we can get nice throughput gains over vanilla //! raft by separating the concerns of block replication and consensus on //! metadata. Blocks can be replicated in a more fragmented + p2p-like manner, //! with HOL-blocking-prone consensus being run on ordering of said blocks. This //! pushes a bit more complexity into `RequestVotes` compared to vanilla raft, //! but allows us to increase throughput a bit. //! //! ### Reclamation //! //! We use epoch-based reclamation to ensure that we don't free memory until any //! possible witnessing threads are done with their work. This is the mechanism //! that lets us return zero-copy to values in our pagecache for tree gets. //! //! Right now we use crossbeam-epoch for this. We may create a shallow fork //! (gladly contributed upstream if the maintainers are interested) that allows //! different kinds of workloads to bound the amount of garbage that they clean //! up, possibly punting more cleanups to a threadpool and operations that seem //! to prioritize throughput rather than latency. //! //! Possible future directions include using something like //! quiescent-state-based-reclamation, but we need to study more before //! considering alternative approaches. //! //! * [Comparative Performance of Memory Reclamation Strategies for Lock-free and Concurrently-readable Data Structures](http://www.cs.utoronto.ca/~tomhart/papers/tomhart_thesis.pdf) //! //! ### Checkpointing //! //! Sled has an extremely naive checkpoint strategy. It periodically takes the //! last snapshot, scans the segments in the log with an LSN higher than last //! LSN applied to the snapshot, building a snapshot from the segments it reads. //! A snapshot is effectively a CRDT, because it can use the LSN number on read //! messages as a last-write-wins register. It is currently the same mechanism //! as the recovery mechanism, where the data is read directly off the disk and //! page metadata is stored in a snapshot that is updated. The snapshot is //! entirely an optimization for recovery, and can be deleted without impacting //! recovery correctness. //! //! We are moving to a CRDT-like snapshot recovery technique, and we can easily //! parallelize recovery up until the "safety buffer" for the last few segments //! of the log. //! //! We would also like to move toward the delta-checkpoint model used in //! Hekaton, as it would allow us to further parallelize generation of //! checkpoint information. //! //! ### Misc Considerations //! //! * [How to Architect a Query Compiler, Revisited](https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod18.pdf) //! * shows that we can compile queries without resorting to complex //! implementations by utilizing Futamura projections //! * [CMU 15-721 (Spring 2018) Advanced Database Systems](https://15721.courses.cs.cmu.edu/spring2018/schedule.html) //! * a wonderful overview of the state of the art in various database //! topics. start here if you want to contribute deeply and don't know //! where to begin! //! * [Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask](http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf) //! * suggests that we should eventually aim for an approach that is //! shared-nothing across sockets, but lock-free within them