Expand description
A library for in-memory collections with simple, performant and composable indexes.
§Example
use composable_indexes::{Collection, index, aggregation};
struct Person { name: String, age: u32, occupation: String }
// Define indexes on the Person struct
#[derive(composable_indexes::Index)]
#[index(Person)]
struct PersonIndex {
by_name: index::Premap<Person, String, index::HashTable<String>>,
by_age: index::PremapOwned<Person, u32, index::BTree<u32>>,
by_occupation: index::Grouped<Person, String, aggregation::Count>,
}
// Create the collection.
let mut collection = Collection::new(
PersonIndex {
by_name: index::Premap::new(|p: &Person| &p.name, index::HashTable::new()),
by_age: index::PremapOwned::new(|p: &Person| p.age, index::BTree::new()),
by_occupation: index::Grouped::new(|p: &Person| &p.occupation, || aggregation::Count::new()),
}
);
// insert items
collection.insert(Person { name: "Alice".to_string(), age: 30, occupation: "Engineer".to_string() });
collection.insert(Person { name: "Bob".to_string(), age: 25, occupation: "Designer".to_string() });
// Query by name - returns keys, use get_one to get the first match
let alice_key = collection.query(|ix| ix.by_name.get_one(&"Alice".to_string()));
// Query oldest person
let _oldest = collection.query(|ix| ix.by_age.max_one());
// Query the count of designers
let _designer_count = collection.query(|ix| ix.by_occupation.get(&"Designer".to_string()).count());§Motivation
Rust standard library (and ecosystem) provides excellent in-memory data structures like HashMap,
BTreeMap, Vec, and so on. However, if we need to index multiple fields of a data structure,
we will need to build and maintain multiple data structures, and carefully write code to keep
them in sync.
composable-indexes aims to solve this problem by introducing a concept called Indexes, which
are composable data structures that are declared upfront and automatically kept in sync with the
main collection. It’s designed to be lightweight, easy to extend, and efficient.
§Collection
At the core of composable-indexes is the Collection<T, Ix> type, which represents
a collection of items of type T, indexed by a set of indexes Ix. A Collection owns the data,
and the indexes hold pointers to the data in the collection. When we insert, update, or delete items
in the collection, it automatically updates the indexes accordingly.
When inserting an item into a Collection, it returns a Key which can be used to refer to
the inserted item later (for updates or deletions) without resorting to a query.
Warning: The collection can only guarantee the validity of the index as long as the items are not
mutated in place. This should only be possible using interior mutability (e.g., Cell, RefCell,
Mutex, atomics etc.). So - do not use interior mutability on the fields that are indexed. This can
not only cause indexes to go out of sync, but more importantly might cause panics or undefined behaviour.
But the more interesting part is querying. Methods like Collection::query, Collection::update and Collection::delete methods accept a callback that receives a reference to the index structure, allowing us to perform efficient lookups using the indexes defined on the collection. There is also counterparts like Collection::query_keys that return the keys of the items matching the query, or Collection::query_with_keys that return both the keys and the items.
Keep reading to learn more about indexes & querying!
§Indexes
Each collection is attached to an an index. And index is a data structure that implements the Index trait - which the Collection uses to keep the index in sync with the data. The indexes can be accessed with the methods like Collection::query mentioned above, and expose methods to perform queries efficiently. The results of those queries are ordinary Rust data structures that implement QueryResult trait - which is then used by the collection to “translate” the result of the query to the actual data or perform updates/deletions.
Most used indexes are index::HashTable and index::BTree:
- index::HashTable wraps a HashMap and provides efficient lookups by key. It’s the most commonly used index for equality lookups. Answer queries like “get all the items with field X equal to Y”.
- index::BTree wraps a BTreeMap and provides efficient range queries (on top of equality lookups). Answer queries like “get all the items with field X is greater than A” or “get the item where the field X is the biggest”).
This brings us to the composable part of composable-indexes. This library contains “higher-order indexes” that
can wrap other indexes to:
- Apply the index to a specific field of the data (index::Premap)
- Apply the index to a subset of the data (index::Filtered)
- Group the data by a key and apply an index/aggregation to each group (index::Grouped)
On top of those - tuples also implement the Index trait, alllowing combining multiple indexes in parallel. Alternatively - the Index derive macro can be used to define tuple-like indexes with named fields.
§Performance
composable-indexes is designed with performance in mind. The interfaces are designed
to compile away, and only expose the underlying data structures. In other words, think
of a Collection as a way to translate operations to the underlying index structures
at compile time without adding (significant) runtime overhead.
Data structures in composable-indexes hold the data entirely in memory. Usually, the
data is owned by the Collection itself, and indexes hold pointers to the data in the
collection (called a Key). This means that for most queries you can expect one
lookup to the index structure to obtain the pointer, and then one lookup to the
collection to obtain the actual data.
§Index Performance
The common indexes (BTree, HashTable) are simply thin wrappers around
std::collections::BTreeMap and std::collections::HashMap, so you can expect the
same performance characteristics as those data structures. They are keyed by the input
(usually a field of the stored type) and values are sets of pointers to the actual
data stored in the collection.
Higher order indexes like index::Filtered, index::Premap are all zero-cost abstractions and have negligible overhead.
Important: Because of not doing bookkeeping themselves, the functions passed to higher-order indexes should be fast to compute, as they will not be cached and are computed on-the-fly. Ideally, they should be things like field accesses rather than expensive computations.
The most commonly used indexes are HashTable for equality lookups and BTree for
range queries. Between those two, hashtables are the fastest. They also come with
immutable counterparts (with the imbl feature) which tend to be slower, but allow
cheap cloning and multi-versioning of the database.
| Index Type | Operations | Insert | Remove | Query | Memory |
|---|---|---|---|---|---|
index::Keys | get all keys | O(1) | O(1) | O(n) | O(n) |
index::HashTable | contains, get, count distinct | O(1) | O(1) | O(1) | O(n) |
index::BTree | contains, get, count distinct, range, min, max | O(log n) | O(log n) | O(log n) | O(n) |
index::SuffixTree | string search | O(k * log n) † | O(k * log n) † | O(log n) | O(n) ‡ |
| Aggregations | count, sum, mean, stddev | O(1) | O(1) | O(1) | O(1) |
† k = length of the string
‡ Suffix trees have have a high memory footprint (even though linear), expect 5-10 times the length of the input strings.
§Aggregation Performance
All built-in aggregations are calculated iteratively, without holding the data in memory. You can expect O(1) memory and time complexity regardless of the size of the collection.
As an example, aggregations::Count simply increments and decrements a counter as
items are inserted and removed, aggregations::Mean only keeps track of the sum and
count and so on.
§Indexing Overhead
A Collection is simply a HashMap, and indexes are additional data structures.
Hence, inserting an element into a Collection simply compiles down to inserting
the element into the underlying HashMap, and then inserting pointers to the same
element into each of the indexes. Hence, the overhead of adding indexes is linear
in the number of indexes.
As an example, in the benchmark below we compare inserting elements into a
std::collections::HashMap versus inserting the same elements into a
Collection with zero, one, two, three, and four indexes.
You can see that without an index, the performance is exactly the same as a
HashMap, and adding an index linearly increases the insertion time.
§Security
As both Collection and index::HashTable index are backed by hash maps, the choice of the
hash function can have a significant impact on performance. composable-indexes
defaults to the default hasher of hashbrown, which is foldhash that is fast,
but prone to HashDoS attacks. If you need a different
hasher, they can be provided when constructing collection and the indexes. See foldhash’s
README for more details: https://github.com/orlp/foldhash
Modules§
- aggregation
- Module providing aggregation indexes for computing aggregate values over collections.
- core
- Internal module containing core traits and types mostly useful for implementing your own indexes.
- index
- Module providing various index implementations and combinators. Includes basic indexes like BTree and HashTable, as well as combinators for transforming, grouping, and filtering indexes.
- testutils
Structs§
- Collection
- A collection of items, with an index that is automatically kept up-to-date.
- Key
- Unique identifier for an item in a collection.
Traits§
- Index
- Types that can be used as indexes in a
crate::Collection. - Query
Result - Trait for types that can be returned from queries.
- Query
Result Distinct - Sealed marker trait for QueryResults that only return distinct keys.
- Shallow
Clone - A marker trait for types that support cloning without having to copy all indexed values.
Derive Macros§
- Index
- Derives the
Indextrait for a struct where each field is itself anIndex. - Shallow
Clone - Derives the
ShallowClonetrait for a struct where each field is itself aShallowClone.