Crate composable_indexes

Crate composable_indexes 

Source
Expand description

A library for in-memory collections with simple, performant and composable indexes.

§Example

use composable_indexes::{Collection, index, aggregation};

struct Person { name: String, age: u32, occupation: String }

// Define indexes on the Person struct
#[derive(composable_indexes::Index)]
#[index(Person)]
struct PersonIndex {
  by_name: index::Premap<Person, String, index::HashTable<String>>,
  by_age: index::PremapOwned<Person, u32, index::BTree<u32>>,
  by_occupation: index::Grouped<Person, String, aggregation::Count>,
}

// Create the collection.
let mut collection = Collection::new(
  PersonIndex {
    by_name: index::Premap::new(|p: &Person| &p.name, index::HashTable::new()),
    by_age: index::PremapOwned::new(|p: &Person| p.age, index::BTree::new()),
    by_occupation: index::Grouped::new(|p: &Person| &p.occupation, || aggregation::Count::new()),
  }
);

// insert items
collection.insert(Person { name: "Alice".to_string(), age: 30, occupation: "Engineer".to_string() });
collection.insert(Person { name: "Bob".to_string(), age: 25, occupation: "Designer".to_string() });

// Query by name - returns keys, use get_one to get the first match
let alice_key = collection.query(|ix| ix.by_name.get_one(&"Alice".to_string()));

// Query oldest person
let _oldest = collection.query(|ix| ix.by_age.max_one());

// Query the count of designers
let _designer_count = collection.query(|ix| ix.by_occupation.get(&"Designer".to_string()).count());

§Motivation

Rust standard library (and ecosystem) provides excellent in-memory data structures like HashMap, BTreeMap, Vec, and so on. However, if we need to index multiple fields of a data structure, we will need to build and maintain multiple data structures, and carefully write code to keep them in sync.

composable-indexes aims to solve this problem by introducing a concept called Indexes, which are composable data structures that are declared upfront and automatically kept in sync with the main collection. It’s designed to be lightweight, easy to extend, and efficient.

§Collection

At the core of composable-indexes is the Collection<T, Ix> type, which represents a collection of items of type T, indexed by a set of indexes Ix. A Collection owns the data, and the indexes hold pointers to the data in the collection. When we insert, update, or delete items in the collection, it automatically updates the indexes accordingly.

When inserting an item into a Collection, it returns a Key which can be used to refer to the inserted item later (for updates or deletions) without resorting to a query.

Warning: The collection can only guarantee the validity of the index as long as the items are not mutated in place. This should only be possible using interior mutability (e.g., Cell, RefCell, Mutex, atomics etc.). So - do not use interior mutability on the fields that are indexed. This can not only cause indexes to go out of sync, but more importantly might cause panics or undefined behaviour.

But the more interesting part is querying. Methods like Collection::query, Collection::update and Collection::delete methods accept a callback that receives a reference to the index structure, allowing us to perform efficient lookups using the indexes defined on the collection. There is also counterparts like Collection::query_keys that return the keys of the items matching the query, or Collection::query_with_keys that return both the keys and the items.

Keep reading to learn more about indexes & querying!

§Indexes

Each collection is attached to an an index. And index is a data structure that implements the Index trait - which the Collection uses to keep the index in sync with the data. The indexes can be accessed with the methods like Collection::query mentioned above, and expose methods to perform queries efficiently. The results of those queries are ordinary Rust data structures that implement QueryResult trait - which is then used by the collection to “translate” the result of the query to the actual data or perform updates/deletions.

Most used indexes are index::HashTable and index::BTree:

  • index::HashTable wraps a HashMap and provides efficient lookups by key. It’s the most commonly used index for equality lookups. Answer queries like “get all the items with field X equal to Y”.
  • index::BTree wraps a BTreeMap and provides efficient range queries (on top of equality lookups). Answer queries like “get all the items with field X is greater than A” or “get the item where the field X is the biggest”).

This brings us to the composable part of composable-indexes. This library contains “higher-order indexes” that can wrap other indexes to:

On top of those - tuples also implement the Index trait, alllowing combining multiple indexes in parallel. Alternatively - the Index derive macro can be used to define tuple-like indexes with named fields.

§Performance

composable-indexes is designed with performance in mind. The interfaces are designed to compile away, and only expose the underlying data structures. In other words, think of a Collection as a way to translate operations to the underlying index structures at compile time without adding (significant) runtime overhead.

Data structures in composable-indexes hold the data entirely in memory. Usually, the data is owned by the Collection itself, and indexes hold pointers to the data in the collection (called a Key). This means that for most queries you can expect one lookup to the index structure to obtain the pointer, and then one lookup to the collection to obtain the actual data.

§Index Performance

The common indexes (BTree, HashTable) are simply thin wrappers around std::collections::BTreeMap and std::collections::HashMap, so you can expect the same performance characteristics as those data structures. They are keyed by the input (usually a field of the stored type) and values are sets of pointers to the actual data stored in the collection.

Higher order indexes like index::Filtered, index::Premap are all zero-cost abstractions and have negligible overhead.

Important: Because of not doing bookkeeping themselves, the functions passed to higher-order indexes should be fast to compute, as they will not be cached and are computed on-the-fly. Ideally, they should be things like field accesses rather than expensive computations.

The most commonly used indexes are HashTable for equality lookups and BTree for range queries. Between those two, hashtables are the fastest. They also come with immutable counterparts (with the imbl feature) which tend to be slower, but allow cheap cloning and multi-versioning of the database.

Index TypeOperationsInsertRemoveQueryMemory
index::Keysget all keysO(1)O(1)O(n)O(n)
index::HashTablecontains, get, count distinctO(1)O(1)O(1)O(n)
index::BTreecontains, get, count distinct, range, min, maxO(log n)O(log n)O(log n)O(n)
index::SuffixTreestring searchO(k * log n) †O(k * log n) †O(log n)O(n) ‡
Aggregationscount, sum, mean, stddevO(1)O(1)O(1)O(1)

† k = length of the string

‡ Suffix trees have have a high memory footprint (even though linear), expect 5-10 times the length of the input strings.

§Aggregation Performance

All built-in aggregations are calculated iteratively, without holding the data in memory. You can expect O(1) memory and time complexity regardless of the size of the collection.

As an example, aggregations::Count simply increments and decrements a counter as items are inserted and removed, aggregations::Mean only keeps track of the sum and count and so on.

§Indexing Overhead

A Collection is simply a HashMap, and indexes are additional data structures. Hence, inserting an element into a Collection simply compiles down to inserting the element into the underlying HashMap, and then inserting pointers to the same element into each of the indexes. Hence, the overhead of adding indexes is linear in the number of indexes.

As an example, in the benchmark below we compare inserting elements into a std::collections::HashMap versus inserting the same elements into a Collection with zero, one, two, three, and four indexes. You can see that without an index, the performance is exactly the same as a HashMap, and adding an index linearly increases the insertion time.

indexing_overhead: Comparison Average time (ms) Input 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 7000.0 8000.0 9000.0 10000.0 collection_with_1_index collection_with_2_indexes collection_with_3_indexes collection_with_4_indexes collection_with_no_index hashmap_with_default_hasher hashmap_with_foldhash_hasher

§Security

As both Collection and index::HashTable index are backed by hash maps, the choice of the hash function can have a significant impact on performance. composable-indexes defaults to the default hasher of hashbrown, which is foldhash that is fast, but prone to HashDoS attacks. If you need a different hasher, they can be provided when constructing collection and the indexes. See foldhash’s README for more details: https://github.com/orlp/foldhash

Modules§

aggregation
Module providing aggregation indexes for computing aggregate values over collections.
core
Internal module containing core traits and types mostly useful for implementing your own indexes.
index
Module providing various index implementations and combinators. Includes basic indexes like BTree and HashTable, as well as combinators for transforming, grouping, and filtering indexes.
testutils

Structs§

Collection
A collection of items, with an index that is automatically kept up-to-date.
Key
Unique identifier for an item in a collection.

Traits§

Index
Types that can be used as indexes in a crate::Collection.
QueryResult
Trait for types that can be returned from queries.
QueryResultDistinct
Sealed marker trait for QueryResults that only return distinct keys.
ShallowClone
A marker trait for types that support cloning without having to copy all indexed values.

Derive Macros§

Index
Derives the Index trait for a struct where each field is itself an Index.
ShallowClone
Derives the ShallowClone trait for a struct where each field is itself a ShallowClone.