Module store


§In-Memory Graph Storage – Adjacency Lists, Indices, and Statistics

GraphStore is the central data structure of the Graphmind graph engine. It holds all nodes, edges, indices, and metadata in memory, optimized for the access patterns of a graph query engine.

§Why adjacency lists, not adjacency matrices?

An adjacency matrix for V vertices requires O(V^2) space and O(V) time to find a node’s neighbors. Real-world graphs are overwhelmingly sparse – a social network with 1 million users might have 100 million edges, but a matrix would need 10^12 cells. Adjacency lists use O(V + E) space and give O(degree) neighbor iteration – a far better fit.
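
The space and iteration argument can be made concrete with a minimal sketch (a standalone illustration, not the GraphStore API): `adj[v]` holds node `v`'s out-neighbors, so storage is one slot per node plus one entry per edge.

```rust
/// Build adjacency lists: O(V + E) space, O(degree) neighbor iteration.
fn build_adjacency(num_nodes: usize, edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut adj = vec![Vec::new(); num_nodes];
    for &(src, dst) in edges {
        adj[src].push(dst);
    }
    adj
}

fn main() {
    let adj = build_adjacency(5, &[(0, 1), (0, 2), (3, 4)]);
    // Iterating node 0's neighbors touches 2 entries, not all 5 nodes.
    assert_eq!(adj[0], vec![1, 2]);
    // Total edge storage: 3 entries plus 5 outer slots = O(V + E),
    // versus 25 cells for a 5x5 adjacency matrix.
    let stored: usize = adj.iter().map(|list| list.len()).sum();
    assert_eq!(stored, 3);
}
```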

§Arena allocation with MVCC versioning

Nodes and edges are stored in Vec<Vec<T>> structures (arena allocation). The outer Vec is indexed by entity ID (acting as a dense array), and the inner Vec holds successive MVCC versions of that entity. This layout gives O(1) lookup by ID (direct indexing, no hash computation), excellent cache locality for sequential scans, and natural support for snapshot reads at any historical version. This is similar to how PostgreSQL stores tuple versions in heap pages, but without the overhead of disk I/O.
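
A minimal sketch of this layout (type and method names are hypothetical, not the actual GraphStore API): the outer `Vec` index doubles as the entity ID, and each inner `Vec` is an append-only list of `(version, value)` pairs, so a snapshot read scans backwards for the newest version at or below the requested snapshot.

```rust
/// Arena with MVCC versioning: outer Vec indexed by dense entity ID,
/// inner Vec holding successive (commit_version, payload) versions.
struct VersionedArena<T> {
    slots: Vec<Vec<(u64, T)>>,
}

impl<T> VersionedArena<T> {
    fn new() -> Self {
        Self { slots: Vec::new() }
    }

    /// Allocate a new entity; its ID is its index — O(1) lookup, no hashing.
    fn insert(&mut self, version: u64, value: T) -> usize {
        self.slots.push(vec![(version, value)]);
        self.slots.len() - 1
    }

    /// Append a new version of an existing entity (old versions are kept).
    fn update(&mut self, id: usize, version: u64, value: T) {
        self.slots[id].push((version, value));
    }

    /// Snapshot read: the newest version visible at `snapshot`.
    fn read_at(&self, id: usize, snapshot: u64) -> Option<&T> {
        self.slots[id]
            .iter()
            .rev()
            .find(|(version, _)| *version <= snapshot)
            .map(|(_, value)| value)
    }
}

fn main() {
    let mut arena = VersionedArena::new();
    let id = arena.insert(1, "Alice");
    arena.update(id, 5, "Alice v2");
    assert_eq!(arena.read_at(id, 3), Some(&"Alice")); // historical snapshot
    assert_eq!(arena.read_at(id, 9), Some(&"Alice v2")); // latest version
}
```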

§Sorted adjacency lists and edge_between()

The outgoing and incoming adjacency lists store Vec<(NodeId, EdgeId)> tuples sorted by NodeId. This enables edge_between(a, b) to use binary search with O(log d) complexity (where d = degree of the node), which is critical for the ExpandIntoOperator that checks edge existence between two already-bound nodes in triangle-pattern queries.
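
The lookup itself is a one-liner over the sorted list; the sketch below assumes the `Vec<(NodeId, EdgeId)>` layout described above (the function body is illustrative, not the actual implementation).

```rust
type NodeId = u64;
type EdgeId = u64;

/// O(log d) edge lookup in a node's out-list, kept sorted by neighbor NodeId.
/// Returns the connecting edge to `target`, if one exists.
fn edge_between(out_list: &[(NodeId, EdgeId)], target: NodeId) -> Option<EdgeId> {
    out_list
        .binary_search_by_key(&target, |&(node, _)| node)
        .ok()
        .map(|idx| out_list[idx].1)
}

fn main() {
    // Out-list of some node: neighbors 2, 5, 9 via edges 10, 11, 12.
    let out_list = [(2, 10), (5, 11), (9, 12)];
    assert_eq!(edge_between(&out_list, 5), Some(11));
    assert_eq!(edge_between(&out_list, 7), None);
}
```

With parallel edges between the same pair of nodes, `binary_search_by_key` returns an arbitrary match, so a multigraph variant would widen the hit to the full run of equal `NodeId` entries.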

§Secondary indices: label and edge-type

label_index: HashMap<Label, HashSet<NodeId>> and edge_type_index: HashMap<EdgeType, HashSet<EdgeId>> act as secondary indices, analogous to B-tree indices in an RDBMS. When a Cypher query specifies MATCH (n:Person), the engine looks up the Person entry in label_index and scans only matching nodes, avoiding a full table scan. We use HashMap (not BTreeMap) because we only need equality lookups on labels, not range scans.
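
The lookup path for `MATCH (n:Person)` reduces to one hash probe followed by a scan of only the matching IDs; a minimal sketch (helper name hypothetical):

```rust
use std::collections::{HashMap, HashSet};

type NodeId = u64;

/// One O(1) hash lookup instead of a full node scan.
fn nodes_with_label(
    index: &HashMap<String, HashSet<NodeId>>,
    label: &str,
) -> HashSet<NodeId> {
    index.get(label).cloned().unwrap_or_default()
}

fn main() {
    let mut label_index: HashMap<String, HashSet<NodeId>> = HashMap::new();
    label_index.entry("Person".to_string()).or_default().extend([1, 4, 7]);
    label_index.entry("City".to_string()).or_default().extend([2, 3]);

    // MATCH (n:Person): scan 3 candidates instead of all 5 nodes.
    let candidates = nodes_with_label(&label_index, "Person");
    assert_eq!(candidates.len(), 3);
    assert!(candidates.contains(&4));
    // An unknown label yields an empty candidate set.
    assert!(nodes_with_label(&label_index, "Company").is_empty());
}
```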

§ColumnStore integration (late materialization)

node_columns and edge_columns provide columnar storage for frequently accessed properties. In the traditional row store (PropertyMap on each node), reading one property from 1000 nodes touches 1000 scattered HashMaps. The columnar store groups all values of the same property contiguously, enabling vectorized reads and better CPU cache utilization. The query engine uses late materialization: scan operators produce lightweight Value::NodeRef(id) references instead of cloning full nodes, and properties are resolved on demand from the column store.
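
A sketch of the two ideas together (struct and field names hypothetical, property values simplified to `i64`): a column is one contiguous `Vec` per property, and operators pass a cheap `NodeRef` that resolves properties only when actually needed.

```rust
use std::collections::HashMap;

type NodeId = usize;

/// Columnar property storage: all values of one property sit contiguously,
/// so `columns["age"][id]` is node `id`'s age (None if unset).
struct ColumnStore {
    columns: HashMap<String, Vec<Option<i64>>>,
}

/// Late materialization: scan operators emit this lightweight reference
/// instead of cloning a full node.
#[derive(Clone, Copy)]
struct NodeRef(NodeId);

impl ColumnStore {
    /// Resolve a property on demand from its column.
    fn get(&self, node: NodeRef, property: &str) -> Option<i64> {
        self.columns.get(property)?.get(node.0).copied().flatten()
    }
}

fn main() {
    let store = ColumnStore {
        columns: HashMap::from([
            ("age".to_string(), vec![Some(31), None, Some(47)]),
        ]),
    };
    // A scan yields cheap refs; ages are fetched from one contiguous column.
    let refs = [NodeRef(0), NodeRef(1), NodeRef(2)];
    let ages: Vec<_> = refs.iter().map(|&r| store.get(r, "age")).collect();
    assert_eq!(ages, vec![Some(31), None, Some(47)]);
}
```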

§GraphStatistics for cost-based optimization

The query planner uses GraphStatistics to estimate the cardinality (number of rows) at each stage of a query plan, choosing the plan with the lowest estimated cost. See the struct documentation for details.
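
To make the idea concrete, here is a toy cardinality estimate for a label scan (the fields and method shown are illustrative assumptions, not the real GraphStatistics API): the planner would start the plan from whichever scan is estimated to produce fewer rows.

```rust
use std::collections::HashMap;

/// Toy statistics for cardinality estimation (hypothetical fields).
struct Stats {
    node_count: u64,
    label_counts: HashMap<String, u64>,
}

impl Stats {
    /// Estimated rows from a label scan like `MATCH (n:Person)`:
    /// the tracked label count, or the total node count as a fallback.
    fn estimate_label_scan(&self, label: &str) -> u64 {
        self.label_counts.get(label).copied().unwrap_or(self.node_count)
    }
}

fn main() {
    let stats = Stats {
        node_count: 1_000,
        label_counts: HashMap::from([("Person".to_string(), 120)]),
    };
    // A Person scan is estimated to be far more selective than a full scan,
    // so the planner would anchor the plan there.
    assert_eq!(stats.estimate_label_scan("Person"), 120);
    assert_eq!(stats.estimate_label_scan("Unlabeled"), 1_000);
}
```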

§Key Rust patterns

  • Vec<Vec<T>> arena: dense ID-indexed storage avoids HashMap overhead; inner Vec holds MVCC versions.
  • HashMap vs BTreeMap: HashMap is used for indices because we need O(1) point lookups (label equality), not ordered iteration. BTreeMap would add an unnecessary O(log n) factor.
  • Arc: vector_index and property_index are wrapped in Arc (atomic reference counting) for shared ownership across the query engine and background index-maintenance tasks.
  • thiserror: the GraphError enum uses the thiserror crate’s derive macro to auto-generate Display and Error implementations, reducing boilerplate for error types.
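
The last bullet can be illustrated by writing out, using only std, roughly what thiserror's derive generates for an error enum (the variants here are hypothetical placeholders, not GraphError's actual variants):

```rust
use std::fmt;

#[derive(Debug)]
enum StoreError {
    NodeNotFound(u64),
    InvalidQuery(String),
}

// thiserror derives a Display impl from per-variant `#[error("...")]`
// attributes; hand-written, it is a match over the variants:
impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StoreError::NodeNotFound(id) => write!(f, "node {id} not found"),
            StoreError::InvalidQuery(msg) => write!(f, "invalid query: {msg}"),
        }
    }
}

// It also derives std::error::Error, so the type works with `?`
// and `Box<dyn Error>`.
impl std::error::Error for StoreError {}

fn main() {
    let err = StoreError::NodeNotFound(42);
    assert_eq!(err.to_string(), "node 42 not found");
}
```

The derive removes exactly this boilerplate: one attribute per variant replaces the whole `Display` impl.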

§Requirements coverage

  • REQ-GRAPH-001: Property graph data model
  • REQ-MEM-001: In-memory storage
  • REQ-MEM-003: Memory-optimized data structures

Structs§

GraphStatistics
Statistics about graph contents for cost-based query optimization.
GraphStore
In-memory graph storage engine.
PropertyStats
Statistics about a specific property.

Enums§

GraphError
Errors that can occur during graph operations.

Type Aliases§

GraphResult