§In-Memory Graph Storage – Adjacency Lists, Indices, and Statistics
GraphStore is the central data structure of the Graphmind graph engine. It
holds all nodes, edges, indices, and metadata in memory, optimized for the
access patterns of a graph query engine.
§Why adjacency lists, not adjacency matrices?
An adjacency matrix for V vertices requires O(V^2) space and O(V) time to find a node’s neighbors. Real-world graphs are overwhelmingly sparse – a social network with 1 million users might have 100 million edges, but a matrix would need 10^12 cells. Adjacency lists use O(V + E) space and give O(degree) neighbor iteration – a far better fit.
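The space/time trade-off above can be sketched with a minimal adjacency-list graph (names here are illustrative, not the real `GraphStore` API): each node stores only its own neighbor list, so storage is O(V + E) and neighbor iteration touches only `degree` entries.

```rust
type NodeId = usize;

// Minimal adjacency-list sketch: out[v] holds the outgoing neighbors of v,
// so space is O(V + E) and neighbor iteration is O(degree), independent of
// the total vertex count.
#[derive(Default)]
struct AdjacencyList {
    out: Vec<Vec<NodeId>>,
}

impl AdjacencyList {
    fn add_node(&mut self) -> NodeId {
        self.out.push(Vec::new());
        self.out.len() - 1
    }

    fn add_edge(&mut self, from: NodeId, to: NodeId) {
        self.out[from].push(to);
    }

    fn neighbors(&self, v: NodeId) -> &[NodeId] {
        &self.out[v]
    }
}
```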
§Arena allocation with MVCC versioning
Nodes and edges are stored in Vec<Vec<T>> structures (arena allocation).
The outer Vec is indexed by entity ID (acting as a dense array), and the
inner Vec holds successive MVCC versions of that entity. This layout
gives O(1) lookup by ID (direct indexing, no hash computation), excellent
cache locality for sequential scans, and natural support for snapshot reads
at any historical version. This is similar to how PostgreSQL stores tuple
versions in heap pages, but without the overhead of disk I/O.
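A rough sketch of the `Vec<Vec<T>>` layout described above (field and type names are assumptions, simplified from the real structures): the outer `Vec` is indexed by dense entity ID, the inner `Vec` accumulates MVCC versions, and a snapshot read scans that entity's versions backwards for the newest one visible to the reading transaction.

```rust
// One MVCC version of a node, heavily simplified.
#[derive(Debug)]
struct NodeVersion {
    txn_id: u64,   // transaction that wrote this version
    label: String, // payload, reduced to a single field
}

#[derive(Default)]
struct NodeArena {
    // slots[id] = versions of node `id`, oldest first (arena allocation:
    // O(1) lookup by direct indexing, no hash computation).
    slots: Vec<Vec<NodeVersion>>,
}

impl NodeArena {
    fn insert(&mut self, label: &str, txn_id: u64) -> usize {
        self.slots.push(vec![NodeVersion { txn_id, label: label.into() }]);
        self.slots.len() - 1
    }

    fn update(&mut self, id: usize, label: &str, txn_id: u64) {
        self.slots[id].push(NodeVersion { txn_id, label: label.into() });
    }

    /// Snapshot read: newest version whose writer committed at or before
    /// `snapshot_txn` (reverse scan over this one entity's versions).
    fn read_at(&self, id: usize, snapshot_txn: u64) -> Option<&NodeVersion> {
        self.slots[id].iter().rev().find(|v| v.txn_id <= snapshot_txn)
    }
}
```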
§Sorted adjacency lists and edge_between()
The outgoing and incoming adjacency lists store Vec<(NodeId, EdgeId)>
tuples sorted by NodeId. This enables edge_between(a, b) to use
binary search with O(log d) complexity (where d = degree of the node),
which is critical for the ExpandIntoOperator that checks edge existence
between two already-bound nodes in triangle-pattern queries.
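A minimal sketch of the binary-search lookup this enables (a free function here, standing in for the real `edge_between` method): because the adjacency list is sorted by target `NodeId`, `binary_search_by_key` finds an edge to `b` in O(log d).

```rust
type NodeId = u64;
type EdgeId = u64;

// Look up the edge from the owner of `adjacency` to node `b`, assuming the
// list is sorted by target NodeId. O(log d) where d = the node's degree.
fn edge_between(adjacency: &[(NodeId, EdgeId)], b: NodeId) -> Option<EdgeId> {
    adjacency
        .binary_search_by_key(&b, |&(target, _)| target)
        .ok()
        .map(|i| adjacency[i].1)
}
```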
§Secondary indices: label and edge-type
label_index: HashMap<Label, HashSet<NodeId>> and
edge_type_index: HashMap<EdgeType, HashSet<EdgeId>> act as secondary
indices, analogous to B-tree indices in an RDBMS. When a Cypher query
specifies MATCH (n:Person), the engine looks up the Person entry in
label_index and scans only matching nodes, avoiding a full table scan.
We use HashMap (not BTreeMap) because we only need equality lookups
on labels, not range scans.
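A sketch of the label-index lookup path (struct and method names are illustrative): `MATCH (n:Person)` becomes one `HashMap` lookup followed by a scan over only the matching node IDs, rather than a scan over every node.

```rust
use std::collections::{HashMap, HashSet};

type NodeId = u64;

// Secondary index from label to the set of nodes carrying that label,
// analogous to label_index: HashMap<Label, HashSet<NodeId>>.
#[derive(Default)]
struct LabelIndex {
    index: HashMap<String, HashSet<NodeId>>,
}

impl LabelIndex {
    fn add(&mut self, label: &str, node: NodeId) {
        self.index.entry(label.to_string()).or_default().insert(node);
    }

    // O(1) point lookup on the label, then iterate only matching nodes.
    fn nodes_with_label(&self, label: &str) -> impl Iterator<Item = NodeId> + '_ {
        self.index.get(label).into_iter().flatten().copied()
    }
}
```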
§ColumnStore integration (late materialization)
node_columns and edge_columns provide columnar storage for
frequently accessed properties. In the traditional row store (PropertyMap
on each node), reading one property from 1000 nodes touches 1000 scattered
HashMaps. The columnar store groups all values of the same property
contiguously, enabling vectorized reads and better CPU cache utilization.
The query engine uses late materialization: scan operators produce
lightweight Value::NodeRef(id) references instead of cloning full nodes,
and properties are resolved on demand from the column store.
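The contrast can be sketched with a single-property column (a simplification; the real `node_columns` store is more general): all values of one property sit contiguously in one `Vec`, and the materialization step resolves the property only for the node IDs an operator actually emitted, instead of cloning whole nodes up front.

```rust
// Columnar storage for one property ("age"), indexed by node ID.
#[derive(Default)]
struct AgeColumn {
    values: Vec<Option<i64>>, // values[node_id] = that node's age, if set
}

impl AgeColumn {
    fn set(&mut self, node_id: usize, age: i64) {
        if self.values.len() <= node_id {
            self.values.resize(node_id + 1, None);
        }
        self.values[node_id] = Some(age);
    }

    /// Late materialization: resolve the property only for the node IDs a
    /// scan operator emitted, as one sequential pass over a contiguous Vec.
    fn materialize(&self, node_ids: &[usize]) -> Vec<Option<i64>> {
        node_ids
            .iter()
            .map(|&id| self.values.get(id).copied().flatten())
            .collect()
    }
}
```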
§GraphStatistics for cost-based optimization
The query planner uses GraphStatistics to estimate the cardinality
(number of rows) at each stage of a query plan, choosing the plan with
the lowest estimated cost. See the struct documentation for details.
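One way such estimates can work, as a hedged sketch (field and method names are assumptions, not the real `GraphStatistics` API): a label scan is estimated at the label's node count, and an expand step multiplies the input cardinality by the average out-degree E/V under a uniform-degree assumption.

```rust
use std::collections::HashMap;

// Illustrative statistics for cardinality estimation.
struct Stats {
    nodes_per_label: HashMap<String, u64>,
    total_nodes: u64,
    total_edges: u64,
}

impl Stats {
    // Estimated rows produced by scanning one label.
    fn estimate_label_scan(&self, label: &str) -> u64 {
        *self.nodes_per_label.get(label).unwrap_or(&0)
    }

    // Estimated rows after expanding each input row along outgoing edges,
    // using average out-degree = E / V (uniform assumption).
    fn estimate_expand(&self, input_rows: u64) -> u64 {
        if self.total_nodes == 0 {
            0
        } else {
            input_rows * self.total_edges / self.total_nodes
        }
    }
}
```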
§Key Rust patterns
- Vec<Vec<T>> arena: dense ID-indexed storage avoids HashMap overhead; the inner Vec holds MVCC versions.
- HashMap vs BTreeMap: HashMap is used for indices because we need O(1) point lookups (label equality), not ordered iteration. BTreeMap would add an unnecessary O(log n) factor.
- Arc: vector_index and property_index are wrapped in Arc (atomic reference counting) for shared ownership across the query engine and background index-maintenance tasks.
- thiserror: the GraphError enum uses the thiserror crate's derive macro to auto-generate Display and Error implementations, reducing boilerplate for error types.
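To make the thiserror point concrete, here is roughly what the derive expands to, written by hand with only the standard library (the variants are illustrative, not the real GraphError ones): a Display impl built from the `#[error("...")]` format strings, plus an Error impl.

```rust
use std::fmt;

// Hand-rolled equivalent of what thiserror's derive generates for an error
// enum. With thiserror this would be `#[derive(Error)]` plus `#[error(...)]`
// attributes per variant; the expansion is approximately the code below.
#[derive(Debug)]
enum GraphError {
    NodeNotFound(u64),
    EdgeNotFound(u64),
}

impl fmt::Display for GraphError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GraphError::NodeNotFound(id) => write!(f, "node {id} not found"),
            GraphError::EdgeNotFound(id) => write!(f, "edge {id} not found"),
        }
    }
}

impl std::error::Error for GraphError {}
```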
§Requirements coverage
- REQ-GRAPH-001: Property graph data model
- REQ-MEM-001: In-memory storage
- REQ-MEM-003: Memory-optimized data structures
Structs§
- GraphStatistics - Statistics about graph contents for cost-based query optimization.
- GraphStore - In-memory graph storage engine.
- PropertyStats - Statistics about a specific property.
Enums§
- GraphError - Errors that can occur during graph operations.